CN106056160B - User fault reporting prediction method under unbalanced IPTV data set - Google Patents

Info

Publication number
CN106056160B
Authority
CN
China
Prior art keywords
user
reporting
fault
odr
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610392603.XA
Other languages
Chinese (zh)
Other versions
CN106056160A (en)
Inventor
周亮
吴志峰
黄若尘
魏昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201610392603.XA
Publication of CN106056160A
Application granted
Publication of CN106056160B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/285 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a user fault-reporting prediction method under an unbalanced IPTV data set, which mainly comprises the following steps: (1) importing the IPTV user viewing records and extracting the numerical indexes; (2) averaging the viewing records of each user; (3) initializing a balance value beta; (4) removing non-fault-reporting samples with an ODR algorithm and adding artificial fault-reporting samples with a BSMOTE algorithm, both based on the Mahalanobis distance; (5) deleting newly added samples that have a negative influence on classification by using the TOMEK algorithm; (6) training an SVM classifier with adaptively variable kernel width on the reconstructed sample data set; (7) inputting the IPTV user data to be predicted into the detector of the trained SVM. Because the improved BSMOTE and ODR algorithms are based on the Mahalanobis distance, the information overlap caused by multiple correlations among variables is avoided, the influence of different dimensions among sample point attributes is avoided, a better sample data transformation effect is obtained, the interference of noise points and redundant points on fault-reporting prediction is weakened, and the prediction accuracy of the classifier is greatly improved.

Description

User fault reporting prediction method under unbalanced IPTV data set
Technical Field
The invention belongs to the technical field of IPTV data analysis and processing, and particularly relates to a user fault report prediction method under an unbalanced IPTV data set.
Background
With the rapid development of multimedia communication technology, Internet Protocol Television (IPTV), i.e. interactive network television delivered over broadband Internet, has made it much easier for ordinary residents to enjoy interactive, personalized, and freely customizable video services and value-added application services at home. However, when the Quality of Service (QoS) of the underlying network deteriorates during video transmission, for example in bandwidth, packet loss, delay, or jitter, the user's viewing experience is affected to some extent, which leads to user complaints and fault reports. Fault-reporting users make up only a small proportion of all users, so the user data inevitably form an unbalanced data set, and the imbalance ratio keeps increasing as IPTV technology matures.
Predicting whether a user will report a fault is a typical binary classification problem. Mature algorithms for this problem include the Support Vector Machine (SVM), but the classification performance of an SVM degrades as the degree of data imbalance increases. The usual remedy is therefore to convert the unbalanced data into a balanced data set with a data-level algorithm and then classify the balanced data with an SVM classifier. Conventional data-level algorithms typically process the data with the Euclidean-distance-based oversampling algorithm BSMOTE (Borderline Synthetic Minority Oversampling Technique) or the Euclidean-distance-based undersampling algorithm ODR (optimization of refining reduction). Although these algorithms can improve prediction accuracy, they cannot avoid the information overlap caused by multiple correlations among variables, nor can they ensure that the generated artificial sample points are not noise points.
Chinese granted patent CN102254177B, titled 'An unbalanced data SVM bearing fault detection method', provides a bearing fault detection method for unbalanced data using an SVM. Its shortcomings are: (1) the Euclidean distance used in the algorithm is easily affected by the different dimensions of the sample point attributes; (2) there is no effective way to remove the impure artificial sample points generated by the BSMOTE algorithm; (3) the potential of the kernel width in the SVM algorithm for improving classification accuracy is not fully exploited.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art: algorithms that are easily affected by the different dimensions of sample point attributes, the lack of an effective method for removing impure artificial sample points, and the failure to fully exploit the kernel width of the SVM algorithm for improving classification accuracy.
Therefore, the invention provides a user fault reporting prediction method under an unbalanced IPTV data set. The method comprises the following steps:
Step 1: Import the IPTV user viewing records, which contain information such as the user id, indexes, and fault-reporting time; extract only the numerical indexes, denoted by the variable $z$.
Let the total number of imported IPTV users be $N$ and the total number of records be $D$, of which $N_1$ users reported a fault and $N_2$ users did not; the $n$-th user has $D_n$ records ($n = 1, \ldots, N$). Each record has $Q$ numerical indexes, denoted $z_1, z_2, \ldots, z_Q$, and the value of each index $z_q$ is
Figure BDA0001010080160000021
Step 2: Obtain the averaged record $g_n$ ($n = 1, \ldots, N$) of each user, as follows:
Calculate the mean of each of the $Q$ indexes of the $n$-th user:
Figure BDA0001010080160000022
Figure BDA0001010080160000023
That is, after preprocessing each user retains only one record
Figure BDA0001010080160000024
The data set formed by the $N_1$ minority (fault-reporting) users is
Figure BDA0001010080160000025
The data set formed by the $N_2$ majority (non-fault-reporting) users is
Figure BDA0001010080160000026
The data set of all users is $G = G_{min} \cup G_{maj}$.
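The averaging in step 2 can be illustrated with a minimal Python sketch. The record layout (a user id paired with a list of Q numerical indexes), the function name `average_per_user`, and the toy data are assumptions made for illustration; they do not come from the patent.

```python
from collections import defaultdict


def average_per_user(records):
    """Collapse each user's viewing records into one mean vector g_n.

    `records` is an iterable of (user_id, [z_1, ..., z_Q]) pairs
    (a hypothetical layout chosen for this sketch).
    """
    sums = {}
    counts = defaultdict(int)
    for user_id, z in records:
        if user_id not in sums:
            sums[user_id] = [0.0] * len(z)
        for q, value in enumerate(z):
            sums[user_id][q] += value
        counts[user_id] += 1  # running value of D_n for this user
    # Divide each per-index sum by the user's record count D_n.
    return {u: [s / counts[u] for s in total] for u, total in sums.items()}


# Hypothetical toy data: two users, Q = 2 indexes per record.
records = [("u1", [1.0, 10.0]), ("u1", [3.0, 30.0]), ("u2", [5.0, 50.0])]
averaged = average_per_user(records)
```

After this step each user is represented by exactly one vector, which is what the later distance computations operate on.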
Step 3: Initialize the balance value $\beta$ of the Mahalanobis-distance-based ODR algorithm.
When choosing $\beta$: if it is too small, the reduction of the majority class is not significant; if it is too large, valuable majority-class samples are likely to be deleted by mistake. The value range is $0.2 \le \beta \le 0.5$.
Step 4: Use the Mahalanobis-distance-based BSMOTE algorithm to determine the added artificial fault-reporting user sample set $Y_{bsmote}$ and the balance value $\alpha$ of the BSMOTE algorithm. Then use the Mahalanobis-distance-based ODR algorithm to determine the removed non-fault-reporting user sample set $Y_{odr}$, so as to obtain the balanced data set $G_{smote+odr}$.
(4-1) Determine the added artificial fault-reporting user sample set $Y_{bsmote}$ using Mahalanobis-distance-based BSMOTE.
(4-1-1) Calculate the Mahalanobis distance $d(g_i, g_j)$ between each fault-reporting user's data $g_i \in G_{min}$ and every other user's data $g_j \in G$ ($g_j \ne g_i$):
Figure BDA0001010080160000031
Here $\Sigma^{-1}$ is the inverse of the covariance matrix $\Sigma$ of the total user data set $G$.
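As a rough illustration of step (4-1-1), the sketch below computes the Mahalanobis distance in its standard form, $d(g_i, g_j) = \sqrt{(g_i - g_j)^T \Sigma^{-1} (g_i - g_j)}$. The patent's own formula is only available as an image, so the standard definition is assumed here. The helper names `covariance`, `solve`, and `mahalanobis` are hypothetical, and the code assumes a non-singular covariance matrix.

```python
import math


def covariance(data):
    """Sample covariance matrix of the row vectors in `data`."""
    n, q = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(q)]
    return [[sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in data) / (n - 1)
             for b in range(q)] for a in range(q)]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mahalanobis(g_i, g_j, cov):
    """Standard Mahalanobis distance between two vectors for covariance `cov`."""
    diff = [a - b for a, b in zip(g_i, g_j)]
    weighted = solve(cov, diff)  # Sigma^{-1} * diff without forming the inverse
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


# With an identity covariance the result reduces to the Euclidean distance.
d_identity = mahalanobis([0.0, 0.0], [3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]])
# A larger variance on the first index shrinks distances along that index.
d_scaled = mahalanobis([2.0, 0.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
```

Dividing by the covariance is what removes the effect of different units and of correlated indexes, which is the motivation the text gives for replacing the Euclidean distance.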
(4-1-2) Based on $d(g_i, g_j)$, use the K-Nearest Neighbor (K-NN) algorithm to determine the nearest-neighbor sample set $G_{n-KNN}$ of the $n$-th ($n = 1, \ldots, N_1$) fault-reporting user, and determine the sample set to which that user belongs.
Choose an odd number $K_1$ for the K-NN algorithm and count the fault-reporting users in the nearest-neighbor sample set.
If the sample satisfies
Figure BDA0001010080160000032
then the fault-reporting user sample is assigned to the Border sample set $G_{Border}$.
If $G_{n-KNN} \cap G_{maj} = \varnothing$, the fault-reporting user sample is assigned to the Safe sample set $G_{Safe}$.
If $|G_{n-KNN} \cap G_{maj}| = K_1$, the fault-reporting user sample is assigned to the Noise sample set $G_{Noise}$.
Here $G_{n-KNN}$ is the set of the $K_1$ sample points nearest to the sample point of the $n$-th fault-reporting user, with
Figure BDA0001010080160000033
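The partition in step (4-1-2) can be sketched as follows. The Safe and Noise conditions follow the text; the Border condition is only available as an image, so this sketch assumes that every minority sample that is neither Safe nor Noise is treated as Border. The function name `partition_minority`, the distance callback, and the one-dimensional toy data are likewise assumptions.

```python
def partition_minority(minority, majority, k1, dist):
    """Assign each minority sample index to 'border', 'safe' or 'noise'."""
    labelled = [(g, "min") for g in minority] + [(g, "maj") for g in majority]
    groups = {"border": [], "safe": [], "noise": []}
    for idx, g in enumerate(minority):
        # Exclude the sample itself, then keep its k1 nearest neighbours.
        others = [item for pos, item in enumerate(labelled) if pos != idx]
        nearest = sorted(others, key=lambda item: dist(g, item[0]))[:k1]
        n_maj = sum(1 for _, label in nearest if label == "maj")
        if n_maj == 0:
            groups["safe"].append(idx)    # no majority neighbours
        elif n_maj == k1:
            groups["noise"].append(idx)   # only majority neighbours
        else:
            groups["border"].append(idx)  # assumption: everything in between
    return groups


# Hypothetical 1-D toy data; a real run would pass a Mahalanobis distance.
minority = [[0.0], [0.1], [0.2], [0.3], [2.0], [5.0]]
majority = [[2.1], [4.9], [5.1], [5.2]]
groups = partition_minority(minority, majority, 3, lambda a, b: abs(a[0] - b[0]))
```

Only the Border samples are used as seeds for synthetic samples in the following steps.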
(4-1-3) For each $g_p$ ($g_p \in G_{Border}$), find its set of $K_2$ random nearest neighbors in $G_{min}$
Figure BDA0001010080160000034
and calculate, between $g_p$ and
Figure BDA0001010080160000035
the attribute differences $h_{pk}$.
For each fault-reporting user sample $g_p$ in $G_{Border} = \{g_1, \ldots, g_p, \ldots, g_P\}$, find its $K_2$ random nearest neighbors in the sample set $G_{min}$
Figure BDA0001010080160000036
where $P$ is the total number of samples in the Border sample set. Calculate all attribute differences $h_{pk}$ between each sample $g_{pk}$ and the fault-reporting user sample $g_p$:
$h_{pk} = g_p - g_{pk}, \quad p = 1, \ldots, P;\ k = 1, \ldots, K_2$
(4-1-4) Generate an artificial fault-reporting sample set $Y_p$ for every $g_p$ ($g_p \in G_{Border}$).
If $g_{pk} \in G_{Noise}$ or $g_{pk} \in G_{Safe}$, $h_{pk}$ is multiplied by a random number $r_{pk} \in (0, 0.5)$. If $g_{pk} \in G_{Border}$, $h_{pk}$ is multiplied by a random number $r_{pk} \in (0, 1)$. The artificial sample $y_{pk}$ generated for each $g_p$ is then
$y_{pk} = g_k + |r_{pk} \times h_{pk}|, \quad p = 1, \ldots, P;\ k = 1, \ldots, K_2$
The finally generated artificial fault-reporting user sample set is:
Figure BDA0001010080160000037
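The generation step can be illustrated with the sketch below. It deliberately swaps in the standard SMOTE interpolation, which places the new point on the segment from $g_p$ toward the neighbor $g_{pk}$, instead of the formula as printed in the text. Only the choice of random range, (0, 1) for Border neighbors and (0, 0.5) otherwise, is taken from the description. The function name `synthesize` and the toy inputs are hypothetical.

```python
import random


def synthesize(g_p, neighbours, neighbour_groups, rng):
    """Create one synthetic sample per neighbour of the Border sample g_p."""
    samples = []
    for g_pk, group in zip(neighbours, neighbour_groups):
        # Border neighbours use the full range, Safe/Noise neighbours half of it.
        upper = 1.0 if group == "border" else 0.5
        # Note: uniform() may return the endpoints, unlike an open interval.
        r = rng.uniform(0.0, upper)
        # Standard SMOTE step (an assumption): move from g_p toward g_pk.
        samples.append([a + r * (b - a) for a, b in zip(g_p, g_pk)])
    return samples


rng = random.Random(0)  # fixed seed so the sketch is repeatable
new_samples = synthesize([0.0, 0.0], [[2.0, 2.0], [4.0, 0.0]], ["border", "safe"], rng)
```

Limiting the range to (0, 0.5) for Safe or Noise neighbors keeps the synthetic point closer to the Border sample than to the less reliable neighbor.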
(4-1-5) Repeat steps (4-1-3) and (4-1-4) to compute the newly added sample set $Y_p$ ($p = 1, \ldots, P$) of each fault-reporting user in $G_{Border}$, and determine the balance value $\alpha$ of the BSMOTE algorithm, until the total number of newly added fault-reporting samples in the generated $Y_{bsmote} = \{Y_1, \ldots, Y_P\}$ is greater than or equal to $(1-\beta)N_2 - N_1$.
Here the balance value $\alpha$ is the smallest integer greater than or equal to
Figure BDA0001010080160000041
(4-2) Determine the removed non-fault-reporting user sample set $Y_{odr}$ using Mahalanobis-distance-based ODR.
(4-2-1) Calculate the Mahalanobis distance $d(g_m, g_l)$ between each non-fault-reporting user's data $g_m$ ($g_m \in G_{maj}$) and every other user's data $g_l$ ($g_l \in G$; $g_l \ne g_m$).
(4-2-2) Based on $d(g_m, g_l)$, compute the association set of each sample $g_m$ in $G_{maj}$:
Figure BDA0001010080160000042
The association set $C_m$ is defined as the set of samples in $G_{maj}$, other than $g_m$, whose $K_3$ nearest neighbors include $g_m$. Let $g_{mn}$ denote a sample point $g_n$ ($g_n \in G_{maj}$) whose $K_3$ nearest neighbors include $g_m$; the set $C_m$ of all such $g_{mn}$ is the association set of the sample point $g_m$.
(4-2-3) Classify $g_m$ according to how its presence or absence affects the accuracy of the $K_4$-NN algorithm on $g_{mn}$ ($g_{mn} \in C_m$).
Choose an odd number $K_4$. Compute $Num_p$, the number of $g_{mn}$ ($g_{mn} \in C_m$) correctly classified by the $K_4$-NN algorithm when $g_m$ is present. Then compute $Num_{no-p}$, the number of $g_{mn}$ ($g_{mn} \in C_m$) correctly classified by $K_4$-NN when $g_m$ is absent. Compare $Num_p$ and $Num_{no-p}$ and classify $g_m$ by the following criteria:
If $Num_p < Num_{no-p}$, $g_m$ has a negative effect and is assigned to the Noise sample set $S_{Noise}$.
If $Num_p = Num_{no-p}$, $g_m$ is dispensable and is assigned to the Safe sample set $S_{Safe}$.
If $Num_p > Num_{no-p}$, $g_m$ is useful and is assigned to the Save sample set $S_{Save}$.
(4-2-4) Delete samples from $S_{Noise}$ first, then from $S_{Safe}$, until the non-fault-reporting sample set satisfies the condition; finally output the whole processed data set $G_{smote+odr}$.
Define $Y_{odr}$ as the set of deleted non-fault-reporting sample points; deleted points are taken first from $S_{Noise}$ and then from $S_{Safe}$. The total number of samples in $Y_{odr}$ is greater than or equal to $\beta N_2$, i.e. the processed non-fault-reporting sample set $\{G_{maj} - Y_{odr}\}$ contains at most $(1-\beta)N_2$ samples.
After the Mahalanobis-distance-based ODR and BSMOTE algorithms, the whole data set $G_{smote+odr}$ is:
$G_{smote+odr} = \{G_{maj} - Y_{odr}\} + \{G_{min} + Y_{bsmote}\}$
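The classification and deletion order of steps (4-2-3) and (4-2-4) can be sketched as below, assuming the three comparison rules are read as strictly less than, equal, and strictly greater than so that they do not overlap. The function names and the toy sample identifiers are hypothetical, and the counts $Num_p$ and $Num_{no-p}$ are taken as given rather than computed with a $K_4$-NN classifier.

```python
import math
from fractions import Fraction


def classify_majority(num_p, num_no_p):
    """Label a majority sample from the K4-NN hit counts with and without it."""
    if num_p < num_no_p:
        return "noise"  # its presence lowers accuracy on the association set
    if num_p == num_no_p:
        return "safe"   # its presence makes no difference
    return "save"       # its presence helps, so it is kept


def select_deletions(s_noise, s_safe, n2, beta):
    """Pick samples to delete: S_Noise first, then S_Safe, at least beta * N2.

    If fewer candidates exist than the target, all of them are returned.
    """
    target = math.ceil(Fraction(str(beta)) * n2)
    candidates = list(s_noise) + list(s_safe)
    return candidates[:target]


labels = [classify_majority(3, 5), classify_majority(4, 4), classify_majority(6, 2)]
deleted = select_deletions(["n1", "n2"], ["s1", "s2", "s3"], 10, 0.3)
```

Samples labelled "save" are never candidates for deletion, which is how the method tries to keep informative majority samples.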
Step 5: Clean the data set $G_{smote+odr}$ with the TOMEK algorithm to obtain the cleaned data $G_{smote+odr+tomek}$.
(5-1) Initialize the set $G_{smote+odr+tomek}$.
(5-2) Randomly draw a sample point $g_i$ from $G_{smote+odr}$ and find its nearest neighbor $g_j$ ($g_j \ne g_i$) in $G_{smote+odr}$.
(5-3) Find the nearest neighbor $g_k$ ($g_k \ne g_j$) of $g_j$ in $G_{smote+odr}$.
(5-4) Check whether $g_i = g_k$. If so, continue with (5-5); otherwise set $g_i = g_j$, $g_j = g_k$ and jump to step (5-3).
(5-5) Check whether the user classes (fault-reporting or non-fault-reporting) of the mutually nearest pair $g_i$ and $g_j$ are the same. If they are, save both sample points to the sample set $G_{smote+odr+tomek}$ and then delete them from $G_{smote+odr}$. If the classes differ, delete both sample points directly from $G_{smote+odr}$.
(5-6) Check whether the number of samples in $G_{smote+odr}$ is an even number greater than 0. If so, repeat from step (5-2); otherwise terminate.
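The core idea of step 5 can be sketched in a simplified, single-pass form: two points form a Tomek link when they are each other's nearest neighbor and belong to different classes, and both are dropped. This is not the iterative walk described above, and the function name `remove_tomek_links`, the distance callback, and the one-dimensional toy data are assumptions for illustration.

```python
def remove_tomek_links(points, labels, dist):
    """Drop both members of every Tomek link and return the survivors.

    Requires at least two points. A Tomek link is a pair of mutual nearest
    neighbours whose class labels differ.
    """
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and labels[i] != labels[j]:
            drop.update((i, j))
    return [(p, l) for k, (p, l) in enumerate(zip(points, labels)) if k not in drop]


# Hypothetical toy data: the pair at 1.0 / 1.05 is a cross-class Tomek link.
points = [[0.0], [0.1], [1.0], [1.05], [3.0]]
labels = ["maj", "maj", "maj", "min", "min"]
cleaned = remove_tomek_links(points, labels, lambda a, b: abs(a[0] - b[0]))
```

Other Tomek-link cleaners often remove only the majority member of a link; removing both members follows the description in step (5-5).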
Step 6: g is to besmote+odr+tomekThe data in the step (a) is brought into an SVM classifier for training, the kernel width sigma of the SVM classifier is adaptively adjusted by combining the step length of the thickness and the step length, an approximate optimal global point is searched, and the corresponding sigma is determinedoptimal
(6-1) determining the Kernel function of the SVM classifier as a Gaussian Kernel function
Figure BDA0001010080160000061
Figure BDA0001010080160000062
Wherein g isx∈Gsmote+odr+tomek
Figure BDA0001010080160000063
Is gxσ is the gaussian kernel width.
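For orientation, one common parameterization of the Gaussian kernel is $K(g_x, g_y) = \exp(-\lVert g_x - g_y \rVert^2 / (2\sigma^2))$. The patent's exact expression is only available as an image and may differ by a constant factor in the exponent, so the sketch below is an assumption rather than the patent's formula; the function name `gaussian_kernel` is hypothetical.

```python
import math


def gaussian_kernel(g_x, g_y, sigma):
    """Gaussian (RBF) kernel with width sigma, in the 2*sigma^2 convention."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(g_x, g_y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# A wider kernel makes distant points look more similar.
narrow = gaussian_kernel([0.0, 0.0], [1.0, 0.0], 0.2)
wide = gaussian_kernel([0.0, 0.0], [1.0, 0.0], 2.0)
```

This dependence on $\sigma$ is why the kernel width is tuned in the following steps.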
(6-2) Determine the model accuracy evaluation criteria, the geometric mean G-mean and the F-measure:
Based on the confusion matrix of the classified sample set, the fault-reporting recall Recall_Min, the fault-reporting precision Precision_Min, the non-fault-reporting recall Recall_Maj, the geometric mean G-mean, and the F-measure are expressed respectively as:
Figure BDA0001010080160000064
Figure BDA0001010080160000065
Figure BDA0001010080160000066
G-mean maximizes classification accuracy while keeping the accuracy on fault-reporting and non-fault-reporting users balanced; that is, G-mean is large only when Recall_Min and Recall_Maj are both high. The F-measure is a classification evaluation index that jointly considers recall and precision. It reflects the classifier's performance on both fault-reporting and non-fault-reporting users, but places more weight on the classification of fault-reporting samples.
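The evaluation criteria can be sketched with their usual textbook definitions: G-mean as the geometric mean of the two recalls, and the F-measure as the harmonic mean of minority precision and recall. The patent's own expressions are only available as images, so these definitions, the function name `imbalance_metrics`, and the confusion-matrix counts are assumptions.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Metrics for the minority (fault-reporting) class as the positive class.

    tp: fault-reporting users predicted as fault-reporting
    fn: fault-reporting users predicted as non-fault-reporting
    fp: non-fault-reporting users predicted as fault-reporting
    tn: non-fault-reporting users predicted as non-fault-reporting
    """
    recall_min = tp / (tp + fn)
    precision_min = tp / (tp + fp)
    recall_maj = tn / (tn + fp)
    g_mean = math.sqrt(recall_min * recall_maj)
    f_measure = 2 * precision_min * recall_min / (precision_min + recall_min)
    return {"recall_min": recall_min, "precision_min": precision_min,
            "recall_maj": recall_maj, "g_mean": g_mean, "f_measure": f_measure}


# Hypothetical confusion-matrix counts.
metrics = imbalance_metrics(tp=90, fn=10, fp=30, tn=870)
```

Because G-mean multiplies the two recalls, a classifier that ignores the minority class scores zero even if its overall accuracy is high.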
(6-3) Initialize the penalty factor $C$, the kernel width $\sigma$, the maximum kernel width $\sigma_{max}$, and the coarse step size of the SVM classifier, then run the SVM classifier to obtain the optimal local points of G-mean and F-measure.
Change $\sigma$ by the coarse step size, updating the optimal local point each time a better SVM classification result is obtained, until $\sigma > \sigma_{max}$. Then select the best of the local points.
(6-4) Starting from the left side of the optimal local point, adaptively change the kernel width $\sigma$ in fine steps; when G-mean and F-measure reach the approximately optimal global point, obtain the corresponding approximately optimal kernel width $\sigma_{optimal}$ and output the classification result.
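The coarse-then-fine search of steps (6-3) and (6-4) can be sketched as below. The scoring callback stands in for training the SVM and measuring G-mean or F-measure; the toy score function, the function name `coarse_to_fine_search`, and the choice to scan one coarse step on either side of the coarse optimum are assumptions made for this sketch.

```python
def coarse_to_fine_search(score, sigma_start, sigma_max, coarse, fine):
    """Return (sigma, score) found by a coarse sweep followed by a fine sweep."""
    # Coarse sweep: sigma_start, sigma_start + coarse, ..., up to sigma_max.
    n_coarse = int(round((sigma_max - sigma_start) / coarse))
    grid = [sigma_start + i * coarse for i in range(n_coarse + 1)]
    best = max(grid, key=score)
    # Fine sweep: start one coarse step to the left of the coarse optimum
    # and move right in fine steps to one coarse step beyond it.
    left = max(best - coarse, fine)
    n_fine = int(round(2 * coarse / fine))
    fine_grid = [left + i * fine for i in range(n_fine + 1)]
    fine_grid = [s for s in fine_grid if s <= sigma_max]
    best = max(fine_grid + [best], key=score)
    return best, score(best)


def toy_score(sigma):
    # Hypothetical stand-in for "train the SVM with this sigma and score it".
    return 1.0 - (sigma - 0.21) ** 2


sigma_opt, best_score = coarse_to_fine_search(toy_score, 0.1, 2.0, 0.1, 0.01)
```

The coarse sweep needs about 20 evaluations and the fine sweep about 20 more, compared with roughly 190 for a single sweep at the fine resolution over the same range, which is the efficiency argument for combining the two step sizes.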
Step 7: Input the IPTV user data to be predicted into the detector of the trained SVM to predict whether the user will report a fault, thereby providing early warning of IPTV fault-reporting users.
Further, in the step 1, the numerical indicator is extracted from the record including the user id, the indicator, and the failure report time information.
Furthermore, in step 3, the value range of the balance value is $0.2 \le \beta \le 0.5$.
Further, the confusion matrix of the classification sample set in step 6-2 is:
Figure BDA0001010080160000071
Compared with the prior art, the invention has the following beneficial effects:
The improved BSMOTE and ODR algorithms adopted in the invention are both based on the Mahalanobis distance, which avoids the information overlap caused by multiple correlations among variables as well as the influence of different dimensions among sample point attributes, and yields a better sample data transformation effect.
The BSMOTE and ODR algorithms and the TOMEK data-cleaning algorithm adopted in the invention weaken the interference of noise points and redundant points on fault-reporting prediction and strengthen the contribution of the effective minority sample points to correct classification. At the same time, impure points that are generated by the BSMOTE algorithm and are hard to distinguish near the SVM classification boundary are eliminated, and the prediction accuracy of the classifier is greatly improved.
The method combines coarse and fine step sizes to adaptively adjust the kernel width $\sigma$ of the SVM classifier, which significantly improves prediction accuracy at the cost of a small loss in the precision of $\sigma$ while keeping the algorithm efficient.
Drawings
Fig. 1 is a flowchart of a user failure prediction method under an unbalanced IPTV data set according to the present invention.
Fig. 2 is a flowchart of the adaptive variable kernel width SVM according to the present invention.
Fig. 3 is a diagram illustrating the failure reporting prediction results of the standard SVM and the conventional ODR-BSMOTE-SVM according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a failure prediction result of the improved algorithm according to the embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
To better illustrate the user fault-reporting prediction method under an unbalanced IPTV data set, the method is applied to early warning of IPTV fault reports. The training and test data used in the invention come from telecom IPTV users across Jiangsu province: 439050 users with 4723101 viewing records in total. Of these, 4871 are fault-reporting users with 48172 viewing records, and 434179 are non-fault-reporting users with 4674929 viewing records, so the imbalance ratio between the minority and majority classes is as high as 1:89. Each viewing record has 10 numerical indexes. In this example $K_1 = K_3 = K_4 = 5$, $K_2 = 3$, the initial balance value is $\beta = 0.3$, and the penalty factor is $C = 1000$.
Following the flow of the invention (shown in Fig. 1), the fault-reporting user prediction proceeds as follows.
Step 1: Import the IPTV user viewing records, which contain information such as the user id, indexes, and fault-reporting time; extract only the numerical indexes, denoted by the variable $z$.
In this example the total number of imported IPTV users is $N = 439050$ and the total number of records is $D = 4723101$, with $N_1 = 4871$ fault-reporting users and $N_2 = 434179$ non-fault-reporting users. The $n$-th user has $D_n$ records ($n = 1, \ldots, 439050$). Each record has $Q = 10$ numerical indexes, denoted $z_1, z_2, \ldots, z_{10}$, and the value of each index $z_q$ is
Figure BDA0001010080160000081
Step 2: Obtain the averaged record $g_n$ ($n = 1, \ldots, N$) of each user, as follows:
Calculate the mean of each of the $Q$ indexes of the $n$-th user:
Figure BDA0001010080160000082
Figure BDA0001010080160000083
After preprocessing, each user retains only one record
Figure BDA0001010080160000084
The data set formed by the $N_1 = 4871$ minority (fault-reporting) users is
Figure BDA0001010080160000085
The data set formed by the $N_2 = 434179$ majority (non-fault-reporting) users is
Figure BDA0001010080160000086
The data set of all users is $G = G_{min} \cup G_{maj}$.
Step 3: Initialize the balance value $\beta = 0.3$ of the Mahalanobis-distance-based ODR algorithm.
Step 4: Use the Mahalanobis-distance-based BSMOTE algorithm to determine the added artificial fault-reporting user sample set $Y_{bsmote}$ and the balance value $\alpha$ of the BSMOTE algorithm. Then use the Mahalanobis-distance-based ODR algorithm to determine the removed non-fault-reporting user sample set $Y_{odr}$, so as to obtain the balanced data set $G_{smote+odr}$.
(4-1) Determine the added artificial fault-reporting user sample set $Y_{bsmote}$ using Mahalanobis-distance-based BSMOTE.
(4-1-1) Calculate the Mahalanobis distance $d(g_i, g_j)$ between each fault-reporting user's data $g_i \in G_{min}$ and every other user's data $g_j \in G$ ($g_j \ne g_i$):
Figure BDA0001010080160000091
Here $\Sigma^{-1}$ is the inverse of the covariance matrix $\Sigma$ of the total user data set $G$.
(4-1-2) Based on $d(g_i, g_j)$, use the K-Nearest Neighbor (K-NN) algorithm to determine the nearest-neighbor sample set $G_{n-KNN}$ of the $n$-th ($n = 1, \ldots, N_1$) fault-reporting user, and determine the sample set to which that user belongs.
Choose the odd number $K_1 = 5$ for the K-NN algorithm and count the fault-reporting users in the nearest-neighbor sample set.
If the sample satisfies
Figure BDA0001010080160000092
then the fault-reporting user sample is assigned to the Border sample set $G_{Border}$.
If $G_{n-KNN} \cap G_{maj} = \varnothing$, the fault-reporting user sample is assigned to the Safe sample set $G_{Safe}$.
If $|G_{n-KNN} \cap G_{maj}| = 5$, the fault-reporting user sample is assigned to the Noise sample set $G_{Noise}$.
(4-1-3) For each $g_p$ ($g_p \in G_{Border}$), find its set of $K_2 = 3$ random nearest neighbors in $G_{min}$
Figure BDA0001010080160000093
and calculate, between $g_p$ and
Figure BDA0001010080160000094
the attribute differences $h_{pk}$.
For each fault-reporting user sample $g_p$ in $G_{Border} = \{g_1, \ldots, g_p, \ldots, g_P\}$, find its $K_2$ random nearest neighbors in the sample set $G_{min}$
Figure BDA0001010080160000095
Calculate all attribute differences $h_{pk}$ between each sample $g_{pk}$ and the fault-reporting user sample $g_p$:
$h_{pk} = g_p - g_{pk}, \quad p = 1, \ldots, P;\ k = 1, \ldots, K_2$
(4-1-4) Generate an artificial fault-reporting sample set $Y_p$ for every $g_p$ ($g_p \in G_{Border}$).
If $g_{pk} \in G_{Noise}$ or $g_{pk} \in G_{Safe}$, $h_{pk}$ is multiplied by a random number $r_{pk} \in (0, 0.5)$. If $g_{pk} \in G_{Border}$, $h_{pk}$ is multiplied by a random number $r_{pk} \in (0, 1)$. The artificial sample $y_{pk}$ generated for each $g_p$ is then
$y_{pk} = g_k + |r_{pk} \times h_{pk}|, \quad p = 1, \ldots, P;\ k = 1, \ldots, K_2$
The finally generated artificial fault-reporting user sample set is:
Figure BDA0001010080160000101
(4-1-5) Repeat steps (4-1-3) and (4-1-4) to compute the newly added sample set $Y_p$ ($p = 1, \ldots, P$) of each fault-reporting user in $G_{Border}$, and determine the balance value $\alpha$ of the BSMOTE algorithm, until the total number of newly added fault-reporting samples in the generated $Y_{bsmote} = \{Y_1, \ldots, Y_P\}$ is greater than or equal to $(1-\beta)N_2 - N_1 = 299054$.
Here the balance value $\alpha$ is the smallest integer greater than or equal to
Figure BDA0001010080160000102
(4-2) Determine the removed non-fault-reporting user sample set $Y_{odr}$ using Mahalanobis-distance-based ODR.
(4-2-1) Calculate the Mahalanobis distance $d(g_m, g_l)$ between each non-fault-reporting user's data $g_m$ ($g_m \in G_{maj}$) and every other user's data $g_l$ ($g_l \in G$; $g_l \ne g_m$).
(4-2-2) Based on $d(g_m, g_l)$, compute the association set of each sample $g_m$ in $G_{maj}$:
Figure BDA0001010080160000103
(4-2-3) Classify $g_m$ according to how its presence or absence affects the accuracy of the $K_4$-NN algorithm on $g_{mn}$ ($g_{mn} \in C_m$).
Choose the odd number $K_4 = 5$. Compute $Num_p$, the number of $g_{mn}$ ($g_{mn} \in C_m$) correctly classified by the $K_4$-NN algorithm when $g_m$ is present. Then compute $Num_{no-p}$, the number of $g_{mn}$ ($g_{mn} \in C_m$) correctly classified by $K_4$-NN when $g_m$ is absent. Compare $Num_p$ and $Num_{no-p}$ and classify $g_m$ by the following criteria:
If $Num_p < Num_{no-p}$, $g_m$ has a negative effect and is assigned to the Noise sample set $S_{Noise}$.
If $Num_p = Num_{no-p}$, $g_m$ is dispensable and is assigned to the Safe sample set $S_{Safe}$.
If $Num_p > Num_{no-p}$, $g_m$ is useful and is assigned to the Save sample set $S_{Save}$.
(4-2-4) Delete samples from $S_{Noise}$ first, then from $S_{Safe}$, until the non-fault-reporting sample set satisfies the condition; finally output the whole processed data set $G_{smote+odr}$.
Define $Y_{odr}$ as the set of deleted non-fault-reporting sample points; deleted points are taken first from $S_{Noise}$ and then from $S_{Safe}$. The total number of samples in $Y_{odr}$ is greater than or equal to $\beta N_2 = 130254$, i.e. the processed non-fault-reporting sample set $\{G_{maj} - Y_{odr}\}$ contains at most $(1-\beta)N_2 = 303925$ samples.
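The thresholds quoted in this example can be checked with a few lines of exact arithmetic. The rounding directions, ceiling for the number of deleted samples and floor for the other two quantities, are assumptions chosen because they reproduce the figures stated in the text; the variable names are illustrative.

```python
import math
from fractions import Fraction

N1, N2 = 4871, 434179      # fault-reporting / non-fault-reporting users
beta = Fraction(3, 10)     # balance value 0.3, kept exact to avoid float error

total_users = N1 + N2
imbalance = N2 / N1                                  # majority samples per minority sample
min_deleted = math.ceil(beta * N2)                   # beta * N2 = 130253.7
max_majority_left = math.floor((1 - beta) * N2)      # (1 - beta) * N2 = 303925.3
min_synthetic = math.floor((1 - beta) * N2 - N1)     # 299054.3 before rounding
```

The unrounded values are not integers, so the stated counts 130254, 303925, and 299054 are rounded versions of $\beta N_2$, $(1-\beta)N_2$, and $(1-\beta)N_2 - N_1$.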
After the Mahalanobis-distance-based ODR and BSMOTE algorithms, the whole data set $G_{smote+odr}$ is:
$G_{smote+odr} = \{G_{maj} - Y_{odr}\} + \{G_{min} + Y_{bsmote}\}$
Step 5: Clean the data set $G_{smote+odr}$ with the TOMEK algorithm to obtain the cleaned data $G_{smote+odr+tomek}$.
(5-1) Initialize the set $G_{smote+odr+tomek}$.
(5-2) Randomly draw a sample point $g_i$ from $G_{smote+odr}$ and find its nearest neighbor $g_j$ ($g_j \ne g_i$) in $G_{smote+odr}$.
(5-3) Find the nearest neighbor $g_k$ ($g_k \ne g_j$) of $g_j$ in $G_{smote+odr}$.
(5-4) Check whether $g_i = g_k$. If so, continue with (5-5); otherwise set $g_i = g_j$, $g_j = g_k$ and jump to step (5-3).
(5-5) Check whether the user classes (fault-reporting or non-fault-reporting) of the mutually nearest pair $g_i$ and $g_j$ are the same. If they are, save both sample points to the sample set $G_{smote+odr+tomek}$ and then delete them from $G_{smote+odr}$. If the classes differ, delete both sample points directly from $G_{smote+odr}$.
(5-6) Check whether the number of samples in $G_{smote+odr}$ is an even number greater than 0. If so, repeat from step (5-2); otherwise terminate.
Step 6: Feed the data in $G_{smote+odr+tomek}$ into the SVM classifier for training, adaptively adjust the kernel width $\sigma$ of the SVM classifier by combining coarse and fine step sizes, search for the approximately optimal global point, and determine the corresponding $\sigma_{optimal}$.
(6-1) Choose the Gaussian kernel function as the kernel function of the SVM classifier:
Figure BDA0001010080160000121
Figure BDA0001010080160000122
(6-2) Determine the model accuracy evaluation criteria, the geometric mean G-mean and the F-measure.
(6-3) Initialize the penalty factor $C$, the kernel width $\sigma$, the maximum kernel width $\sigma_{max}$, and the coarse step size of the SVM classifier, then run the SVM classifier to obtain the optimal local points of G-mean and F-measure.
In this example the penalty factor of the SVM classifier is initialized to $C = 1000$, the kernel width to $\sigma = 0.1$, the maximum kernel width to $\sigma_{max} = 2$, and the coarse step size to 0.1. $\sigma$ is changed by the coarse step size, and the optimal local point is updated each time a better SVM classification result is obtained, until $\sigma > 2$; this yields the optimal local points of G-mean and F-measure.
(6-4) Starting from the left side of the optimal local point, adaptively change the kernel width $\sigma$ in fine steps; when G-mean and F-measure reach the approximately optimal global point, obtain the corresponding approximately optimal kernel width $\sigma_{optimal}$ and output the classification result.
As shown in the flowchart of Fig. 2, with the fine step size set to 0.01, $\sigma$ is varied starting from the left side of the optimal local point $\sigma = 0.2$ obtained above, finally giving the approximately optimal global point and the corresponding $\sigma_{optimal} = 0.21$.
Step 7: Input the IPTV user data to be predicted into the detector of the trained SVM to predict whether the user will report a fault, thereby providing early warning of IPTV fault-reporting users.
Evaluation of Performance
Comparing the results of the proposed prediction method with the correct classification results allows its effectiveness and accuracy to be evaluated. Panels (a) and (b) of Fig. 3 show that the standard SVM algorithm reaches its best point at a kernel width of about $\sigma = 0.3$. Its fault-reporting and non-fault-reporting recall rates are about 65%, but G-mean and F-measure are generally low, below 0.1. Panels (c) and (d) of Fig. 3 show that the conventional ODR-BSMOTE-SVM algorithm classifies better than the standard SVM and obtains better G-mean and F-measure at Gaussian kernel widths of $\sigma = 0.2$ or less. Panels (a) and (b) of Fig. 4 show that the method of the invention classifies significantly better than the other two algorithms and obtains good G-mean and F-measure at Gaussian kernel widths of $\sigma = 0.2$ or less; after the fine-step search, it identifies $\sigma = 0.21$ as the kernel width giving the approximately optimal classification. The fault-reporting recall rates of the standard SVM, the conventional ODR-BSMOTE-SVM, and the method of the invention are 64.0%, 71.7%, and 92.6% respectively, and the non-fault-reporting recall rates are 69.04%, 71.78%, and 93.08% respectively. The method of the invention therefore achieves better prediction performance.
It should be understood that the above description is not intended to limit the present invention, and any modification, equivalent replacement, or improvement made without departing from the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. A user fault-reporting prediction method under an unbalanced IPTV data set, characterized by comprising the following steps:
Step 1: importing IPTV user viewing records and extracting the numerical indexes, the index variable being denoted z:
let the total number of imported IPTV users be N and the total number of records be D, of which N_1 users report faults and N_2 users do not; the n-th user has D_n records, n = 1, ..., N; the recorded numerical indexes have dimension Q, and z denotes the numerical index variables z_1, z_2, …, z_q, …, z_Q, each index z_q taking the value
Figure FDF0000015077800000011
Step 2: obtaining the averaged record g_n of each user, n = 1, …, N, as follows:
calculating the mean of each of the Q indexes of the n-th user
Figure FDF0000015077800000012
Figure FDF0000015077800000013
that is, after preprocessing each user retains only one record
Figure FDF0000015077800000014
and the data set composed of the N_1 minority fault-reporting users is denoted
Figure FDF0000015077800000015
the data set composed of the N_2 majority non-fault-reporting users is denoted
Figure FDF0000015077800000016
and the data set composed of all users is G = G_min ∪ G_maj;
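The averaging in Step 2 can be sketched as follows. This is only a minimal illustration: the function name `average_records` and the input layout of (user_id, values) pairs are assumptions made for the example, not part of the claimed method.

```python
from collections import defaultdict

def average_records(records):
    """Collapse each user's records into one mean vector g_n.

    `records` is an iterable of (user_id, values) pairs, where `values`
    holds the Q numerical indexes of one viewing record (assumed layout).
    """
    sums = {}
    counts = defaultdict(int)
    for user_id, values in records:
        if user_id not in sums:
            sums[user_id] = [0.0] * len(values)
        for q, v in enumerate(values):
            sums[user_id][q] += v
        counts[user_id] += 1  # D_n: number of records of this user
    # One averaged record per user: sum of each index divided by D_n.
    return {u: [s / counts[u] for s in sums[u]] for u in sums}

# Hypothetical data: user "u1" has two records, user "u2" has one.
records = [("u1", [1.0, 4.0]), ("u1", [3.0, 6.0]), ("u2", [5.0, 5.0])]
g = average_records(records)
```

Splitting the resulting dictionary by fault-reporting label would then give G_min and G_maj.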
Step 3: initializing the balance value β of the Mahalanobis-distance-based ODR algorithm;
Step 4: using the Mahalanobis-distance-based BSMOTE algorithm to add a set Y_bsmote of artificial fault-reporting user samples and determining the balance value α of the BSMOTE algorithm; then using the Mahalanobis-distance-based ODR algorithm to remove a set Y_odr of non-fault-reporting user samples, so as to obtain the balanced data set G_{smote+odr}:
(4-1) determining the added set Y_bsmote of artificial fault-reporting user samples using the Mahalanobis-distance-based BSMOTE:
(4-1-1) calculating the Mahalanobis distance d(g_i, g_j) between each fault-reporting user datum g_i ∈ G_min and every other user datum g_j ∈ G, where g_i ≠ g_j:
Figure FDF0000015077800000017
where Σ^{-1} is the inverse of the covariance matrix of the total user data set G;
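For illustration, a Mahalanobis distance can be computed as below. The sketch assumes the standard square-root form sqrt((g_i − g_j)^T Σ^{-1} (g_i − g_j)), since the formula itself is in a figure that is not reproduced here; the function name `mahalanobis` and the choice to pass Σ^{-1} in as a precomputed list of rows are assumptions of the example.

```python
import math

def mahalanobis(g_i, g_j, cov_inv):
    """Mahalanobis distance between two feature vectors.

    `cov_inv` is the inverse covariance matrix of the whole data set G,
    given as a list of rows; computing it is left to the caller.
    """
    diff = [a - b for a, b in zip(g_i, g_j)]
    # tmp = cov_inv @ diff, written out without external libraries.
    tmp = [sum(row[c] * diff[c] for c in range(len(diff))) for row in cov_inv]
    return math.sqrt(sum(d * t for d, t in zip(diff, tmp)))

# With an identity inverse covariance the result is the Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
d_euclid = mahalanobis([0.0, 0.0], [3.0, 4.0], identity)
# A larger variance on the first index (inverse 0.25) shrinks its contribution.
d_scaled = mahalanobis([0.0, 0.0], [3.0, 4.0], [[0.25, 0.0], [0.0, 1.0]])
```

Because the distance is scaled by the covariance of G, indexes with large spread do not dominate the nearest-neighbor search.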
(4-1-2) according to d(g_i, g_j), using the K-NN algorithm to determine the nearest-neighbor sample set G_{n-KNN} of the n-th fault-reporting user and to determine which sample set that user belongs to, where n = 1, …, N_1:
choosing an odd value K_1 for the K-NN algorithm, and counting how many samples in the nearest-neighbor set of the fault-reporting user belong to non-fault-reporting users;
if it satisfies
Figure FDF0000015077800000018
then the fault-reporting user sample is assigned to the Border sample set G_Border;
if G_{n-KNN} ∩ G_maj = ∅, the fault-reporting user sample is assigned to the Safe sample set G_Safe;
if |G_{n-KNN} ∩ G_maj| = K_1, the fault-reporting user sample is assigned to the Noise sample set G_Noise;
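The partition in (4-1-2) can be sketched as follows. The Border condition is in a figure that is not reproduced here, so the sketch assumes the usual Borderline-SMOTE rule (at least half, but not all, of the K_1 neighbors are majority samples) and treats every remaining case as Safe. It also uses Euclidean distance by default for brevity; a Mahalanobis distance function could be passed as `dist`. All names and sample coordinates are hypothetical.

```python
import math

def classify_minority(n_maj, k1):
    """Label a fault-reporting sample from its count of majority neighbors."""
    if n_maj == k1:
        return "Noise"            # every neighbor is a non-fault-reporting user
    if 2 * n_maj >= k1:
        return "Border"           # assumed rule: K_1/2 <= n_maj < K_1
    return "Safe"                 # assumed: includes the n_maj == 0 case

def partition_minority(minority, majority, k1, dist=math.dist):
    """Return one label per minority sample."""
    labels = []
    for i, g in enumerate(minority):
        # Candidate neighbors: all other minority samples plus all majority samples.
        others = [(dist(g, x), "min") for j, x in enumerate(minority) if j != i]
        others += [(dist(g, x), "maj") for x in majority]
        others.sort(key=lambda t: t[0])
        n_maj = sum(1 for _, c in others[:k1] if c == "maj")
        labels.append(classify_minority(n_maj, k1))
    return labels

minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
            (10.0, 10.0), (5.0, 5.0), (4.5, 4.5)]
majority = [(10.0, 11.0), (11.0, 10.0), (9.0, 10.0), (10.0, 9.0),
            (5.0, 6.0), (6.0, 5.0)]
labels = partition_minority(minority, majority, k1=3)
```

In this toy data the tight cluster near the origin is Safe, the isolated point surrounded by majority samples is Noise, and the two points near (5, 5) are Border.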
(4-1-3) finding, for each g_p, its set G_{min-K2NN} of K_2 random nearest neighbors in G_min, and calculating the attribute difference h_pk between g_p and g_pk, where g_p ∈ G_Border:
Figure FDF0000015077800000021
for each fault-reporting user sample g_p in G_Border = {g_1, ..., g_p, ..., g_P}, finding its K_2 random nearest neighbors in the sample set G_min,
Figure FDF0000015077800000022
where P is the total number of samples in the Border sample set; and calculating all attribute differences h_pk between each sample g_pk and the fault-reporting user sample g_p:
h_pk = g_p − g_pk, where p = 1, …, P; k = 1, …, K_2;
(4-1-4) generating an artificial fault-reporting sample set Y_p for every g_p ∈ G_Border:
if g_pk ∈ G_Noise or g_pk ∈ G_Safe, h_pk is multiplied by a random number r_pk ∈ (0, 0.5); if g_pk ∈ G_Border, h_pk is multiplied by a random number r_pk ∈ (0, 1); the artificial sample y_pk generated for each g_p is then
y_pk = g_p + |r_pk × h_pk|, p = 1, …, P; k = 1, …, K_2;
and the finally generated set of artificial fault-reporting user samples is
Figure FDF0000015077800000023
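The generation rule of (4-1-3) and (4-1-4) can be sketched for a single pair (g_p, g_pk) as follows. The function name `synthesize` and the sample values are hypothetical; `random.uniform` may in principle return an endpoint of the interval, which the sketch ignores.

```python
import random

def synthesize(g_p, g_pk, neighbor_label, rng=random):
    """Create one artificial fault-reporting sample from a Border sample
    g_p and one of its minority-class neighbors g_pk."""
    # Attribute difference as written in the claim: h_pk = g_p - g_pk.
    h_pk = [a - b for a, b in zip(g_p, g_pk)]
    # Border neighbors use r in (0, 1); Noise or Safe neighbors use (0, 0.5).
    upper = 1.0 if neighbor_label == "Border" else 0.5
    r_pk = rng.uniform(0.0, upper)
    # y_pk = g_p + |r_pk * h_pk|, with the absolute value taken per attribute.
    return [a + abs(r_pk * h) for a, h in zip(g_p, h_pk)]

random.seed(0)
y_safe = synthesize([1.0, 2.0], [3.0, 1.0], "Safe")
y_border = synthesize([1.0, 2.0], [3.0, 1.0], "Border")
```

Because of the absolute value, each attribute of the synthetic sample is never smaller than the corresponding attribute of g_p, and a Safe or Noise neighbor keeps the new sample closer to g_p than a Border neighbor can.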
(4-1-5) repeating steps (4-1-3) and (4-1-4) to calculate the newly added sample set Y_p, p = 1, …, P, of each fault-reporting user in G_Border, and determining the balance value α of the BSMOTE algorithm, until the total number of newly added fault-reporting samples contained in the generated Y_bsmote = {Y_1, ..., Y_P} is greater than or equal to (1 − β)N_2 − N_1,
where the balance value α is the smallest integer greater than or equal to
Figure FDF0000015077800000024
(4-2) determining the removed set Y_odr of non-fault-reporting user samples using the Mahalanobis-distance-based ODR:
(4-2-1) calculating the Mahalanobis distance d(g_m, g_l) between each non-fault-reporting user datum g_m and every other non-fault-reporting user datum g_l, where g_m ∈ G_maj, g_l ∈ G_maj and g_l ≠ g_m;
(4-2-2) according to d(g_m, g_l), calculating the association set of each sample g_m in G_maj:
Figure FDF0000015077800000025
the association set C_m is defined as the set of samples in G_maj, other than g_m, whose K_3 nearest neighbors contain g_m;
(4-2-3) classifying g_m according to how the presence or absence of g_m affects the accuracy with which the K-NN algorithm classifies g_mn, where g_mn ∈ C_m:
setting K of the K-NN algorithm in this step to an odd number K_4; calculating the number Num_p of samples g_mn that the K-NN algorithm classifies correctly when g_m is present; then calculating the number Num_{no-p} of samples g_mn that the K-NN algorithm classifies correctly when g_m is absent; comparing Num_p with Num_{no-p} and classifying g_m according to the following criteria:
if Num_p < Num_{no-p}, g_m has a negative effect and is assigned to the Noise sample set S_Noise;
if Num_p = Num_{no-p}, g_m is dispensable and is assigned to the Safe sample set S_Safe;
if Num_p > Num_{no-p}, g_m is useful and is assigned to the Save sample set S_Save;
(4-2-4) deleting samples of S_Noise first and then samples of S_Safe until the non-fault-reporting sample set meets the condition, and finally outputting the entire processed data set G_{smote+odr}:
defining Y_odr as the set of deleted non-fault-reporting sample points, the deleted sample points being taken preferentially from S_Noise and then from S_Safe; the total number of deleted samples in Y_odr is greater than or equal to βN_2, that is, the total number of samples in the processed non-fault-reporting sample set {G_maj − Y_odr} is less than or equal to (1 − β)N_2;
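The comparison rule of (4-2-3) and the deletion priority of (4-2-4) can be sketched as follows. The sketch takes the counts Num_p and Num_{no-p} as given and does not reproduce the K-NN evaluation itself; the function names and the example counts are hypothetical.

```python
def odr_label(num_p, num_no_p):
    """Compare K-NN accuracy on the association set with and without g_m."""
    if num_p < num_no_p:
        return "Noise"   # g_m makes its associates harder to classify
    if num_p == num_no_p:
        return "Safe"    # g_m makes no difference
    return "Save"        # g_m helps, so it is kept

def select_deletions(labels, n_delete):
    """Pick indices to delete: Noise samples first, then Safe ones."""
    noise = [i for i, lab in enumerate(labels) if lab == "Noise"]
    safe = [i for i, lab in enumerate(labels) if lab == "Safe"]
    return (noise + safe)[:n_delete]

# Hypothetical (Num_p, Num_no-p) pairs for five majority samples.
counts = [(2, 3), (3, 3), (4, 3), (1, 2), (5, 5)]
odr_labels = [odr_label(p, q) for p, q in counts]
deleted = select_deletions(odr_labels, n_delete=3)
```

Samples labeled Save never appear in the deletion list, so the list can come up short if Noise and Safe together hold fewer than the requested number of samples.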
after the Mahalanobis-distance-based ODR and BSMOTE algorithms, the entire data set G_{smote+odr} is:
G_{smote+odr} = {G_maj − Y_odr} + {G_min + Y_bsmote};
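As a worked example of the two size conditions, the sketch below computes the minimum number of artificial samples to add, (1 − β)N_2 − N_1, and the minimum number of majority samples to delete, βN_2. The function name and the values N_1 = 100, N_2 = 1000 and β = 0.25 are hypothetical.

```python
import math

def balance_targets(n1, n2, beta):
    """Return (min samples to synthesize, min majority samples to delete)."""
    to_delete = math.ceil(beta * n2)                   # |Y_odr| >= beta * N_2
    to_add = max(0, math.ceil((1 - beta) * n2 - n1))   # |Y_bsmote| >= (1 - beta) * N_2 - N_1
    return to_add, to_delete

to_add, to_delete = balance_targets(n1=100, n2=1000, beta=0.25)
minority_after = 100 + to_add       # |G_min + Y_bsmote|
majority_after = 1000 - to_delete   # |G_maj - Y_odr|
```

With these numbers both classes end up with 750 samples, which shows how β splits the balancing work between oversampling and undersampling.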
Step 5: cleaning the data set G_{smote+odr} using the TOMEK algorithm to obtain the cleaned data G_{smote+odr+tomek}:
(5-1) initializing the set G_{smote+odr+tomek};
(5-2) randomly extracting a sample point g_i from G_{smote+odr}, and finding its nearest neighbor g_j in G_{smote+odr}, g_j ≠ g_i;
(5-3) finding in G_{smote+odr} the nearest neighbor g_k of g_j, g_k ≠ g_j;
(5-4) judging whether g_i == g_k; if so, proceeding to (5-5); otherwise setting g_i = g_j and g_j = g_k and then jumping to step (5-3);
(5-5) judging whether the user fault-reporting categories corresponding to g_i and g_k are consistent; if they are consistent, saving the two sample points to the sample set G_{smote+odr+tomek} and then deleting the two sample points from G_{smote+odr}; if the categories are not consistent, directly deleting the two sample points from G_{smote+odr};
(5-6) judging whether the number of samples in G_{smote+odr} is an even number greater than 0; if so, returning to step (5-2); otherwise ending;
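Steps (5-1) to (5-6) can be sketched as the loop below. Two interpretive assumptions are made: since g_k equals g_i once the test in (5-4) succeeds, the "two sample points" of (5-5) are taken to be the mutual nearest neighbors g_i and g_j; and Euclidean distance with an index tie-break is used by default. The names `nearest` and `tomek_clean` and the toy data are hypothetical.

```python
import math
import random

def nearest(idx, points, alive, dist=math.dist):
    """Index of the nearest still-present neighbor of points[idx]."""
    return min((j for j in alive if j != idx),
               key=lambda j: (dist(points[idx], points[j]), j))

def tomek_clean(points, labels, rng=random):
    alive = set(range(len(points)))          # working copy of G_smote+odr
    kept = []                                # (5-1) G_smote+odr+tomek starts empty
    while len(alive) > 0 and len(alive) % 2 == 0:   # (5-6)
        i = rng.choice(sorted(alive))        # (5-2) random start point
        j = nearest(i, points, alive)
        while True:
            k = nearest(j, points, alive)    # (5-3)
            if k == i:                       # (5-4) mutual nearest neighbors
                break
            i, j = j, k                      # slide along the neighbor chain
        if labels[i] == labels[j]:           # (5-5) same class: keep the pair
            kept.extend([i, j])
        alive -= {i, j}                      # either way, remove the pair
    return sorted(kept)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 0, 1, 1, 1]
kept = tomek_clean(points, labels)
```

In the toy data the middle pair has mixed labels and is dropped, while the two same-class pairs are kept.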
step 6: g is to besmote+odr+tomekThe data in the step (a) is brought into an SVM classifier for training, the kernel width sigma of the SVM classifier is adaptively adjusted by combining the step length of the thickness and the step length, an approximate optimal global point is searched, and the corresponding sigma is determinedoptimal
(6-1) determining the Kernel function of the SVM classifier as a Gaussian Kernel function
Figure FDF0000015077800000031
Figure FDF0000015077800000032
where g_x ∈ G_{smote+odr+tomek},
Figure FDF0000015077800000033
relates to g_x, and σ is the Gaussian kernel width;
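A Gaussian kernel with width σ can be sketched as below. The sketch assumes the common form exp(−‖g_x − g_y‖² / (2σ²)); the exact normalization used in the patent is in a figure that is not reproduced here, so the factor 2 is an assumption, as are the function and variable names.

```python
import math

def gaussian_kernel(g_x, g_y, sigma):
    """Gaussian (RBF) kernel value for two feature vectors.

    Assumed form: exp(-||g_x - g_y||^2 / (2 * sigma^2)).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(g_x, g_y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

k_same = gaussian_kernel([0.0, 0.0], [0.0, 0.0], 0.21)
k_unit = gaussian_kernel([0.0, 0.0], [1.0, 0.0], 1.0)
k_narrow = gaussian_kernel([0.0, 0.0], [0.3, 0.0], 0.21)
k_wide = gaussian_kernel([0.0, 0.0], [0.3, 0.0], 0.30)
```

A smaller σ makes the kernel decay faster with distance, which is why the classification result is sensitive to the kernel width and a step-wise search over σ is worthwhile.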
(6-2) determining the geometric mean G-mean and the F-measure as the accuracy evaluation criteria of the model:
the mathematical expressions of the user fault-reporting recall rate Recall_Min, the user fault-reporting precision Precision_Min, the user non-fault-reporting recall rate Recall_Maj, the geometric mean G-mean and the F-measure are respectively as follows:
Figure FDF0000015077800000034
Figure FDF0000015077800000035
Figure FDF0000015077800000036
where TN is the number of users whose actual class is non-fault-reporting and who are predicted as non-fault-reporting, FN is the number whose actual class is fault-reporting and who are predicted as non-fault-reporting, FP is the number whose actual class is non-fault-reporting and who are predicted as fault-reporting, and TP is the number whose actual class is fault-reporting and who are predicted as fault-reporting;
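The evaluation quantities can be sketched from the four counts as follows. The formulas used are the standard textbook definitions (G-mean as the square root of the product of the two recall rates, F-measure as the equally weighted harmonic mean of minority precision and recall); the patent's own expressions are in figures not reproduced here, so treating them as identical is an assumption. The confusion counts are hypothetical.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Recall, precision, G-mean and F-measure from confusion counts."""
    recall_min = tp / (tp + fn)        # fault-reporting recall
    precision_min = tp / (tp + fp)     # fault-reporting precision
    recall_maj = tn / (tn + fp)        # non-fault-reporting recall
    g_mean = math.sqrt(recall_min * recall_maj)
    f_measure = 2 * recall_min * precision_min / (recall_min + precision_min)
    return {"recall_min": recall_min, "precision_min": precision_min,
            "recall_maj": recall_maj, "g_mean": g_mean, "f_measure": f_measure}

# Hypothetical counts: 100 fault-reporting and 1000 non-fault-reporting users.
m = imbalance_metrics(tp=80, fn=20, fp=100, tn=900)
```

Both scores fall sharply when the minority class is poorly recognized, which plain accuracy would hide on an unbalanced data set.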
(6-3) initializing the penalty factor C, the kernel width σ, the maximum kernel width σ_max and the coarse step size of the SVM classifier, then running the SVM classifier to obtain the best local point of G-mean and F-measure;
changing σ by the coarse step size and updating the best local point each time a better SVM classification result is obtained, ending when σ > σ_max; the best local point has then been selected;
(6-4) adaptively changing the kernel width σ in fine steps starting from the left side of the best local point, and, when G-mean and F-measure reach the approximate optimal global point, obtaining the corresponding approximate optimal kernel width σ_optimal and outputting the classification result;
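The coarse-then-fine search of (6-3) and (6-4) can be sketched as below. Several assumptions are made for the example: the two criteria are folded into a single scalar `score(sigma)` supplied by the caller, the fine pass scans from one coarse step to the left of the best local point to one coarse step to its right, and a synthetic score peaked at 0.21 stands in for training an SVM. The coarse step of 0.1 is hypothetical; the fine step of 0.01 follows the description.

```python
def coarse_to_fine(score, sigma_init, sigma_max, coarse, fine):
    """Return (best local sigma from the coarse pass, refined sigma)."""
    # Coarse pass: step sigma until it exceeds sigma_max, tracking the best.
    best_sigma, best_score = sigma_init, score(sigma_init)
    step = 1
    while sigma_init + step * coarse <= sigma_max + 1e-12:
        sigma = sigma_init + step * coarse
        s = score(sigma)
        if s > best_score:
            best_sigma, best_score = sigma, s
        step += 1
    local_sigma = best_sigma
    # Fine pass: start on the left side of the best local point (assumed window).
    left = max(local_sigma - coarse, fine)
    right = min(local_sigma + coarse, sigma_max)
    step = 0
    while left + step * fine <= right + 1e-12:
        sigma = left + step * fine
        s = score(sigma)
        if s > best_score:
            best_sigma, best_score = sigma, s
        step += 1
    return round(local_sigma, 10), round(best_sigma, 10)

# Synthetic stand-in for an SVM evaluation, peaked at sigma = 0.21.
toy_score = lambda sigma: 1.0 - (sigma - 0.21) ** 2
sigma_local, sigma_optimal = coarse_to_fine(toy_score, 0.1, 1.0, coarse=0.1, fine=0.01)
```

The coarse pass costs few evaluations over the whole range, and the fine pass spends its evaluations only near the best coarse candidate.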
Step 7: inputting the IPTV user data to be predicted into the trained SVM classifier, predicting whether each user will report a fault, and thereby providing early warning of IPTV fault-reporting users.
2. The method as claimed in claim 1, wherein in step 1 the numerical indexes are extracted from records containing the user ID, the indexes and the fault-reporting time information.
3. The method as claimed in claim 1, wherein in step 3 the balance value β takes a value from 0.2 to 0.5.
CN201610392603.XA 2016-06-06 2016-06-06 User fault reporting prediction method under unbalanced IPTV data set Active CN106056160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610392603.XA CN106056160B (en) 2016-06-06 2016-06-06 User fault reporting prediction method under unbalanced IPTV data set

Publications (2)

Publication Number Publication Date
CN106056160A CN106056160A (en) 2016-10-26
CN106056160B true CN106056160B (en) 2022-05-17

Family

ID=57170278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610392603.XA Active CN106056160B (en) 2016-06-06 2016-06-06 User fault reporting prediction method under unbalanced IPTV data set

Country Status (1)

Country Link
CN (1) CN106056160B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180246A (en) * 2017-04-17 2017-09-19 南京邮电大学 A kind of IPTV user's report barrier data synthesis method based on mixed model
CN107392259B (en) * 2017-08-16 2021-12-07 北京京东尚科信息技术有限公司 Method and device for constructing unbalanced sample classification model
CN112235293B (en) * 2020-10-14 2022-09-09 西北工业大学 Over-sampling method for balanced generation of positive and negative samples in malicious flow detection
CN112801151B (en) * 2021-01-18 2022-04-12 桂林电子科技大学 Wind power equipment fault detection method based on improved BSMOTE-Sequence algorithm

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254177A (en) * 2011-04-22 2011-11-23 哈尔滨工程大学 Bearing fault detection method for unbalanced data SVM (support vector machine)
CN102663412A (en) * 2012-02-27 2012-09-12 浙江大学 Power equipment current-carrying fault trend prediction method based on least squares support vector machine
CN103954300A (en) * 2014-04-30 2014-07-30 东南大学 Fiber optic gyroscope temperature drift error compensation method based on optimized least square-support vector machine (LS-SVM)
CN104574220A (en) * 2015-01-30 2015-04-29 国家电网公司 Power customer credit assessment method based on least square support vector machine

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7707129B2 (en) * 2006-03-20 2010-04-27 Microsoft Corporation Text classification by weighted proximal support vector machine based on positive and negative sample sizes and weights

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
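A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data; Gustavo E. A. P. A. Batista et al.; ACM SIGKDD Explorations Newsletter; 2004-06-30; Vol. 6, No. 1; pp. 20-29 *
Overview of classification algorithms for unbalanced data; Tao Xinmin et al.; Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition); 2013-02-28; Vol. 25, No. 1; pp. 101-110, 121 *
SVM classification algorithm for unbalanced data based on the combination of ODR and BSMOTE; Tao Xinmin et al.; Control and Decision; 2011-10-31; Vol. 26, No. 10; pp. 1535-1541 *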

Also Published As

Publication number Publication date
CN106056160A (en) 2016-10-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant