CN106056160A - User fault-reporting prediction method in unbalanced IPTV data set - Google Patents

User fault-reporting prediction method in unbalanced IPTV data set

Info

Publication number
CN106056160A
Authority
CN
China
Prior art keywords
user
odr
sample
barrier
report
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610392603.XA
Other languages
Chinese (zh)
Other versions
CN106056160B (en)
Inventor
周亮
吴志峰
黄若尘
魏昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201610392603.XA
Publication of CN106056160A
Application granted
Publication of CN106056160B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/285 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a user fault-reporting prediction method in an unbalanced IPTV data set. The method mainly comprises the steps of: (1) importing IPTV user viewing records and extracting numerical indexes; (2) averaging the viewing records of each user; (3) initializing an equilibrium value β; (4) deleting non-fault-reporting samples and adding artificial fault-reporting samples using ODR and BSMOTE algorithms based on Mahalanobis distance; (5) deleting newly added samples that have a negative impact on classification using the TOMEK algorithm; (6) training an SVM classifier with adaptively variable kernel width on the reconstructed sample data set; and (7) inputting the IPTV user data to be predicted into the trained SVM detector. Because the improved BSMOTE and ODR algorithms are based on Mahalanobis distance, the information overlap caused by multiple correlations among variables is avoided, the algorithms are unaffected by the different dimensions of sample point attributes, a better sample data reconstruction effect is obtained, the interference of noise points and redundant points with fault-reporting prediction is weakened, and the prediction accuracy of the classifier is significantly increased.

Description

User fault-reporting prediction method in an unbalanced IPTV data set
Technical field
The invention belongs to the technical field of IPTV data analysis and processing, and specifically relates to a user fault-reporting prediction method in an unbalanced IPTV data set.
Background art
With the rapid development of multimedia communication technology, Internet Protocol Television (IPTV), which is based on the broadband Internet, has made it very convenient for ordinary residents to enjoy interactive, personalized, freely customizable video services and value-added application services at home. During video transmission, however, when traditional network Quality of Service (QoS) indicators such as bandwidth, packet loss, delay, and jitter deteriorate, the user's viewing experience is affected to some extent, which leads to user complaints and fault reports. Because fault-reporting users make up a very small proportion of all users, the user data inevitably form an unbalanced data set, and as IPTV technology matures the imbalance ratio will continue to grow.
Predicting whether a user will report a fault is a typical binary classification problem. Mature algorithms conventionally used for this problem include the support vector machine (SVM), but the classification performance of an SVM degrades as the degree of data imbalance increases. The unbalanced data therefore need to be converted into a balanced data set by a data-level algorithm before being classified by the SVM classifier. Traditional data-level algorithms often process the data with the oversampling algorithm BSMOTE (Borderline-Synthetic Minority Oversampling Technique) based on Euclidean distance, or with the undersampling algorithm ODR (Optimization of Decreasing Reduction) based on Euclidean distance. Although these algorithms can improve prediction accuracy, they cannot avoid the information overlap caused by multiple correlations among variables, nor can they guarantee that the generated artificial sample points are not noise points.
The Chinese invention patent with grant number CN102254177B, entitled "A bearing fault detection method based on unbalanced-data SVM", provides an SVM bearing fault detection method for unbalanced data. Its shortcomings are: (1) the algorithm uses Euclidean distance, which is easily affected by the different dimensions of sample point attributes; (2) it lacks an effective method for removing the impure artificial sample points generated by the BSMOTE algorithm; (3) it does not fully exploit the benefit of the SVM kernel width for improving classification accuracy.
Summary of the invention
The object of the invention is to overcome the defects of the prior art: algorithms that are easily affected by the different dimensions of sample point attributes, the lack of an effective method for removing impure artificial sample points, and the failure to fully exploit the benefit of the SVM kernel width for improving classification accuracy.
To this end, the invention proposes a user fault-reporting prediction method in an unbalanced IPTV data set. The method comprises the following steps:
Step 1: import the IPTV user viewing records. Each record contains information such as the user id, the indexes, and the fault-reporting time; the method of the invention extracts only the numerical indexes, whose variable is denoted $z$;
Let the total number of imported IPTV users be $N$ and the total number of records be $D$, of which $N_1$ are fault-reporting users and $N_2$ are non-fault-reporting users; the $n$-th user has $D_n$ ($n = 1, \dots, N$) records. The numerical indexes have dimension $Q$ and are denoted by the variables $z_1, z_2, \dots, z_Q$; each index $z_q$ takes a numerical value.
Step 2: average the records of each user to obtain the record $g_n$ ($n = 1, \dots, N$), as follows:
Compute the mean of each of the $Q$ indexes of the $n$-th user:
$\bar{z}_{nq} = \frac{1}{D_n} \sum_{d=1}^{D_n} z_{dq}, \quad (n = 1, \dots, N;\ q = 1, \dots, Q)$
After this preprocessing each user retains only one record $g_n$. The data set made up of the $N_1$ minority-class fault-reporting users is denoted $G_{min}$, the data set made up of the $N_2$ majority-class non-fault-reporting users is denoted $G_{maj}$, and the data set of all users is $G = G_{min} \cup G_{maj}$.
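The averaging in Step 2 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `average_per_user`, the `(user_id, values)` record layout, and the sample numbers are assumptions made for the example.

```python
from collections import defaultdict


def average_per_user(records):
    """Collapse viewing records into one averaged record per user.

    records: iterable of (user_id, [z_1, ..., z_Q]) pairs (hypothetical layout).
    Returns {user_id: [mean of z_1, ..., mean of z_Q]}.
    """
    sums = {}
    counts = defaultdict(int)
    for user_id, values in records:
        if user_id not in sums:
            sums[user_id] = [0.0] * len(values)
        for q, value in enumerate(values):
            sums[user_id][q] += value
        counts[user_id] += 1  # D_n, the number of records of this user
    return {u: [s / counts[u] for s in sums[u]] for u in sums}


# Made-up data: user "u1" has two records, user "u2" has one (Q = 2).
records = [("u1", [1.0, 10.0]), ("u1", [3.0, 30.0]), ("u2", [5.0, 50.0])]
averaged = average_per_user(records)
```

Each user ends up with a single Q-dimensional record, which is what the later distance computations operate on.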
Step 3: initialize the equilibrium value β of the ODR algorithm based on Mahalanobis distance;
Determine the equilibrium value β. If β is too small, the reduction of the majority class has little effect; if β is too large, valuable majority-class samples are likely to be deleted by mistake. Its range is 0.2 ≤ β ≤ 0.5.
Step 4: use the BSMOTE algorithm based on Mahalanobis distance to add the artificial fault-reporting user sample set $Y_{bsmote}$ and determine the equilibrium value α of the BSMOTE algorithm. Then use the ODR algorithm based on Mahalanobis distance to remove the non-fault-reporting user sample set $Y_{odr}$, yielding the balanced data set $G_{smote+odr}$.
(4-1) Use BSMOTE based on Mahalanobis distance to determine the artificial fault-reporting user sample set $Y_{bsmote}$ to be added:
(4-1-1) Compute the Mahalanobis distance $d(g_i, g_j)$ between each fault-reporting user record $g_i \in G_{min}$ and every other user record $g_j \in G$ ($g_j \neq g_i$):
$d(g_i, g_j) = \sqrt{(g_i - g_j)^T \Sigma^{-1} (g_i - g_j)}, \quad (g_i \neq g_j)$
where $\Sigma$ is the covariance matrix of the total user data set $G$ and $\Sigma^{-1}$ is its inverse.
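As an illustration of the distance used in step (4-1-1), the following sketch evaluates the quadratic form with a given inverse covariance matrix and takes its square root, as in the usual definition of the Mahalanobis distance. The function name and the example matrix are assumptions; in practice the inverse covariance would be estimated from the whole data set $G$.

```python
import math


def mahalanobis(g_i, g_j, cov_inv):
    """Mahalanobis distance between two records.

    cov_inv is the inverse covariance matrix as a nested list (assumed to be
    precomputed from the full data set).
    """
    diff = [a - b for a, b in zip(g_i, g_j)]
    size = len(diff)
    # Quadratic form diff^T * cov_inv * diff
    quad = sum(diff[r] * cov_inv[r][c] * diff[c]
               for r in range(size) for c in range(size))
    return math.sqrt(quad)


# Made-up example: covariance diag(4, 100), so its inverse is diag(1/4, 1/100).
cov_inv = [[0.25, 0.0], [0.0, 0.01]]
d = mahalanobis([2.0, 10.0], [0.0, 0.0], cov_inv)
```

In this example the two attributes differ in scale by a factor of five, yet each contributes equally to the distance, which is the scale independence the method relies on.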
(4-1-2) Based on $d(g_i, g_j)$, use the K-nearest-neighbor (K-NN) algorithm to determine, for the $n$-th ($n = 1, \dots, N_1$) fault-reporting user, its nearest-neighbor sample set $G_{n\text{-}KNN}$, and determine which sample set the user belongs to.
Choose an odd number $K_1$ for the K-NN algorithm and count how many of the nearest neighbors of the fault-reporting user are non-fault-reporting users.
If the Border condition on this count is satisfied (the inequality appears as a formula in the original), the fault-reporting user sample is assigned to the Border sample set $G_{Border}$.
If $|G_{n\text{-}KNN} \cap G_{maj}| = 0$ (the intersection is empty), the fault-reporting user sample is assigned to the Safe sample set $G_{Safe}$.
If $|G_{n\text{-}KNN} \cap G_{maj}| = K_1$, the fault-reporting user sample is assigned to the Noise sample set $G_{Noise}$.
Here $G_{n\text{-}KNN}$ denotes the $K_1$ sample points around the $n$-th fault-reporting user sample point.
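The three-way split of step (4-1-2) might be sketched as follows. The Safe and Noise rules follow the text; the Border rule is an assumption, because the exact inequality is not reproduced here, so this sketch simply treats every remaining case as Border. The function name and input layout are also hypothetical.

```python
def categorize_minority_sample(neighbour_is_majority, k1):
    """Assign a fault-reporting sample to Safe, Noise, or Border.

    neighbour_is_majority: list of k1 booleans, True where the corresponding
    nearest neighbour is a non-fault-reporting (majority) user.
    """
    if len(neighbour_is_majority) != k1:
        raise ValueError("expected exactly k1 neighbours")
    majority_count = sum(neighbour_is_majority)
    if majority_count == 0:
        return "Safe"    # no majority-class neighbours
    if majority_count == k1:
        return "Noise"   # all neighbours are majority-class
    # ASSUMPTION: anything in between is treated as Border; the original
    # inequality for the Border set may be narrower than this.
    return "Border"


category = categorize_minority_sample([True, True, True, False, False], k1=5)
```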
(4-1-3) For each $g_p \in G_{Border}$, collect its $K_2$ random nearest-neighbor samples in $G_{min}$ and compute the attribute difference $h_{pk}$ between $g_p$ and each of them.
For each fault-reporting user sample $g_p$ in $G_{Border} = \{g_1, \dots, g_p, \dots, g_P\}$, collect its $K_2$ random nearest neighbors $g_{pk}$ in the sample set $G_{min}$, where $P$ is the total number of samples in the Border sample set. Compute the difference $h_{pk}$ of all attributes between sample $g_{pk}$ and this fault-reporting user sample $g_p$:
$h_{pk} = g_p - g_{pk}, \quad (p = 1, \dots, P;\ k = 1, \dots, K_2)$
(4-1-4) Generate an artificial fault-reporting sample set $Y_p$ for every $g_p \in G_{Border}$.
If $g_{pk} \in G_{Noise}$ or $g_{pk} \in G_{Safe}$, multiply $h_{pk}$ by a random number $r_{pk} \in (0, 0.5)$. If $g_{pk} \in G_{Border}$, multiply $h_{pk}$ by a random number $r_{pk} \in (0, 1)$. The artificial sample $y_{pk}$ generated for each $g_p$ is then:
$y_{pk} = g_k + |r_{pk} \times h_{pk}|, \quad (p = 1, \dots, P;\ k = 1, \dots, K_2)$
The artificial fault-reporting user sample set finally generated is $Y_{bsmote} = \{Y_1, \dots, Y_P\}$.
(4-1-5) Repeat steps (4-1-3) and (4-1-4) to compute the newly added sample set $Y_p$ ($p = 1, \dots, P$) of each fault-reporting user in $G_{Border}$ and determine the equilibrium value α of the BSMOTE algorithm, until the total number of newly added fault-reporting samples contained in $Y_{bsmote} = \{Y_1, \dots, Y_P\}$ is greater than or equal to $(1 - \beta) N_2 - N_1$.
The equilibrium value α is taken as the smallest integer greater than or equal to the quantity given by the corresponding formula in the original patent.
(4-2) Use ODR based on Mahalanobis distance to determine the non-fault-reporting user sample set $Y_{odr}$ to be removed:
(4-2-1) Compute the Mahalanobis distance $d(g_m, g_l)$ between each non-fault-reporting user record $g_m$ ($g_m \in G_{maj}$) and every other user record $g_l$ ($g_l \in G$; $g_l \neq g_m$).
(4-2-2) Based on $d(g_m, g_l)$, compute the association set $C_m$ of each sample $g_m$ in $G_{maj}$.
The association set $C_m$ is defined as the set of samples in $G_{maj}$, other than $g_m$, whose $K_3$ nearest neighbors include $g_m$. Let $g_{mn}$ denote a sample point $g_n$ ($g_n \in G_{maj}$) whose $K_3$ nearest neighbors include $g_m$; the sample set $C_m$ formed by these $g_{mn}$ is the association set of the sample point $g_m$.
(4-2-3) Classify $g_m$ according to how the presence or absence of $g_m$ affects the accuracy with which the $K_4$-NN algorithm classifies $g_{mn}$ ($g_{mn} \in C_m$).
Choose an odd number $K_4$. Compute $Num_p$, the number of $g_{mn}$ ($g_{mn} \in C_m$) correctly classified by the $K_4$-NN algorithm when $g_m$ is present. Then compute $Num_{no\text{-}p}$, the number correctly classified by $K_4$-NN when $g_m$ is absent. Compare $Num_p$ and $Num_{no\text{-}p}$ and classify $g_m$ by the following criteria:
When $Num_p \le Num_{no\text{-}p}$, $g_m$ has a negative effect and is assigned to the Noise sample set $S_{Noise}$.
When $Num_p = Num_{no\text{-}p}$, $g_m$ is not essential and is assigned to the Safe sample set $S_{Safe}$.
When $Num_p \ge Num_{no\text{-}p}$, $g_m$ is useful and is assigned to the Save sample set $S_{Save}$.
(4-2-4) Delete samples from $S_{Noise}$ first and then from $S_{Safe}$ until the non-fault-reporting sample set meets the condition, and finally output the processed data set $G_{smote+odr}$.
Define $Y_{odr}$ as the set of deleted non-fault-reporting sample points; deleted points are taken first from $S_{Noise}$ and then from $S_{Safe}$. The total number of deleted samples in $Y_{odr}$ is greater than or equal to $\beta N_2$, i.e. the total number of samples in the processed non-fault-reporting sample set $\{G_{maj} - Y_{odr}\}$ is less than or equal to $(1 - \beta) N_2$.
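The categorization and deletion order of steps (4-2-3) and (4-2-4) can be sketched as below. Two points are assumptions of this sketch rather than statements of the patent: the first and third criteria are read as strict inequalities so that the three cases do not overlap, and when the Noise and Safe sets together hold fewer than the target number of samples, the sketch simply deletes all of them. Function names and the sample ids are hypothetical.

```python
import math


def odr_category(num_with, num_without):
    """Compare K4-NN correct counts with and without the candidate sample."""
    if num_with < num_without:   # ASSUMPTION: strict "<" for Noise
        return "Noise"
    if num_with == num_without:
        return "Safe"
    return "Save"                # ASSUMPTION: strict ">" for Save


def select_deletions(categories, beta):
    """Pick majority samples to delete: Noise first, then Safe.

    categories: {sample_id: "Noise" | "Safe" | "Save"} for all majority samples.
    Returns at most ceil(beta * N2) sample ids; Save samples are never deleted.
    """
    target = math.ceil(beta * len(categories))
    noise = [s for s, c in categories.items() if c == "Noise"]
    safe = [s for s, c in categories.items() if c == "Safe"]
    return (noise + safe)[:target]


# Made-up example with four majority samples and beta = 0.5.
categories = {
    "a": odr_category(3, 3),   # Safe
    "b": odr_category(2, 4),   # Noise
    "c": odr_category(5, 1),   # Save
    "d": odr_category(1, 1),   # Safe
}
deleted = select_deletions(categories, beta=0.5)
```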
After the Mahalanobis-distance-based ODR and BSMOTE algorithms, the whole data set $G_{smote+odr}$ is:
$G_{smote+odr} = \{G_{maj} - Y_{odr}\} + \{G_{min} + Y_{bsmote}\}$
Step 5: use the TOMEK algorithm to clean the data set $G_{smote+odr}$, obtaining the cleaned data $G_{smote+odr+tomek}$.
(5-1) Initialize the set $G_{smote+odr+tomek}$.
(5-2) Randomly draw a sample point $g_i$ from $G_{smote+odr}$ and find its nearest neighbor $g_j$ ($g_j \neq g_i$) in $G_{smote+odr}$.
(5-3) Find the nearest neighbor $g_k$ ($g_k \neq g_j$) of $g_j$ in $G_{smote+odr}$.
(5-4) Check whether $g_i == g_k$ holds. If it does, continue with (5-5); otherwise set $g_i = g_j$, $g_j = g_k$ and jump to step (5-3).
(5-5) Check whether the user classes (fault-reporting or non-fault-reporting) corresponding to $g_i$ and $g_k$ are the same. If they are, save the two sample points to the sample set $G_{smote+odr+tomek}$ and then delete them from $G_{smote+odr}$. If the classes differ, delete the two sample points directly from $G_{smote+odr}$.
(5-6) Check whether the number of samples in $G_{smote+odr}$ is an even number greater than 0. If so, repeat from step (5-2); otherwise terminate.
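The cleaning idea of Step 5 can be illustrated with the standard Tomek-link formulation: two points that are each other's nearest neighbor but belong to different classes are both removed. This sketch is not the exact iterative procedure (5-1) to (5-6); it uses squared Euclidean distance as a stand-in (the distance for this step is not specified in the text), and the function names and toy data are assumptions.

```python
def nearest(index, points):
    """Index of the nearest other point (squared Euclidean distance)."""
    best, best_dist = None, float("inf")
    for j, p in enumerate(points):
        if j == index:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(points[index], p))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def remove_tomek_links(points, labels):
    """Drop every pair of mutual nearest neighbours with different labels."""
    drop = set()
    for i in range(len(points)):
        j = nearest(i, points)
        if j is not None and nearest(j, points) == i and labels[i] != labels[j]:
            drop.update((i, j))
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Made-up 1-D data: the first two points form a Tomek link (labels 1 and 0).
points = [[0.0], [0.1], [5.0], [5.2], [9.0]]
labels = [1, 0, 0, 0, 0]
clean_points, clean_labels = remove_tomek_links(points, labels)
```

Mutual nearest neighbors that share a class, such as the points at 5.0 and 5.2, are kept, which matches the "save if consistent, delete if inconsistent" rule of step (5-5).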
Step 6: feed the data in $G_{smote+odr+tomek}$ into the SVM classifier for training, adaptively adjust the kernel width σ of the SVM classifier by combining coarse and fine step sizes, find the approximately optimal global point, and determine the corresponding $\sigma_{optimal}$.
(6-1) Choose the Gaussian kernel as the kernel function of the SVM classifier:
$K(g_x, \bar{g}_x) = \exp\left(-\|g_x - \bar{g}_x\|^2 / 2\sigma^2\right)$
where $g_x \in G_{smote+odr+tomek}$, $\bar{g}_x$ is the mean of $g_x$, and σ is the Gaussian kernel width.
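For reference, the Gaussian kernel of step (6-1) evaluated on two plain Python vectors looks like this. The function name and the sample vectors are illustrative only.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_norm / (2.0 * sigma ** 2))


# Identical vectors give 1; the value decays as the vectors move apart,
# and decays faster for a smaller kernel width sigma.
same = gaussian_kernel([1.0, 2.0], [1.0, 2.0], sigma=0.21)
apart = gaussian_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0)
```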
(6-2) Determine the model evaluation criteria, the geometric mean G-mean and the F-measure:
According to the confusion matrix of the classified sample set,
the fault-reporting recall Recall_Min, the fault-reporting precision Precision_Min, the non-fault-reporting recall Recall_Maj, the geometric mean G-mean, and the F-measure are expressed as follows:
$\mathrm{Recall\_Min} = \frac{TP}{TP + FN}, \quad \mathrm{Precision\_Min} = \frac{TP}{FP + TP}, \quad \mathrm{Recall\_Maj} = \frac{TN}{TN + FP}$
$\text{G-mean} = \sqrt{\mathrm{Recall\_Min} \times \mathrm{Recall\_Maj}}$
$\text{F-measure} = \frac{2 \times \mathrm{Recall\_Min} \times \mathrm{Precision\_Min}}{\mathrm{Recall\_Min} + \mathrm{Precision\_Min}}$
G-mean maximizes the accuracy on both classes while keeping the classification accuracy of fault-reporting and non-fault-reporting users balanced; its value is large only when both Recall_Min and Recall_Maj are high. The F-measure is a classification index that takes both recall and precision into account. It reflects the classifier's overall performance on fault-reporting and non-fault-reporting users, but places more emphasis on the classification of fault-reporting samples.
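The evaluation criteria of step (6-2) can be computed directly from the four confusion-matrix counts. In the sketch below the function name and the counts are made up for illustration; TP and FN refer to fault-reporting (minority) users, TN and FP to non-fault-reporting (majority) users.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Recall, precision, G-mean, and F-measure from confusion-matrix counts."""
    recall_min = tp / (tp + fn)        # fault-reporting recall
    precision_min = tp / (fp + tp)     # fault-reporting precision
    recall_maj = tn / (tn + fp)        # non-fault-reporting recall
    g_mean = math.sqrt(recall_min * recall_maj)
    f_measure = 2 * recall_min * precision_min / (recall_min + precision_min)
    return {
        "recall_min": recall_min,
        "precision_min": precision_min,
        "recall_maj": recall_maj,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }


# Made-up counts: 10 fault-reporting users and 90 non-fault-reporting users.
metrics = imbalance_metrics(tp=8, fn=2, fp=4, tn=86)
```

Because G-mean multiplies the two recalls, a classifier that labels every user as non-fault-reporting scores zero even though its raw accuracy on an unbalanced set would look high.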
(6-3) Initialize the SVM classifier penalty factor, the kernel width σ, the maximum kernel width $\sigma_{max}$, and the coarse step size, then run the SVM classifier to obtain the best local point of G-mean and F-measure.
Vary σ with the coarse step size; each time a better SVM classification result is obtained, update the best local point, until $\sigma > \sigma_{max}$. The best local point found so far is then selected.
(6-4) Starting from the left of the best local point, vary the kernel width σ with a fine adaptive step size; when G-mean and F-measure reach the approximately optimal global point, obtain the corresponding approximately optimal kernel width $\sigma_{optimal}$ and output the classification result.
Step 7: input the IPTV user data to be predicted into the trained SVM detector to predict whether the user will report a fault, thereby providing early warning of IPTV fault-reporting users.
Further, in step 1 above, the numerical indexes are those extracted from records containing the user id, indexes, and fault-reporting time.
Further, in step 3, the range of the equilibrium value β is 0.2 ≤ β ≤ 0.5.
Further, the confusion matrix of the classified sample set in step 6-2 is:
Compared with the prior art, the invention has the following beneficial effects:
The improved BSMOTE and ODR algorithms used in the invention are based on Mahalanobis distance. They avoid the information overlap caused by multiple correlations among variables and are unaffected by the different dimensions of sample point attributes, so a better sample data reconstruction effect is obtained.
The BSMOTE and ODR algorithms and the TOMEK data-cleaning algorithm used in the invention weaken the interference of noise points and redundant points with fault-reporting prediction on the one hand, and strengthen the contribution of the few effective minority sample points to correct classification on the other. They also remove the impure points generated by the BSMOTE algorithm that are hard to separate at the SVM classification boundary, which greatly improves the prediction accuracy of the classifier.
The algorithm used in the invention, which adaptively adjusts the SVM kernel width σ by combining coarse and fine step sizes, significantly improves prediction accuracy at the cost of only a very small loss of precision in σ, while keeping the algorithm efficient to run.
Brief description of the drawings
Fig. 1 is a flow chart of the user fault-reporting prediction method in an unbalanced IPTV data set according to the invention.
Fig. 2 is a flow chart of the SVM with adaptively variable kernel width used in the invention.
Fig. 3 is a schematic diagram of the fault-reporting prediction results of the standard SVM and the traditional ODR-BSMOTE-SVM in the embodiment of the invention.
Fig. 4 is a schematic diagram of the fault-reporting prediction results of the improved algorithm in the embodiment of the invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings.
To better illustrate the user fault-reporting prediction method in an unbalanced IPTV data set according to the invention, it is applied to the early warning of IPTV fault reports. The training and test data used in the invention come from the IPTV user data of Jiangsu Telecom for the whole province: 439050 users with 4723101 viewing records, including 4871 fault-reporting users with 48172 viewing records and 434179 non-fault-reporting users with 4674929 viewing records. The imbalance ratio between the minority class and the majority class is as high as 1:89. The numerical index dimension of each user viewing record is 10, and in this example $K_1 = K_3 = K_4 = 5$, $K_2 = 3$, the initial equilibrium value β = 0.3, and the penalty factor is 1000.
Fault-reporting user prediction proceeds according to the flow described in the summary of the invention (as shown in Fig. 1).
Step 1: import the IPTV user viewing records. Each record contains information such as the user id, the indexes, and the fault-reporting time; the method of the invention extracts only the numerical indexes, whose variable is denoted $z$;
In this example, the total number of imported IPTV users is $N = 439050$ and the total number of records is $D = 4723101$, of which $N_1 = 4871$ are fault-reporting users and $N_2 = 434179$ are non-fault-reporting users; the $n$-th user has $D_n$ ($n = 1, \dots, 439050$) records. The numerical indexes have dimension $Q = 10$ and are denoted by the variables $z_1, z_2, \dots, z_{10}$; each index $z_q$ takes a numerical value.
Step 2: average the records of each user to obtain the record $g_n$ ($n = 1, \dots, N$), as follows:
Compute the mean of each of the $Q$ indexes of the $n$-th user:
$\bar{z}_{nq} = \frac{1}{D_n} \sum_{d=1}^{D_n} z_{dq}, \quad (n = 1, \dots, N;\ q = 1, \dots, Q)$
After this preprocessing each user retains only one record $g_n$. The data set made up of the $N_1 = 4871$ minority-class fault-reporting users is denoted $G_{min}$, the data set made up of the $N_2 = 434179$ majority-class non-fault-reporting users is denoted $G_{maj}$, and the data set of all users is $G = G_{min} \cup G_{maj}$.
Step 3: initialize the equilibrium value β = 0.3 of the ODR algorithm based on Mahalanobis distance;
Step 4: use the BSMOTE algorithm based on Mahalanobis distance to add the artificial fault-reporting user sample set $Y_{bsmote}$ and determine the equilibrium value α of the BSMOTE algorithm. Then use the ODR algorithm based on Mahalanobis distance to remove the non-fault-reporting user sample set $Y_{odr}$, yielding the balanced data set $G_{smote+odr}$.
(4-1) Use BSMOTE based on Mahalanobis distance to determine the artificial fault-reporting user sample set $Y_{bsmote}$ to be added:
(4-1-1) Compute the Mahalanobis distance $d(g_i, g_j)$ between each fault-reporting user record $g_i \in G_{min}$ and every other user record $g_j \in G$ ($g_j \neq g_i$):
$d(g_i, g_j) = \sqrt{(g_i - g_j)^T \Sigma^{-1} (g_i - g_j)}, \quad (g_i \neq g_j)$
where $\Sigma$ is the covariance matrix of the total user data set $G$ and $\Sigma^{-1}$ is its inverse.
(4-1-2) Based on $d(g_i, g_j)$, use the K-nearest-neighbor (K-NN) algorithm to determine, for the $n$-th ($n = 1, \dots, N_1$) fault-reporting user, its nearest-neighbor sample set $G_{n\text{-}KNN}$, and determine which sample set the user belongs to.
Choose the odd number $K_1 = 5$ for the K-NN algorithm and count how many of the nearest neighbors of the fault-reporting user are non-fault-reporting users.
If the Border condition on this count is satisfied (the inequality appears as a formula in the original), the fault-reporting user sample is assigned to the Border sample set $G_{Border}$.
If $|G_{n\text{-}KNN} \cap G_{maj}| = 0$ (the intersection is empty), the fault-reporting user sample is assigned to the Safe sample set $G_{Safe}$.
If $|G_{n\text{-}KNN} \cap G_{maj}| = 5$, the fault-reporting user sample is assigned to the Noise sample set $G_{Noise}$.
(4-1-3) For each $g_p \in G_{Border}$, collect its $K_2 = 3$ random nearest-neighbor samples in $G_{min}$ and compute the attribute difference $h_{pk}$ between $g_p$ and each of them.
For each fault-reporting user sample $g_p$ in $G_{Border} = \{g_1, \dots, g_p, \dots, g_P\}$, collect its $K_2$ random nearest neighbors $g_{pk}$ in the sample set $G_{min}$. Compute the difference $h_{pk}$ of all attributes between sample $g_{pk}$ and this fault-reporting user sample $g_p$:
$h_{pk} = g_p - g_{pk}, \quad (p = 1, \dots, P;\ k = 1, \dots, K_2)$
(4-1-4) Generate an artificial fault-reporting sample set $Y_p$ for every $g_p \in G_{Border}$.
If $g_{pk} \in G_{Noise}$ or $g_{pk} \in G_{Safe}$, multiply $h_{pk}$ by a random number $r_{pk} \in (0, 0.5)$. If $g_{pk} \in G_{Border}$, multiply $h_{pk}$ by a random number $r_{pk} \in (0, 1)$. The artificial sample $y_{pk}$ generated for each $g_p$ is then:
$y_{pk} = g_k + |r_{pk} \times h_{pk}|, \quad (p = 1, \dots, P;\ k = 1, \dots, K_2)$
The artificial fault-reporting user sample set finally generated is $Y_{bsmote} = \{Y_1, \dots, Y_P\}$.
(4-1-5) Repeat steps (4-1-3) and (4-1-4) to compute the newly added sample set $Y_p$ ($p = 1, \dots, P$) of each fault-reporting user in $G_{Border}$ and determine the equilibrium value α of the BSMOTE algorithm, until the total number of newly added fault-reporting samples contained in $Y_{bsmote} = \{Y_1, \dots, Y_P\}$ is greater than or equal to $(1 - \beta) N_2 - N_1 = 299054$.
The equilibrium value α is taken as the smallest integer greater than or equal to the quantity given by the corresponding formula in the original patent.
(4-2) Use ODR based on Mahalanobis distance to determine the non-fault-reporting user sample set $Y_{odr}$ to be removed:
(4-2-1) Compute the Mahalanobis distance $d(g_m, g_l)$ between each non-fault-reporting user record $g_m$ ($g_m \in G_{maj}$) and every other user record $g_l$ ($g_l \in G$; $g_l \neq g_m$).
(4-2-2) Based on $d(g_m, g_l)$, compute the association set $C_m$ of each sample $g_m$ in $G_{maj}$.
(4-2-3) Classify $g_m$ according to how the presence or absence of $g_m$ affects the accuracy with which the $K_4$-NN algorithm classifies $g_{mn}$ ($g_{mn} \in C_m$).
Choose the odd number $K_4 = 5$. Compute $Num_p$, the number of $g_{mn}$ ($g_{mn} \in C_m$) correctly classified by the $K_4$-NN algorithm when $g_m$ is present. Then compute $Num_{no\text{-}p}$, the number correctly classified by $K_4$-NN when $g_m$ is absent. Compare $Num_p$ and $Num_{no\text{-}p}$ and classify $g_m$ by the following criteria:
When $Num_p \le Num_{no\text{-}p}$, $g_m$ has a negative effect and is assigned to the Noise sample set $S_{Noise}$.
When $Num_p = Num_{no\text{-}p}$, $g_m$ is not essential and is assigned to the Safe sample set $S_{Safe}$.
When $Num_p \ge Num_{no\text{-}p}$, $g_m$ is useful and is assigned to the Save sample set $S_{Save}$.
(4-2-4) Delete samples from $S_{Noise}$ first and then from $S_{Safe}$ until the non-fault-reporting sample set meets the condition, and finally output the processed data set $G_{smote+odr}$.
Define $Y_{odr}$ as the set of deleted non-fault-reporting sample points; deleted points are taken first from $S_{Noise}$ and then from $S_{Safe}$. The total number of deleted samples in $Y_{odr}$ is greater than or equal to $\beta N_2 = 130254$, i.e. the total number of samples in the processed non-fault-reporting sample set $\{G_{maj} - Y_{odr}\}$ is less than or equal to $(1 - \beta) N_2 = 303925$.
After the Mahalanobis-distance-based ODR and BSMOTE algorithms, the whole data set $G_{smote+odr}$ is:
$G_{smote+odr} = \{G_{maj} - Y_{odr}\} + \{G_{min} + Y_{bsmote}\}$
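The sample counts quoted in this embodiment can be checked with a short calculation. The values of $N_1$, $N_2$, and β are taken from the text; rounding to the nearest integer is an assumption of this sketch, chosen because it reproduces the quoted figures, and the variable names are illustrative.

```python
from fractions import Fraction

N1 = 4871                 # fault-reporting (minority) users
N2 = 434179               # non-fault-reporting (majority) users
beta = Fraction(3, 10)    # equilibrium value 0.3, kept exact to avoid float error

# Majority samples to delete: beta * N2 = 130253.7, rounded to 130254
min_deleted = round(beta * N2)

# Majority samples left after deletion: (1 - beta) * N2 = 303925.3, rounded to 303925
max_majority_left = round((1 - beta) * N2)

# Artificial minority samples needed: (1 - beta) * N2 - N1 = 299054.3, rounded to 299054
min_synthetic = round((1 - beta) * N2 - N1)

# Imbalance ratio 1 : (N2 / N1), truncated to a whole number
imbalance = N2 // N1
```

After both steps the two classes are of nearly equal size: about 303925 majority samples against 4871 + 299054 = 303925 minority samples.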
Step 5: use the TOMEK algorithm to clean the data set $G_{smote+odr}$, obtaining the cleaned data $G_{smote+odr+tomek}$.
(5-1) Initialize the set $G_{smote+odr+tomek}$.
(5-2) Randomly draw a sample point $g_i$ from $G_{smote+odr}$ and find its nearest neighbor $g_j$ ($g_j \neq g_i$) in $G_{smote+odr}$.
(5-3) Find the nearest neighbor $g_k$ ($g_k \neq g_j$) of $g_j$ in $G_{smote+odr}$.
(5-4) Check whether $g_i == g_k$ holds. If it does, continue with (5-5); otherwise set $g_i = g_j$, $g_j = g_k$ and jump to step (5-3).
(5-5) Check whether the user classes (fault-reporting or non-fault-reporting) corresponding to $g_i$ and $g_k$ are the same. If they are, save the two sample points to the sample set $G_{smote+odr+tomek}$ and then delete them from $G_{smote+odr}$. If the classes differ, delete the two sample points directly from $G_{smote+odr}$.
(5-6) Check whether the number of samples in $G_{smote+odr}$ is an even number greater than 0. If so, repeat from step (5-2); otherwise terminate.
Step 6: feed the data in $G_{smote+odr+tomek}$ into the SVM classifier for training, adaptively adjust the kernel width σ of the SVM classifier by combining coarse and fine step sizes, find the approximately optimal global point, and determine the corresponding $\sigma_{optimal}$.
(6-1) Choose the Gaussian kernel as the kernel function of the SVM classifier:
$K(g_x, \bar{g}_x) = \exp\left(-\|g_x - \bar{g}_x\|^2 / 2\sigma^2\right)$
(6-2) Determine the model evaluation criteria, the geometric mean G-mean and the F-measure.
(6-3) Initialize the SVM classifier penalty factor, the kernel width σ, the maximum kernel width $\sigma_{max}$, and the coarse step size, then run the SVM classifier to obtain the best local point of G-mean and F-measure.
Initialize the SVM classifier penalty factor to 1000, the kernel width σ = 0.1, $\sigma_{max} = 2$, and the coarse step size to 0.1. Vary σ with the coarse step size; each time a better SVM classification result is obtained, update the best local point, until σ > 2, which gives the best local point of G-mean and F-measure.
(6-4) Starting from the left of the best local point, vary the kernel width σ with a fine adaptive step size; when G-mean and F-measure reach the approximately optimal global point, obtain the corresponding approximately optimal kernel width $\sigma_{optimal}$ and output the classification result.
As shown in the flow chart of Fig. 2, with the fine step size set to 0.01, σ is varied starting from the left of the best local point σ = 0.2 obtained by the method of the invention, finally giving the approximately optimal global point and the corresponding $\sigma_{optimal} = 0.21$.
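The coarse-then-fine search over σ can be sketched as a two-pass scan. The step sizes and range (0.1 to 2 in steps of 0.1, then steps of 0.01) come from this embodiment; everything else is an assumption of the sketch: the `score` callable stands in for training the SVM at a given σ and returning G-mean or F-measure, the toy score below is invented so that the example runs, the fine pass uses a fixed rather than adaptive step, and it scans one coarse step to either side of the coarse optimum.

```python
def coarse_to_fine_search(score, sigma_start, sigma_max, coarse_step, fine_step):
    """Return (best sigma from the coarse pass, best sigma from the fine pass)."""

    def scan(lo, hi, step):
        best_sigma, best_score = lo, score(lo)
        steps = int(round((hi - lo) / step))
        for i in range(1, steps + 1):
            sigma = round(lo + i * step, 10)   # avoid accumulating float error
            current = score(sigma)
            if current > best_score:           # keep the best point seen so far
                best_sigma, best_score = sigma, current
        return best_sigma

    coarse_sigma = scan(sigma_start, sigma_max, coarse_step)
    # ASSUMPTION: refine within one coarse step on either side of the coarse optimum.
    lo = max(coarse_sigma - coarse_step, fine_step)
    hi = min(coarse_sigma + coarse_step, sigma_max)
    fine_sigma = scan(lo, hi, fine_step)
    return coarse_sigma, fine_sigma


def toy_score(sigma):
    """Hypothetical stand-in for the G-mean of an SVM trained with this sigma."""
    return -(sigma - 0.21) ** 2


coarse_sigma, fine_sigma = coarse_to_fine_search(
    toy_score, sigma_start=0.1, sigma_max=2.0, coarse_step=0.1, fine_step=0.01)
```

With these settings the coarse pass needs 20 evaluations and the fine pass about 20 more, compared with roughly 190 evaluations for a single scan of the whole range at the fine step size.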
Step 7: input the IPTV user data to be predicted into the trained SVM detector to predict whether the user will report a fault, thereby providing early warning of IPTV fault-reporting users.
Performance evaluation
The results obtained with the prediction method of the invention are compared with the correct class labels, so that the effectiveness and accuracy of the method can be evaluated. Fig. 3 (a) and (b) show that the standard SVM algorithm obtains its best result near kernel width σ = 0.3. At this point the recall rates for fault-reporting and non-fault-reporting users are both about 65%, but the values of G-mean and F-measure are very low, all below 0.1. Fig. 3 (c) and (d) show that the traditional ODR-BSMOTE-SVM algorithm classifies better than the standard SVM and obtains relatively good G-mean and F-measure values for Gaussian kernel widths σ below 0.2. Fig. 4 (a) and (b) show that the method of the invention classifies markedly better than the first two algorithms and obtains good G-mean and F-measure values for Gaussian kernel widths σ below 0.2. Fig. 4 (a) and (b) also show that, after refinement with the fine step size, the method of the invention determines that the approximately optimal classification is reached at kernel width σ = 0.21. The fault-reporting recall rates measured for the standard SVM, the traditional ODR-BSMOTE-SVM, and the method of the invention are 64.0%, 71.7%, and 92.6% respectively, and the non-fault-reporting recall rates are 69.04%, 71.78%, and 93.08% respectively. The method of the invention therefore achieves better prediction performance.
It should be noted that the above is not intended to limit the invention; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (4)

1. A user fault-reporting prediction method for an unbalanced IPTV data set, characterized by comprising the following steps:
Step 1: Import the IPTV users' viewing records and extract the numeric indicators, whose variable is denoted z;
Let the total number of imported IPTV users be N and the total number of records be D, where $N_1$ users report faults and $N_2$ users do not, and the n-th user has $D_n$ (n = 1, ..., N) records; the numeric indicators have dimension Q and are denoted $z_1, z_2, \ldots, z_Q$, each indicator $z_q$ taking a numeric value;
Step 2: Obtain the averaged record $g_n$ (n = 1, ..., N) of each user, specifically as follows:
Compute the mean of each of the Q indicators for the n-th user:
$$\overline{z_n^q} = \frac{1}{D_n} \sum_{d=1}^{D_n} z_d^q, \quad (n = 1, \ldots, N;\ q = 1, \ldots, Q)$$
After this preprocessing each user retains only one record $g_n$, formed from the Q indicator means. Let the data set made up of the $N_1$ minority-class fault-reporting users be $G_{min}$ and the data set made up of the $N_2$ majority-class non-fault-reporting users be $G_{maj}$; the data set of all users is $G = G_{min} \cup G_{maj}$;
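The averaging in Step 2 can be sketched as follows. This is a minimal illustration under assumptions: the function name `average_records` and the `(user_id, indicator_list)` record layout are hypothetical and are not specified by the claim.

```python
from collections import defaultdict

def average_records(records, num_indicators):
    """Collapse each user's records into one mean vector g_n.

    records: iterable of (user_id, [z_1, ..., z_Q]) pairs (hypothetical layout).
    Returns {user_id: [mean of z_1, ..., mean of z_Q]}.
    """
    sums = defaultdict(lambda: [0.0] * num_indicators)
    counts = defaultdict(int)
    for user_id, z in records:
        for q in range(num_indicators):
            sums[user_id][q] += z[q]
        counts[user_id] += 1  # D_n, the number of records of user n
    return {u: [s / counts[u] for s in sums[u]] for u in sums}

records = [("u1", [1.0, 10.0]), ("u1", [3.0, 30.0]), ("u2", [5.0, 50.0])]
g = average_records(records, num_indicators=2)
```

Here user `u1` has two records and `u2` has one, so each user ends up with exactly one Q-dimensional record.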
Step 3: Initialize the balance value β of the Mahalanobis-distance-based ODR algorithm;
Step 4: Use the Mahalanobis-distance-based BSMOTE algorithm to add a set of synthetic fault-reporting user samples $Y_{bsmote}$ and determine the balance value α of the BSMOTE algorithm; then use the Mahalanobis-distance-based ODR algorithm to remove a set of non-fault-reporting user samples $Y_{odr}$, yielding the balanced data set $G_{smote+odr}$;
(4-1) Use Mahalanobis-distance-based BSMOTE to determine the added set of synthetic fault-reporting user samples $Y_{bsmote}$:
(4-1-1) Compute the Mahalanobis distance $d(g_i, g_j)$ between each fault-reporting user record $g_i \in G_{min}$ and every other user record $g_j \in G$ ($g_j \neq g_i$):
$$d(g_i, g_j) = \sqrt{(g_i - g_j)^T \Sigma^{-1} (g_i - g_j)}, \quad (g_i \neq g_j)$$
where $\Sigma$ is the covariance matrix of the total user data set G and $\Sigma^{-1}$ is its inverse;
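The distance in (4-1-1) can be sketched in plain Python as below. The helper names `covariance`, `solve`, and `mahalanobis` are hypothetical; the sketch assumes the usual square-root form of the Mahalanobis distance, a sample covariance normalised by n - 1, and a non-singular covariance matrix. Rather than forming the inverse explicitly, it solves a linear system.

```python
import math

def covariance(rows):
    """Sample covariance matrix (n - 1 normalisation) of equal-length vectors."""
    n, q = len(rows), len(rows[0])
    mean = [sum(r[a] for r in rows) / n for a in range(q)]
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
             for b in range(q)] for a in range(q)]

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    q = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(q)]
    for col in range(q):
        pivot = max(range(col, q), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(q):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][q] / aug[i][i] for i in range(q)]

def mahalanobis(gi, gj, cov):
    """sqrt((gi - gj)^T cov^{-1} (gi - gj)); assumes cov is non-singular."""
    diff = [a - b for a, b in zip(gi, gj)]
    return math.sqrt(sum(d * s for d, s in zip(diff, solve(cov, diff))))

G = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
cov = covariance(G)
d = mahalanobis(G[3], G[0], cov)
```

With a diagonal covariance such as `[[4.0, 0.0], [0.0, 1.0]]`, the distance reduces to a Euclidean distance in which each coordinate is scaled by its standard deviation.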
(4-1-2) Based on $d(g_i, g_j)$, use the K-NN algorithm to determine the nearest-neighbor sample set $G_{n-KNN}$ of the n-th (n = 1, ..., $N_1$) fault-reporting user, and determine which sample set that user belongs to;
Choose an odd value $K_1$ for the K-NN algorithm and count how many samples in the nearest-neighbor set of the fault-reporting user belong to the non-fault-reporting class;
If the Border condition is satisfied, the fault-reporting user sample is assigned to the Border sample set $G_{Border}$;
If $G_{n-KNN} \cap G_{maj} = \varnothing$, i.e. none of the neighbors is a non-fault-reporting user, the fault-reporting user sample is assigned to the Safe sample set $G_{Safe}$;
If $|G_{n-KNN} \cap G_{maj}| = K_1$, the fault-reporting user sample is assigned to the Noise sample set $G_{Noise}$;
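The neighbor-based split in (4-1-2) can be sketched as follows. This is an illustration rather than the patented implementation: `classify_minority` is a hypothetical helper, Euclidean distance (`math.dist`) stands in for the Mahalanobis distance to keep the block short, and any sample with a mixed neighborhood is assumed to be Border, because the exact Border condition is not legible in the text.

```python
import math

def classify_minority(g_min, g_maj, k1, dist=math.dist):
    """Split minority samples into Border / Safe / Noise by their k1 neighbours."""
    labelled = [(g, "min") for g in g_min] + [(g, "maj") for g in g_maj]
    groups = {"Border": [], "Safe": [], "Noise": []}
    for idx, g in enumerate(g_min):
        # Minority samples come first in `labelled`, so idx addresses g itself.
        others = [item for j, item in enumerate(labelled) if j != idx]
        nearest = sorted(others, key=lambda item: dist(g, item[0]))[:k1]
        n_maj = sum(1 for _, label in nearest if label == "maj")
        if n_maj == 0:
            groups["Safe"].append(g)      # no majority neighbour at all
        elif n_maj == k1:
            groups["Noise"].append(g)     # surrounded by majority samples
        else:
            groups["Border"].append(g)    # assumed: any mixed neighbourhood
    return groups

g_min = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (10, 10)]
g_maj = [(5, 6), (6, 5), (10, 11), (11, 10), (9, 10)]
groups = classify_minority(g_min, g_maj, k1=3)
```

In this toy data the four clustered minority points are Safe, the point at (5, 5) has two majority neighbors and one minority neighbor, and the point at (10, 10) is surrounded by majority samples.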
(4-1-3) For each $g_p \in G_{Border}$, collect its $K_2$ random nearest neighbors in $G_{min}$ and compute the attribute difference $h_{pk}$ between $g_p$ and each neighbor $g_{pk}$;
For each fault-reporting user sample $g_p$ in $G_{Border} = \{g_1, \ldots, g_p, \ldots, g_P\}$, where P is the total number of samples in the Border set, collect its $K_2$ random nearest neighbors $g_{pk}$ (k = 1, ..., $K_2$) in the sample set $G_{min}$; then compute the difference $h_{pk}$ over all attributes between sample $g_{pk}$ and the fault-reporting user sample $g_p$:
$$h_{pk} = g_p - g_{pk}, \quad (p = 1, \ldots, P;\ k = 1, \ldots, K_2)$$
(4-1-4) For every $g_p \in G_{Border}$, generate a synthetic fault-reporting sample set $Y_p$:
If $g_{pk} \in G_{Noise}$ or $g_{pk} \in G_{Safe}$, multiply $h_{pk}$ by a random number $r_{pk} \in (0, 0.5)$; if $g_{pk} \in G_{Border}$, multiply $h_{pk}$ by a random number $r_{pk} \in (0, 1)$. The synthetic sample $y_{pk}$ generated for each $g_p$ is then:
$$y_{pk} = g_k + |r_{pk} \times h_{pk}|, \quad (p = 1, \ldots, P;\ k = 1, \ldots, K_2)$$
The synthetic fault-reporting user sample set finally generated is:
(4-1-5) Repeat steps (4-1-3) and (4-1-4) to compute the newly added sample set $Y_p$ (p = 1, ..., P) of each fault-reporting user in $G_{Border}$ and determine the balance value α of the BSMOTE algorithm, until the total number of new fault-reporting samples contained in the generated $Y_{bsmote} = \{Y_1, \ldots, Y_P\}$ is greater than or equal to $(1 - \beta) N_2 - N_1$;
where the balance value α takes the smallest integer value greater than or equal to the corresponding threshold;
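The sample generation of (4-1-3) and (4-1-4) can be sketched as below. Note the assumption: the sketch uses the conventional SMOTE interpolation $g_p + r_{pk}(g_{pk} - g_p)$, which moves from $g_p$ toward its neighbor, whereas the formula printed in the claim writes the base point and the sign differently. The function `synthesize` and its argument layout are hypothetical.

```python
import random

def synthesize(g_p, neighbours, neighbour_groups, rng):
    """Create one synthetic sample per neighbour of a Border sample g_p."""
    samples = []
    for g_pk, group in zip(neighbours, neighbour_groups):
        h_pk = [a - b for a, b in zip(g_p, g_pk)]    # h_pk = g_p - g_pk
        upper = 1.0 if group == "Border" else 0.5     # Noise/Safe: r in (0, 0.5)
        r_pk = rng.uniform(0.0, upper)
        # Assumed interpolation: move from g_p towards g_pk by the fraction r_pk.
        samples.append([a - r_pk * h for a, h in zip(g_p, h_pk)])
    return samples

rng = random.Random(0)
y_p = synthesize([0.0, 0.0], [[2.0, 0.0], [0.0, 4.0]], ["Border", "Safe"], rng)
```

Capping $r_{pk}$ at 0.5 for Safe and Noise neighbors keeps the synthetic point in the half of the segment nearer to the Border sample.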
(4-2) Use Mahalanobis-distance-based ODR to determine the set of removed non-fault-reporting user samples $Y_{odr}$:
(4-2-1) Compute the Mahalanobis distance $d(g_m, g_l)$ between each non-fault-reporting user record $g_m \in G_{maj}$ and every other user record $g_l \in G$ ($g_l \neq g_m$);
(4-2-2) Based on $d(g_m, g_l)$, compute the association set $C_m$ of each sample $g_m$ in $G_{maj}$;
The association set $C_m$ is defined as the set of samples in $G_{maj}$, other than $g_m$, whose $K_3$ nearest neighbors contain $g_m$;
(4-2-3) Classify $g_m$ according to how its presence or absence affects the accuracy of the $K_4$-NN algorithm on the samples $g_{mn}$ ($g_{mn} \in C_m$);
Choose an odd value $K_4$. Compute $Num_p$, the number of samples $g_{mn} \in C_m$ that the $K_4$-NN algorithm classifies correctly when $g_m$ is present; then compute $Num_{no-p}$, the number that $K_4$-NN classifies correctly when $g_m$ is absent. Compare $Num_p$ and $Num_{no-p}$ and classify $g_m$ by the following criteria:
If $Num_p < Num_{no-p}$, $g_m$ has a negative effect and is assigned to the Noise sample set $S_{Noise}$;
If $Num_p = Num_{no-p}$, $g_m$ is not essential and is assigned to the Safe sample set $S_{Safe}$;
If $Num_p > Num_{no-p}$, $g_m$ is useful and is assigned to the Save sample set $S_{Save}$;
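The criterion in (4-2-2) and (4-2-3) can be sketched as follows. The helpers `nearest`, `knn_correct`, and `odr_group` are hypothetical, Euclidean distance stands in for the Mahalanobis distance, and the association set is drawn from the majority class only, as the definition above reads. Treat it as a sketch under those assumptions.

```python
import math

def nearest(x, pool, k):
    """The k (point, label) entries of pool closest to x (Euclidean stand-in)."""
    return sorted(pool, key=lambda item: math.dist(x, item[0]))[:k]

def knn_correct(i, data, k, removed):
    """True if k-NN over data minus sample i and minus `removed` recovers i's label."""
    pool = [item for j, item in enumerate(data) if j != i and j not in removed]
    votes = [label for _, label in nearest(data[i][0], pool, k)]
    return max(set(votes), key=votes.count) == data[i][1]

def odr_group(m, data, k3, k4):
    """Assign majority sample data[m] to 'Noise', 'Safe' or 'Save'."""
    # Association set C_m: majority samples whose k3 nearest neighbours contain g_m.
    associates = [
        i for i, (x, label) in enumerate(data)
        if i != m and label == "maj"
        and any(item is data[m] for item in nearest(
            x, [d for j, d in enumerate(data) if j != i], k3))
    ]
    num_p = sum(knn_correct(i, data, k4, removed=()) for i in associates)
    num_no_p = sum(knn_correct(i, data, k4, removed=(m,)) for i in associates)
    if num_p < num_no_p:
        return "Noise"
    return "Safe" if num_p == num_no_p else "Save"

data = [((0.0, 0.0), "maj"), ((0.0, 1.0), "maj"), ((1.0, 0.0), "maj"),
        ((5.0, 0.0), "maj"), ((6.0, 0.0), "maj"),
        ((7.0, 0.0), "min"), ((8.0, 0.0), "min")]
group_4 = odr_group(4, data, k3=1, k4=1)
group_0 = odr_group(0, data, k3=1, k4=1)
```

In the toy data, the sample at (6, 0) is what keeps its associate at (5, 0) correctly classified, so it is retained, while the sample at (0, 0) sits inside a dense cluster and its removal changes nothing.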
(4-2-4) Delete the samples in $S_{Noise}$ first and then the samples in $S_{Safe}$, until the non-fault-reporting sample set meets the condition; finally output the processed whole data set $G_{smote+odr}$;
Define $Y_{odr}$ as the set of deleted non-fault-reporting sample points; the deleted points are taken first from $S_{Noise}$ and then from $S_{Safe}$. The total number of deleted samples in $Y_{odr}$ is greater than or equal to $\beta N_2$, i.e. the processed non-fault-reporting sample set $\{G_{maj} - Y_{odr}\}$ contains at most $(1 - \beta) N_2$ samples;
After the Mahalanobis-distance-based ODR and BSMOTE algorithms, the whole data set $G_{smote+odr}$ is:
$$G_{smote+odr} = \{G_{maj} - Y_{odr}\} + \{G_{min} + Y_{bsmote}\}$$
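The size constraints that β places on the two sample sets can be checked with a short calculation. The function `balance_targets` and the example counts are hypothetical; the sketch simply takes the smallest sample counts that satisfy the two inequalities stated above.

```python
import math

def balance_targets(n1, n2, beta):
    """Minimum sample counts implied by the balance value beta."""
    to_delete = math.ceil(beta * n2)                   # |Y_odr| >= beta * N2
    maj_left = n2 - to_delete                          # <= (1 - beta) * N2
    to_add = max(0, math.ceil((1 - beta) * n2 - n1))   # |Y_bsmote| >= (1 - beta) * N2 - N1
    return to_delete, maj_left, to_add

targets = balance_targets(n1=100, n2=1000, beta=0.5)
```

With 100 fault-reporting users, 1000 non-fault-reporting users, and β = 0.5, at least 500 majority samples are deleted and at least 400 synthetic minority samples are added, which leaves 500 samples in each class.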
Step 5: Use the TOMEK algorithm to clean the data set $G_{smote+odr}$, obtaining the cleaned data $G_{smote+odr+tomek}$;
(5-1) Initialize the set $G_{smote+odr+tomek}$;
(5-2) Randomly draw a sample point $g_i$ from $G_{smote+odr}$ and find its nearest neighbor $g_j$ ($g_j \neq g_i$) in $G_{smote+odr}$;
(5-3) Find the nearest neighbor $g_k$ ($g_k \neq g_j$) of $g_j$ in $G_{smote+odr}$;
(5-4) Check whether $g_i = g_k$ holds; if it does, continue with (5-5); otherwise set $g_i = g_j$ and $g_j = g_k$, then jump back to step (5-3);
(5-5) Check whether the user classes (fault-reporting or non-fault-reporting) of the two mutually nearest sample points $g_i$ and $g_j$ are the same; if they are, save both sample points to the sample set $G_{smote+odr+tomek}$ and then delete them from $G_{smote+odr}$; if the classes differ, delete both sample points from $G_{smote+odr}$ directly;
(5-6) Check whether the number of samples in $G_{smote+odr}$ is an even number greater than 0; if so, repeat from step (5-2); otherwise terminate;
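Steps (5-1) to (5-6) can be sketched as below. The function `tomek_clean` is hypothetical, Euclidean distance is used for brevity, and the sketch assumes that ties between distances do not occur, since otherwise the neighbor chase in (5-4) is not guaranteed to settle on a mutual pair.

```python
import math
import random

def tomek_clean(samples, rng):
    """Pair up mutual nearest neighbours; keep same-class pairs, drop mixed ones.

    samples: list of (point, label) tuples.
    """
    work, cleaned = list(samples), []          # step (5-1)

    def nn(a):
        """Nearest neighbour of a among the samples still in the working set."""
        return min((b for b in work if b is not a),
                   key=lambda b: math.dist(a[0], b[0]))

    while len(work) > 0 and len(work) % 2 == 0:    # step (5-6)
        g_i = rng.choice(work)                     # step (5-2)
        g_j = nn(g_i)
        g_k = nn(g_j)                              # step (5-3)
        while g_k is not g_i:                      # step (5-4)
            g_i, g_j = g_j, g_k
            g_k = nn(g_j)
        if g_i[1] == g_j[1]:                       # step (5-5)
            cleaned += [g_i, g_j]
        work = [s for s in work if s is not g_i and s is not g_j]
    return cleaned

samples = [((0.0,), "min"), ((1.0,), "min"), ((10.0,), "maj"),
           ((11.5,), "min"), ((30.0,), "maj"), ((32.0,), "maj")]
cleaned = tomek_clean(samples, random.Random(0))
```

The mixed-class pair at 10.0 and 11.5 forms a Tomek link and is discarded, while the two same-class pairs survive regardless of the random draw order.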
Step 6: Feed the data in $G_{smote+odr+tomek}$ into the SVM classifier for training, adaptively adjusting the kernel width σ of the SVM classifier with a combination of coarse and fine step sizes to find the approximately optimal global point and determine the corresponding $\sigma_{optimal}$;
(6-1) Choose the Gaussian kernel as the kernel function of the SVM classifier:
$$K(g_x, \overline{g_x}) = \exp\left(-\|g_x - \overline{g_x}\|^2 / 2\sigma^2\right)$$
where $g_x \in G_{smote+odr+tomek}$, $\overline{g_x}$ is the mean of $g_x$, and σ is the Gaussian kernel width;
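The kernel in (6-1) can be evaluated with a few lines of Python. The helper names are hypothetical. The second function reflects the common convention of libraries that parameterise the same kernel as exp(-γ‖x - y‖²), for which γ = 1/(2σ²); whether a given SVM package uses that convention should be checked against its documentation.

```python
import math

def gaussian_kernel(x, y, sigma):
    """exp(-||x - y||^2 / (2 * sigma^2)) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigma_to_gamma(sigma):
    """Equivalent gamma for kernels written as exp(-gamma * ||x - y||^2)."""
    return 1.0 / (2.0 * sigma ** 2)

k_same = gaussian_kernel([0.0, 0.0], [0.0, 0.0], sigma=0.21)
k_unit = gaussian_kernel([1.0, 0.0], [0.0, 0.0], sigma=1.0)
```

A small σ makes the kernel decay quickly with distance, which is why the classification result is sensitive to the width search in steps (6-3) and (6-4).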
(6-2) Determine the model evaluation criteria, the geometric mean G-mean and the F-measure:
From the confusion matrix of the classified sample set, the fault-reporting recall Recall_Min, the fault-reporting precision Precision_Min, the non-fault-reporting recall Recall_Maj, the geometric mean G-mean, and the F-measure are expressed as follows:
$$\mathrm{Recall\_Min} = \frac{TP}{TP + FN}, \quad \mathrm{Precision\_Min} = \frac{TP}{FP + TP}, \quad \mathrm{Recall\_Maj} = \frac{TN}{TN + FP}$$
$$\mathrm{G\text{-}mean} = \sqrt{\mathrm{Recall\_Min} \times \mathrm{Recall\_Maj}}$$
$$\mathrm{F\text{-}measure} = \frac{2 \times \mathrm{Recall\_Min} \times \mathrm{Precision\_Min}}{\mathrm{Recall\_Min} + \mathrm{Precision\_Min}}$$
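The evaluation criteria of (6-2) can be computed from the four confusion-matrix counts as sketched here. The function name `imbalance_metrics` and the example counts are hypothetical, and the fault-reporting (minority) class is taken as the positive class, consistent with the definitions of Recall_Min and Recall_Maj.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Recall, precision, G-mean and F-measure with the minority class as positive."""
    recall_min = tp / (tp + fn)
    precision_min = tp / (fp + tp)
    recall_maj = tn / (tn + fp)
    g_mean = math.sqrt(recall_min * recall_maj)
    f_measure = 2 * recall_min * precision_min / (recall_min + precision_min)
    return {"recall_min": recall_min, "precision_min": precision_min,
            "recall_maj": recall_maj, "g_mean": g_mean, "f_measure": f_measure}

m = imbalance_metrics(tp=80, fn=20, fp=100, tn=900)
```

In this made-up example both recalls are high (0.8 and 0.9), yet the F-measure is only about 0.57 because 100 false positives pull the minority precision down. This is why both criteria are reported for unbalanced data.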
(6-3) Initialize the SVM classifier penalty factor, the kernel width σ, the maximum kernel width $\sigma_{max}$, and the coarse step size, then run the SVM classifier to obtain the best local point of G-mean and F-measure;
Vary σ with the coarse step size; each time a better SVM classification result is obtained, update the best local point, and stop once σ > $\sigma_{max}$. At that point the best local point has been selected;
(6-4) Starting from the left of the best local point, vary the kernel width σ with a fine adaptive step size; when G-mean and F-measure reach the approximately optimal global point, obtain the corresponding approximately optimal kernel width $\sigma_{optimal}$ and output the classification result;
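The coarse-then-fine search of (6-3) and (6-4) can be sketched as follows. Everything here is illustrative: `coarse_to_fine` is a hypothetical helper, `score` stands for "train the SVM with this σ and return G-mean or F-measure", the fine step is fixed rather than adaptive, and the toy score function merely peaks at 0.21 so that the search has something to find.

```python
def coarse_to_fine(score, sigma_start, sigma_max, coarse, fine):
    """Return (sigma, score) found on a coarse grid and refined on a fine grid."""
    def best_on_grid(lo, hi, step):
        count = int(round((hi - lo) / step))
        grid = [lo + i * step for i in range(count + 1)]
        return max(((s, score(s)) for s in grid), key=lambda pair: pair[1])

    # Coarse pass over [sigma_start, sigma_max] picks the best local point.
    sigma_c, _ = best_on_grid(sigma_start, sigma_max, coarse)
    # Fine pass starts one coarse step to the left of that point.
    lo = max(sigma_start, sigma_c - coarse)
    hi = min(sigma_max, sigma_c + coarse)
    return best_on_grid(lo, hi, fine)

def toy_score(sigma):
    """Stand-in for an SVM evaluation; not a real classifier result."""
    return 1.0 - (sigma - 0.21) ** 2

sigma_opt, best = coarse_to_fine(toy_score, 0.1, 1.0, coarse=0.1, fine=0.01)
```

The coarse pass needs only ten evaluations to land on 0.2, and the fine pass then refines within one coarse step on either side, which is cheaper than scanning the whole range at the fine resolution.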
Step 7: Input the IPTV user data to be predicted into the trained SVM detector to predict whether each user will report a fault, thereby providing early warning of fault-reporting IPTV users.
2. The user fault-reporting prediction method for an unbalanced IPTV data set according to claim 1, characterized in that, in step 1, the numeric indicators are extracted from records containing user ID, indicator, and fault-report time information.
3. The user fault-reporting prediction method for an unbalanced IPTV data set according to claim 1, characterized in that, in step 3, the balance value β takes a value in the range 0.2 ≤ β ≤ 0.5.
4. The user fault-reporting prediction method for an unbalanced IPTV data set according to claim 1, characterized in that the confusion matrix of the classified sample set described in step (6-2) is:

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610392603.XA CN106056160B (en) 2016-06-06 2016-06-06 User fault reporting prediction method under unbalanced IPTV data set

Publications (2)

Publication Number Publication Date
CN106056160A true CN106056160A (en) 2016-10-26
CN106056160B CN106056160B (en) 2022-05-17

Family

ID=57170278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610392603.XA Active CN106056160B (en) 2016-06-06 2016-06-06 User fault reporting prediction method under unbalanced IPTV data set

Country Status (1)

Country Link
CN (1) CN106056160B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070239638A1 (en) * 2006-03-20 2007-10-11 Microsoft Corporation Text classification by weighted proximal support vector machine
CN102254177A (en) * 2011-04-22 2011-11-23 哈尔滨工程大学 Bearing fault detection method for unbalanced data SVM (support vector machine)
CN102663412A (en) * 2012-02-27 2012-09-12 浙江大学 Power equipment current-carrying fault trend prediction method based on least squares support vector machine
CN103954300A (en) * 2014-04-30 2014-07-30 东南大学 Fiber optic gyroscope temperature drift error compensation method based on optimized least square-support vector machine (LS-SVM)
CN104574220A (en) * 2015-01-30 2015-04-29 国家电网公司 Power customer credit assessment method based on least square support vector machine

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
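GUSTAVO ENRIQUE BATISTA ET AL: "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data", 《ACM SIGKDD EXPLORATIONS NEWSLETTER》 *
Tao Xinmin et al.: "A Survey of Classification Algorithms for Imbalanced Data", 《重庆邮电大学学报(自然科学版)》 (Journal of Chongqing University of Posts and Telecommunications, Natural Science Edition) *
Tao Xinmin et al.: "SVM Classification Algorithm for Imbalanced Data Based on Combining ODR and BSMOTE", 《控制与决策》 (Control and Decision) *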

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180246A (en) * 2017-04-17 2017-09-19 南京邮电大学 A kind of IPTV user's report barrier data synthesis method based on mixed model
CN107392259A (en) * 2017-08-16 2017-11-24 北京京东尚科信息技术有限公司 The method and apparatus for building unbalanced sample classification model
CN107392259B (en) * 2017-08-16 2021-12-07 北京京东尚科信息技术有限公司 Method and device for constructing unbalanced sample classification model
CN112235293A (en) * 2020-10-14 2021-01-15 西北工业大学 Over-sampling method for balanced generation of positive and negative samples for malicious flow detection
CN112235293B (en) * 2020-10-14 2022-09-09 西北工业大学 Over-sampling method for balanced generation of positive and negative samples in malicious flow detection
CN112801151A (en) * 2021-01-18 2021-05-14 桂林电子科技大学 Wind power equipment fault detection method based on improved BSMOTE-Sequence algorithm

Also Published As

Publication number Publication date
CN106056160B (en) 2022-05-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant