
CN106056160A - User fault-reporting prediction method in unbalanced IPTV data set

Info

Publication number: CN106056160A
Application number: CN201610392603.XA
Authority: CN
Inventors: 周亮; 吴志峰; 黄若尘; 魏昕
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University
Priority date: 2016-06-06
Filing date: 2016-06-06
Publication date: 2016-10-26
Anticipated expiration: 2036-06-06
Also published as: CN106056160B

Abstract

The invention discloses a user fault-reporting prediction method for an unbalanced IPTV data set. The method mainly comprises the steps of: (1) importing IPTV user viewing records and extracting the numerical indexes; (2) averaging the viewing records of each user; (3) initializing a balance value β; (4) deleting non-fault-reporting samples and adding synthetic fault-reporting samples using ODR and BSMOTE algorithms based on the Mahalanobis distance; (5) deleting newly added samples that have a negative impact on classification using the TOMEK algorithm; (6) training an SVM classifier with an adaptively varying kernel width on the reconstructed sample data set; and (7) inputting the IPTV user data to be predicted into the trained SVM detector. Because the improved BSMOTE and ODR algorithms are based on the Mahalanobis distance, the information overlap caused by multicollinearity among variables is avoided, the algorithms are unaffected by the different dimensions (units) of the sample attributes, a better data-reconstruction effect is obtained, the interference of noise points and redundant points with fault-reporting prediction is weakened, and the prediction accuracy of the classifier is significantly increased.

Description

User fault-reporting prediction method in an unbalanced IPTV data set

Technical field

The invention belongs to the technical field of IPTV data analysis and processing, and specifically relates to a user fault-reporting prediction method in an unbalanced IPTV data set.

Background technology

With the rapid development of multimedia communication technology, Internet Protocol Television (IPTV) based on broadband internet has made it far more convenient for ordinary residents to enjoy interactive, personalized and freely customized video services and value-added application services at home. During video transmission, however, when traditional network quality of service (QoS) indicators such as bandwidth, packet loss, delay and jitter deteriorate, the user's viewing experience is affected to some extent, which in turn leads users to complain and report faults. The users who report faults make up only a very small proportion of all users, so the user data inevitably form an unbalanced data set, and as IPTV technology matures the imbalance ratio will keep increasing.

Predicting whether a user will report a fault is a typical binary classification problem. Mature algorithms conventionally used for this problem include the support vector machine (SVM), but the classification performance of an SVM degrades as the degree of data imbalance increases. The unbalanced data therefore need to be converted into a balanced data set by a data-level algorithm before being classified by the SVM classifier. Traditional data-level algorithms often process the data with the oversampling algorithm BSMOTE (Borderline-Synthetic Minority Oversampling Technique) based on the Euclidean distance, or with the undersampling algorithm ODR (Optimization of Decreasing Reduction) based on the Euclidean distance. Although these algorithms can improve prediction accuracy, they cannot avoid the information overlap caused by multicollinearity among variables, nor can they guarantee that the generated synthetic sample points are not noise points.

The Chinese invention patent with grant number CN102254177B, entitled "An unbalanced-data SVM bearing fault detection method", provides an SVM bearing fault detection method for unbalanced data. Its shortcomings are: (1) the algorithm uses the Euclidean distance, which is easily affected by the different dimensions of the sample attributes; (2) it lacks an effective method for removing the impure synthetic sample points generated by the BSMOTE algorithm; and (3) the advantage of the SVM kernel width in improving classification accuracy is not fully exploited.

Summary of the invention

The object of the invention is to remedy the defects of the prior art, namely that the algorithms are easily affected by the different dimensions of the sample attributes, that there is no effective method for removing impure synthetic sample points, and that the advantage of the SVM kernel width in improving classification accuracy is not fully exploited.

To this end, the invention proposes a user fault-reporting prediction method in an unbalanced IPTV data set. The method comprises the following steps:

Step 1: import the IPTV user viewing records. Each record contains information such as the user id, the indexes and the fault-reporting time; the method of the invention extracts only the numerical indexes, whose variables are denoted z.

Let the total number of imported IPTV users be N and the total number of records be D, of which N₁ users report faults and N₂ users do not, and let the n-th user have D_n (n = 1, ..., N) records. The numerical indexes have dimension Q and are denoted by the variables z₁, z₂, ..., z_Q. Each index z_q takes a value in a range whose expression is not reproduced here.

Step 2: obtain the averaged record g_n (n = 1, ..., N) of each user, as follows:

Compute the mean of each of the Q indexes of the n-th user:

\bar{z}_{nq} = \frac{1}{D_n} \sum_{d=1}^{D_n} z_{dq}, \quad (n = 1, \ldots, N;\ q = 1, \ldots, Q)

After this preprocessing each user retains only one record, g_n, formed from the Q averaged indexes. Let G_min be the data set formed by the N₁ minority-class fault-reporting users and G_maj the data set formed by the N₂ majority-class non-fault-reporting users; the data set of all users is G = G_min ∪ G_maj.
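The averaging in Step 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `average_user_records` and the `(user_id, values)` record layout are assumptions made for the example.

```python
from collections import defaultdict


def average_user_records(records):
    """Collapse all viewing records of a user into one averaged record.

    records: iterable of (user_id, [z_1, ..., z_Q]) pairs (hypothetical layout).
    Returns {user_id: [mean of z_1, ..., mean of z_Q]}.
    """
    sums = {}
    counts = defaultdict(int)
    for user_id, values in records:
        if user_id not in sums:
            sums[user_id] = [0.0] * len(values)
        for q, value in enumerate(values):
            sums[user_id][q] += value
        counts[user_id] += 1  # D_n, the number of records of this user
    return {u: [s / counts[u] for s in total] for u, total in sums.items()}


# Two records for user "u1", one for user "u2", with Q = 2 indexes.
records = [("u1", [1.0, 10.0]), ("u1", [3.0, 30.0]), ("u2", [5.0, 50.0])]
averaged = average_user_records(records)
```

Each user ends up with exactly one Q-dimensional record, which is the g_n used in the later steps.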

Step 3: initialize the balance value β of the ODR algorithm based on the Mahalanobis distance.

Determine the balance value β. If β is too small, the reduction of the majority class has little effect; conversely, if β is too large, valuable majority-class samples are likely to be deleted by mistake. Its value range is 0.2 ≤ β ≤ 0.5.

Step 4: use the BSMOTE algorithm based on the Mahalanobis distance to add the synthetic fault-reporting user sample set Y_bsmote and determine the balance value α of the BSMOTE algorithm. Then use the ODR algorithm based on the Mahalanobis distance to remove the non-fault-reporting user sample set Y_odr, yielding the balanced data set G_smote+odr.

(4-1) Use BSMOTE based on the Mahalanobis distance to determine the added synthetic fault-reporting user sample set Y_bsmote:

(4-1-1) Compute the Mahalanobis distance d(g_i, g_j) between each fault-reporting user record g_i ∈ G_min and every other user record g_j ∈ G (g_j ≠ g_i):

d(g_i, g_j) = \sqrt{(g_i - g_j)^{T} \Sigma^{-1} (g_i - g_j)}, \quad (g_i \neq g_j)

where Σ is the covariance matrix of the total user data set G and Σ^{-1} is its inverse.
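As a rough illustration of step (4-1-1), the Mahalanobis distance can be computed in plain Python as sketched below. The helper names `covariance_matrix`, `solve` and `mahalanobis` are hypothetical, and the sample covariance (divisor n - 1) is an assumption, since the text does not say which estimator is used.

```python
import math


def covariance_matrix(samples):
    """Sample covariance matrix (divisor n - 1, an assumption) of row vectors."""
    n, q = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(q)]
    return [[sum((s[a] - means[a]) * (s[b] - means[b]) for s in samples) / (n - 1)
             for b in range(q)] for a in range(q)]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mahalanobis(g_i, g_j, cov):
    """d(g_i, g_j) = sqrt((g_i - g_j)^T cov^{-1} (g_i - g_j))."""
    diff = [a - b for a, b in zip(g_i, g_j)]
    weighted = solve(cov, diff)  # cov^{-1} (g_i - g_j) without forming the inverse
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


cov = [[4.0, 0.0], [0.0, 1.0]]          # first attribute has variance 4
distance = mahalanobis([2.0, 0.0], [0.0, 0.0], cov)
```

With this covariance the two points are 2 units apart in Euclidean terms but 1 unit apart in Mahalanobis terms, which is the scale-independence the text relies on.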

(4-1-2) Based on d(g_i, g_j), use the K-nearest-neighbor (K-NN) algorithm to determine the nearest-neighbor sample set G_n-KNN of the n-th (n = 1, ..., N₁) fault-reporting user, and determine the sample set to which that user belongs.

Choose an odd value K₁ for the K-NN algorithm and count how many of the fault-reporting user's nearest neighbors are non-fault-reporting users.

If the Border condition is met (its expression is not reproduced here), the fault-reporting user sample is assigned to the Border sample set G_Border.

If G_n-KNN ∩ G_maj = ∅, the fault-reporting user sample is assigned to the Safe sample set G_Safe.

If |G_n-KNN ∩ G_maj| = K₁, the fault-reporting user sample is assigned to the Noise sample set G_Noise.

Here G_n-KNN denotes the set of K₁ sample points nearest to the n-th fault-reporting user sample point.
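The three-way split of step (4-1-2) might look like the sketch below. Two things are assumptions: the Border rule, whose expression is missing from the text, is taken here to be "neither Safe nor Noise", and a one-dimensional absolute difference stands in for the Mahalanobis distance so the example stays short. All function names are hypothetical.

```python
def k_nearest(index, samples, k, distance):
    """Indices of the k samples closest to samples[index] (ties broken by index)."""
    others = [i for i in range(len(samples)) if i != index]
    others.sort(key=lambda i: (distance(samples[index], samples[i]), i))
    return others[:k]


def classify_minority_sample(index, samples, labels, k1, distance):
    """Assign a fault-reporting ("min") sample to the safe, noise or border set."""
    neighbours = k_nearest(index, samples, k1, distance)
    majority_count = sum(1 for i in neighbours if labels[i] == "maj")
    if majority_count == 0:
        return "safe"      # no non-fault-reporting neighbour
    if majority_count == k1:
        return "noise"     # every neighbour is non-fault-reporting
    return "border"        # ASSUMPTION: everything in between counts as border


def abs_distance(a, b):
    """Stand-in distance for the demo; the method itself uses Mahalanobis."""
    return abs(a[0] - b[0])


samples = [[0.0], [0.1], [0.2], [0.3], [1.0], [1.1], [5.0], [5.1], [5.2], [5.3]]
labels = ["min", "min", "min", "min", "min", "maj", "maj", "min", "maj", "maj"]
groups = {i: classify_minority_sample(i, samples, labels, 3, abs_distance)
          for i in (0, 4, 7)}
```

Sample 0 sits among other fault-reporting users, sample 7 is surrounded by non-fault-reporting users, and sample 4 has a mixed neighbourhood.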

(4-1-3) For each g_p (g_p ∈ G_Border), find its K₂ random nearest-neighbor samples g_pk in G_min and compute the attribute difference h_pk between g_p and g_pk.

For each fault-reporting user sample g_p in G_Border = {g₁, ..., g_p, ..., g_P}, where P is the total number of samples in the Border sample set, find its K₂ random nearest neighbors g_pk (k = 1, ..., K₂) in the sample set G_min. Compute the difference h_pk of all attributes between the sample g_pk and the fault-reporting user sample g_p:

h_{pk} = g_p - g_{pk}, \quad (p = 1, \ldots, P;\ k = 1, \ldots, K_2)

(4-1-4) Generate a synthetic fault-reporting sample set Y_p for every g_p (g_p ∈ G_Border).

If g_pk ∈ G_Noise or g_pk ∈ G_Safe, h_pk is multiplied by a random number r_pk ∈ (0, 0.5). If g_pk ∈ G_Border, h_pk is multiplied by a random number r_pk ∈ (0, 1). The synthetic sample y_pk generated for each g_p is then:

y_{pk} = g_k + |r_{pk} \times h_{pk}|, \quad (p = 1, \ldots, P;\ k = 1, \ldots, K_2)

The synthetic fault-reporting user sample set finally generated is Y_bsmote = {Y₁, ..., Y_P}.

(4-1-5) Repeat steps (4-1-3) and (4-1-4) to compute the new sample set Y_p (p = 1, ..., P) of each fault-reporting user in G_Border, and determine the balance value α of the BSMOTE algorithm, until the total number of new fault-reporting samples contained in the generated Y_bsmote = {Y₁, ..., Y_P} is greater than or equal to (1-β)N₂ - N₁.

The balance value α is the smallest integer greater than or equal to a quantity whose expression is not reproduced here.
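The generation rule of step (4-1-4) can be sketched as below. It is only a sketch: the term g_k in the formula is read as the neighbour g_pk, which is an assumption, and the absolute value is kept exactly as the formula is printed. The function name `synthesize` is hypothetical.

```python
import random


def synthesize(g_p, g_pk, neighbour_is_border, rng):
    """Generate one synthetic fault-reporting sample from g_p and a neighbour g_pk."""
    h_pk = [a - b for a, b in zip(g_p, g_pk)]        # h_pk = g_p - g_pk
    upper = 1.0 if neighbour_is_border else 0.5      # (0, 1) for Border, else (0, 0.5)
    r_pk = rng.uniform(0.0, upper)                   # endpoints are practically never hit
    # y_pk = g_k + |r_pk * h_pk|, with g_k read as g_pk (ASSUMPTION)
    return [b + abs(r_pk * h) for b, h in zip(g_pk, h_pk)]


rng = random.Random(0)                               # fixed seed for repeatability
y = synthesize([2.0, 4.0], [1.0, 1.0], False, rng)
```

Because the neighbour here is treated as Safe or Noise, the synthetic point moves at most halfway from g_pk towards g_p.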

(4-2) Use ODR based on the Mahalanobis distance to determine the removed non-fault-reporting user sample set Y_odr:

(4-2-1) Compute the Mahalanobis distance d(g_m, g_l) between each non-fault-reporting user record g_m (g_m ∈ G_maj) and every other user record g_l (g_l ∈ G; g_l ≠ g_m).

(4-2-2) Based on d(g_m, g_l), compute the association set C_m of each sample g_m in G_maj.

The association set C_m is defined as the set of samples in G_maj, other than g_m, whose K₃ nearest neighbors contain g_m. Let g_mn denote a sample point g_n (g_n ∈ G_maj) whose K₃ nearest neighbors include g_m; the sample set C_m formed by these g_mn is the association set of the sample point g_m.

(4-2-3) Classify g_m according to how its presence or absence affects the accuracy of the K₄-NN algorithm on g_mn (g_mn ∈ C_m).

Choose an odd value K₄. Compute Num_p, the number of g_mn (g_mn ∈ C_m) that the K₄-NN algorithm classifies correctly when g_m is present. Then compute Num_no-p, the number of g_mn (g_mn ∈ C_m) that K₄-NN classifies correctly when g_m is absent. Compare Num_p and Num_no-p and classify g_m according to the following criteria:

When Num_p ≤ Num_no-p, g_m has a negative effect and is assigned to the Noise sample set S_Noise.

When Num_p = Num_no-p, g_m is not essential and is assigned to the Safe sample set S_Safe.

When Num_p ≥ Num_no-p, g_m is useful and is assigned to the Save sample set S_Save.

(4-2-4) Delete samples from S_Noise first and then from S_Safe until the non-fault-reporting sample set meets the condition, and finally output the processed complete data set G_smote+odr.

Define Y_odr as the set of deleted non-fault-reporting sample points; the deleted sample points are taken preferentially from S_Noise and then from S_Safe. The total number of deleted samples in Y_odr is greater than or equal to βN₂, i.e. the total number of samples in the processed non-fault-reporting set {G_maj - Y_odr} is less than or equal to (1-β)N₂.
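Steps (4-2-3) and (4-2-4) can be sketched as follows. Since the printed criteria overlap when Num_p equals Num_no-p, this sketch assumes that ties go to the Safe set and that Noise and Save use strict inequalities. The function names and the category strings are hypothetical.

```python
import math


def classify_majority_sample(num_p, num_no_p):
    """Label a non-fault-reporting sample from the two K4-NN accuracy counts."""
    if num_p < num_no_p:
        return "noise"   # neighbours are classified better without g_m
    if num_p == num_no_p:
        return "safe"    # ASSUMPTION: ties are treated as not essential
    return "save"        # neighbours are classified better with g_m


def select_deletions(categories, beta):
    """Indices to delete: Noise first, then Safe, up to ceil(beta * N2) samples.

    Fewer indices are returned if Noise and Safe together are too small.
    """
    target = math.ceil(beta * len(categories))
    noise = [i for i, c in enumerate(categories) if c == "noise"]
    safe = [i for i, c in enumerate(categories) if c == "safe"]
    return (noise + safe)[:target]


counts = [(3, 2), (1, 4), (2, 2), (5, 5)]            # (Num_p, Num_no-p) per sample
categories = [classify_majority_sample(p, q) for p, q in counts]
deleted = select_deletions(categories, 0.5)
```

Samples in the Save set are never selected, which matches the deletion priority described in the text.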

After the Mahalanobis-distance-based ODR and BSMOTE algorithms, the complete data set G_smote+odr is:

G_smote+odr = {G_maj - Y_odr} + {G_min + Y_bsmote}

Step 5: use the TOMEK algorithm to clean the data set G_smote+odr, obtaining the cleaned data G_{smote+odr+tomek}.

(5-1) Initialize the set G_{smote+odr+tomek}.

(5-2) Randomly draw a sample point g_i from G_smote+odr and find its nearest neighbor g_j (g_j ≠ g_i) in G_smote+odr.

(5-3) Find the nearest neighbor g_k (g_k ≠ g_j) of g_j in G_smote+odr.

(5-4) Check whether g_i == g_k holds. If it does, continue with (5-5); otherwise set g_i = g_j, g_j = g_k and jump to step (5-3).

(5-5) Check whether the user classes (fault-reporting or non-fault-reporting) corresponding to g_i and g_k are the same. If they are, save the two sample points to the sample set G_{smote+odr+tomek} and then delete them from G_smote+odr. If the classes differ, delete the two sample points directly from G_smote+odr.

(5-6) Check whether the number of samples in G_smote+odr is an even number greater than 0. If so, repeat from step (5-2); otherwise terminate.
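A possible reading of the cleaning loop in steps (5-1) to (5-6) is sketched below. Several points are assumptions: the pair compared in (5-5) is taken to be the mutual nearest neighbours (g_i, g_j), since g_i equals g_k at that stage; the random draw of (5-2) is replaced by the lowest remaining index so the result is repeatable; the loop simply runs while at least two points remain; and a one-dimensional distance stands in for the real one. The names `nearest` and `tomek_clean` are hypothetical.

```python
def nearest(index, points, active, distance):
    """Nearest remaining neighbour of points[index]; ties broken by lowest index."""
    candidates = [i for i in active if i != index]
    return min(candidates, key=lambda i: (distance(points[index], points[i]), i))


def tomek_clean(points, labels, distance):
    """Return the sorted indices of the samples kept after cleaning."""
    active = set(range(len(points)))      # samples still in G_smote+odr
    kept = []                             # (5-1) start with an empty cleaned set
    while len(active) >= 2:
        g_i = min(active)                 # (5-2) deterministic stand-in for a random draw
        g_j = nearest(g_i, points, active, distance)
        g_k = nearest(g_j, points, active, distance)      # (5-3)
        while g_k != g_i:                 # (5-4) walk on until the pair is mutual
            g_i, g_j = g_j, g_k
            g_k = nearest(g_j, points, active, distance)
        if labels[g_i] == labels[g_j]:    # (5-5) same class: keep both points
            kept.extend([g_i, g_j])
        active -= {g_i, g_j}              # either way both leave G_smote+odr
    return sorted(kept)


def abs_distance(a, b):
    """Stand-in distance for the demo."""
    return abs(a[0] - b[0])


points = [[0.0], [0.1], [1.0], [1.1], [5.0], [5.2]]
labels = ["maj", "maj", "maj", "min", "min", "min"]
kept = tomek_clean(points, labels, abs_distance)
```

The pair at 1.0 and 1.1 straddles the class boundary and is dropped, while the two same-class pairs survive.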

Step 6: train an SVM classifier on the data in G_{smote+odr+tomek}, adaptively adjust the kernel width σ of the SVM classifier by combining coarse and fine step sizes, find the approximately optimal global point, and determine the corresponding σ_optimal.

(6-1) Choose the Gaussian kernel as the kernel function of the SVM classifier:

K(g_x, \bar{g}_x) = \exp\left(- \| g_x - \bar{g}_x \|^{2} / (2 \sigma^{2})\right)

where g_x ∈ G_{smote+odr+tomek}, \bar{g}_x is the mean of g_x, and σ is the Gaussian kernel width.
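For reference, the Gaussian kernel of step (6-1) can be evaluated as in the short sketch below. The function name `gaussian_kernel` and its two-vector signature are illustrative choices, not taken from the text.

```python
import math


def gaussian_kernel(x, y, sigma):
    """exp(-||x - y||^2 / (2 * sigma^2)) for two equal-length vectors."""
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_norm / (2.0 * sigma ** 2))


same_point = gaussian_kernel([1.0, 2.0], [1.0, 2.0], 0.21)
unit_apart = gaussian_kernel([0.0, 0.0], [1.0, 0.0], 1.0)
narrow = gaussian_kernel([0.0, 0.0], [1.0, 0.0], 0.5)
```

A smaller σ makes the kernel value fall off faster with distance, which is why the width has such a strong effect on the classifier.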

(6-2) Define the model evaluation criteria, the geometric mean G-mean and the F-measure:

Based on the confusion matrix of the classified sample set, the fault-reporting recall Recall_Min, the fault-reporting precision Precision_Min, the non-fault-reporting recall Recall_Maj, the geometric mean G-mean and the F-measure are expressed as follows:

Recall_{Min} = \frac{TP}{TP + FN}, \quad Precision_{Min} = \frac{TP}{FP + TP}, \quad Recall_{Maj} = \frac{TN}{TN + FP}

G\text{-}mean = \sqrt{Recall_{Min} \times Recall_{Maj}}

F\text{-}measure = \frac{2 \times Recall_{Min} \times Precision_{Min}}{Recall_{Min} + Precision_{Min}}

G-mean rewards accuracy on both classes while keeping the classification accuracy for fault-reporting and non-fault-reporting users balanced: it is large only when Recall_Min and Recall_Maj are both high. The F-measure is a classification index that takes both recall and precision into account. It reflects the classifier's overall performance on fault-reporting and non-fault-reporting users, but places more weight on the classification of fault-reporting samples.
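The evaluation criteria of step (6-2) can be computed from the four confusion-matrix counts as sketched below. The sketch assumes that the positive class is the fault-reporting (minority) class, which is what the Recall_Min formula implies; the function name and the example counts are made up for illustration.

```python
import math


def fault_metrics(tp, fn, fp, tn):
    """Recall, precision, G-mean and F-measure from confusion-matrix counts.

    Positive class = fault-reporting users (an assumption based on Recall_Min).
    """
    recall_min = tp / (tp + fn)
    precision_min = tp / (fp + tp)
    recall_maj = tn / (tn + fp)
    g_mean = math.sqrt(recall_min * recall_maj)
    f_measure = 2 * recall_min * precision_min / (recall_min + precision_min)
    return {
        "recall_min": recall_min,
        "precision_min": precision_min,
        "recall_maj": recall_maj,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }


# Illustrative counts only; they are not results from the patent.
metrics = fault_metrics(tp=90, fn=10, fp=30, tn=870)
```

In this toy case the majority recall is high, yet the F-measure is pulled down by the 30 false alarms, showing how it weights the fault-reporting class.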

(6-3) Initialize the SVM classifier penalty factor, the kernel width σ, the maximum kernel width σ_max and the coarse step size, then run the SVM classifier to obtain the best local point of G-mean and F-measure.

Vary σ with the coarse step size; each time a better SVM classification result is obtained, update the best local point, until σ > σ_max, at which point the search ends and the final best local point is selected.

(6-4) Starting from the left side of the best local point, vary the kernel width σ with the fine adaptive step size; when G-mean and F-measure reach the approximately optimal global point, obtain the corresponding approximately optimal kernel width σ_optimal and output the classification result.
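The coarse-then-fine search of steps (6-3) and (6-4) can be sketched as below. This is a simplified stand-in: `score` is a hypothetical callable that would train the SVM at a given σ and return a quality value such as G-mean, and the fine pass is assumed to scan from one coarse step left of the best coarse point to one coarse step right of it with a fixed fine step.

```python
def coarse_to_fine_search(score, sigma_start, sigma_max, coarse_step, fine_step):
    """Return (best_sigma, best_score) after a coarse pass and a fine pass."""
    best_sigma, best_score = sigma_start, score(sigma_start)
    coarse_count = int(round((sigma_max - sigma_start) / coarse_step))
    for i in range(1, coarse_count + 1):               # coarse pass up to sigma_max
        sigma = round(sigma_start + i * coarse_step, 10)
        value = score(sigma)
        if value > best_score:
            best_sigma, best_score = sigma, value
    # Fine pass, starting on the left side of the best coarse point (ASSUMED range).
    left = max(best_sigma - coarse_step, fine_step)
    fine_count = int(round(2 * coarse_step / fine_step))
    for i in range(fine_count + 1):
        sigma = round(left + i * fine_step, 10)
        if sigma > sigma_max:
            break
        value = score(sigma)
        if value > best_score:
            best_sigma, best_score = sigma, value
    return best_sigma, best_score


def toy_score(sigma):
    """Toy quality curve peaking at 0.21; a real run would train and score an SVM."""
    return 1.0 - (sigma - 0.21) ** 2


best_sigma, best_score = coarse_to_fine_search(toy_score, 0.1, 2.0, 0.1, 0.01)
```

With the toy curve the coarse pass settles on 0.2 and the fine pass refines it to 0.21, using far fewer evaluations than a fine scan of the whole range.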

Step 7: input the IPTV user data to be predicted into the trained SVM detector to predict whether the user will report a fault, thereby providing early warning of IPTV fault-reporting users.

Further, in step 1 above, the numerical indexes are the numerical indexes extracted from records containing the user id, index and fault-reporting time information.

Further, in step 3, the value range of the balance value β is 0.2 ≤ β ≤ 0.5.

Further, the confusion matrix of the classified sample set in step 6-2 is:

Compared with the prior art, the beneficial effects of the invention are as follows:

The improved BSMOTE and ODR algorithms used in the invention are based on the Mahalanobis distance, which not only avoids the information overlap caused by multicollinearity among variables but is also unaffected by the different dimensions of the sample attributes, so a better sample-data reconstruction effect is obtained.

The BSMOTE and ODR algorithms, together with the TOMEK data-cleaning algorithm, on the one hand weaken the interference of noise points and redundant points with fault-reporting prediction, and on the other hand strengthen the contribution of the few effective minority samples to correct classification. They also remove the impure points generated by the BSMOTE algorithm that are hard to distinguish on the SVM classification boundary, greatly improving the prediction accuracy of the classifier.

The algorithm that combines coarse and fine step sizes to adaptively adjust the SVM kernel width σ significantly improves prediction accuracy at the cost of only a very small loss of precision in σ, while also ensuring that the algorithm runs efficiently.

Brief description of the drawings

Fig. 1 is a flow chart of the user fault-reporting prediction method in an unbalanced IPTV data set according to the invention.

Fig. 2 is a flow chart of the SVM with adaptively varying kernel width used in the invention.

Fig. 3 is a schematic diagram of the fault-reporting prediction results of the standard SVM and the traditional ODR-BSMOTE-SVM in the embodiment of the invention.

Fig. 4 is a schematic diagram of the fault-reporting prediction results of the improved algorithm in the embodiment of the invention.

Detailed description of the invention

The invention is described in further detail below with reference to the accompanying drawings.

To better illustrate the user fault-reporting prediction method in an unbalanced IPTV data set, it is applied here to early warning of IPTV fault reports. The training and test data used in the invention come from the IPTV users of Jiangsu Telecom across the whole province: 439050 users with 4723101 viewing records, of which 4871 are fault-reporting users with 48172 viewing records and 434179 are non-fault-reporting users with 4674929 viewing records. The imbalance ratio of the minority class to the majority class is as high as 1:89. The numerical index dimension of each user viewing record is 10. In this example K₁ = K₃ = K₄ = 5, K₂ = 3, the initial balance value β = 0.3 and the penalty factor = 1000.

Fault-reporting user prediction proceeds according to the flow given in the summary of the invention (as shown in Fig. 1).

In this example the total number of imported IPTV users is N = 439050 and the total number of records is D = 4723101, of which N₁ = 4871 users report faults and N₂ = 434179 users do not. The n-th user has D_n (n = 1, ..., 439050) records. The numerical index dimension is Q = 10, with the numerical index variables denoted z₁, z₂, ..., z₁₀. Each index z_q takes a value in a range whose expression is not reproduced here.

Compute the mean of each of the Q indexes of the n-th user:

\bar{z}_{nq} = \frac{1}{D_n} \sum_{d=1}^{D_n} z_{dq}, \quad (n = 1, \ldots, N;\ q = 1, \ldots, Q)

After preprocessing each user retains only one record g_n. Let G_min be the data set formed by the N₁ = 4871 minority-class fault-reporting users and G_maj the data set formed by the N₂ = 434179 majority-class non-fault-reporting users; the data set of all users is G = G_min ∪ G_maj.

Step 3: initialize the balance value β = 0.3 of the ODR algorithm based on the Mahalanobis distance.

d(g_i, g_j) = \sqrt{(g_i - g_j)^{T} \Sigma^{-1} (g_i - g_j)}, \quad (g_i \neq g_j)

where Σ is the covariance matrix of the total user data set G and Σ^{-1} is its inverse.

Choose the odd value K₁ = 5 for the K-NN algorithm and count how many of the fault-reporting user's nearest neighbors are non-fault-reporting users.

If the Border condition is met (its expression is not reproduced here), the fault-reporting user sample is assigned to the Border sample set G_Border.

If G_n-KNN ∩ G_maj = ∅, the fault-reporting user sample is assigned to the Safe sample set G_Safe.

If |G_n-KNN ∩ G_maj| = 5, the fault-reporting user sample is assigned to the Noise sample set G_Noise.

(4-1-3) For each g_p (g_p ∈ G_Border), find its K₂ = 3 random nearest-neighbor samples g_pk in G_min and compute the attribute difference h_pk between g_p and g_pk.

For each fault-reporting user sample g_p in G_Border = {g₁, ..., g_p, ..., g_P}, find its K₂ random nearest neighbors g_pk in the sample set G_min. Compute the difference h_pk of all attributes between the sample g_pk and the fault-reporting user sample g_p:

h_{pk} = g_p - g_{pk}, \quad (p = 1, \ldots, P;\ k = 1, \ldots, K_2)

y_{pk} = g_k + |r_{pk} \times h_{pk}|, \quad (p = 1, \ldots, P;\ k = 1, \ldots, K_2)

The synthetic fault-reporting user sample set finally generated is Y_bsmote = {Y₁, ..., Y_P}.

(4-1-5) Repeat steps (4-1-3) and (4-1-4) to compute the new sample set Y_p (p = 1, ..., P) of each fault-reporting user in G_Border, and determine the balance value α of the BSMOTE algorithm, until the total number of new fault-reporting samples contained in the generated Y_bsmote = {Y₁, ..., Y_P} is greater than or equal to (1-β)N₂ - N₁ = 299054.

(4-2-2) Based on d(g_m, g_l), compute the association set C_m of each sample g_m in G_maj.

Choose the odd value K₄ = 5. Compute Num_p, the number of g_mn (g_mn ∈ C_m) that the K₄-NN algorithm classifies correctly when g_m is present. Then compute Num_no-p, the number of g_mn (g_mn ∈ C_m) that K₄-NN classifies correctly when g_m is absent. Compare Num_p and Num_no-p and classify g_m according to the criteria of step (4-2-3).

(4-2-4) Delete samples from S_Noise first and then from S_Safe until the non-fault-reporting sample set meets the condition, and finally output the processed complete data set G_smote+odr.

Define Y_odr as the set of deleted non-fault-reporting sample points; the deleted sample points are taken preferentially from S_Noise and then from S_Safe. The total number of deleted samples in Y_odr is greater than or equal to βN₂ = 130254, i.e. the total number of samples in the processed non-fault-reporting set {G_maj - Y_odr} is less than or equal to (1-β)N₂ = 303925.
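The sample counts quoted in this embodiment can be checked with a few lines of arithmetic. The variable names are illustrative; β is held as a percentage so that everything stays in exact integers.

```python
# Figures quoted in the embodiment.
n_min = 4871          # N1, fault-reporting users
n_maj = 434179        # N2, non-fault-reporting users
beta_percent = 30     # beta = 0.3

# (1 - beta) * N2, rounded down as in the text
max_majority_kept = (100 - beta_percent) * n_maj // 100
# beta * N2, rounded up (ceiling division on integers)
min_majority_deleted = -(-beta_percent * n_maj // 100)
# (1 - beta) * N2 - N1, the minimum number of synthetic samples
min_synthetic = max_majority_kept - n_min
# Majority users per minority user, rounded to the nearest integer
imbalance = round(n_maj / n_min)
```

After both algorithms the two classes are of comparable size: at most 303925 non-fault-reporting samples against 4871 real plus at least 299054 synthetic fault-reporting samples.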

G_smote+odr = {G_maj - Y_odr} + {G_min + Y_bsmote}

(5-1) Initialize the set G_{smote+odr+tomek}.

(6-1) Choose the Gaussian kernel as the kernel function of the SVM classifier:

K(g_x, \bar{g}_x) = \exp\left(- \| g_x - \bar{g}_x \|^{2} / (2 \sigma^{2})\right)

(6-2) Define the model evaluation criteria, the geometric mean G-mean and the F-measure.

Initialize the SVM classifier penalty factor = 1000, the kernel width σ = 0.1, σ_max = 2 and the coarse step size 0.1. Vary σ with the coarse step size; each time a better SVM classification result is obtained, update the best local point, until σ > 2, and obtain the best local point of G-mean and F-measure.

As shown in the flow chart of Fig. 2, with the fine step size set to 0.01, σ is varied starting from the left side of the best local point σ = 0.2 obtained by the method of the invention, finally yielding the approximately optimal global point and the corresponding σ_optimal = 0.21.

Performance evaluation

Comparing the results obtained with the prediction method of the invention against the correct class labels makes it possible to evaluate the effectiveness and accuracy of the method. Panels (a) and (b) of Fig. 3 show that the standard SVM algorithm obtains its best result near kernel width σ = 0.3. At that point the recall rates for fault-reporting and non-fault-reporting users are both about 65%, but the values of G-mean and F-measure are very low, both below 0.1. Panels (c) and (d) of Fig. 3 show that the classification performance of the traditional ODR-BSMOTE-SVM algorithm improves on the standard SVM, and that it obtains relatively good G-mean and F-measure values for Gaussian kernel widths σ below 0.2. Panels (a) and (b) of Fig. 4 show that the classification performance of the method of the invention is clearly better than that of the first two algorithms, and that it obtains good G-mean and F-measure values for Gaussian kernel widths σ below 0.2. Panels (a) and (b) of Fig. 4 also show that, after the fine-step search, the method of the invention reaches approximately optimal classification performance at kernel width σ = 0.21. The fault-reporting recall rates measured for the standard SVM, the traditional ODR-BSMOTE-SVM and the method of the invention are 64.0%, 71.7% and 92.6% respectively, and the non-fault-reporting recall rates are 69.04%, 71.78% and 93.08% respectively. The method of the invention therefore achieves better prediction performance.

It should be noted that the above is not intended to limit the invention; any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A user fault-reporting prediction method in an unbalanced IPTV data set, characterized by comprising the following steps:

Step 1: import the IPTV user viewing records and extract the numerical indexes, whose variables are denoted z;

Let the total number of imported IPTV users be N and the total number of records be D, of which N₁ users report faults and N₂ users do not, and let the n-th user have D_n (n = 1, ..., N) records; the numerical indexes have dimension Q and are denoted by the variables z₁, z₂, ..., z_Q; each index z_q takes a value in a range whose expression is not reproduced here;

Compute the mean of each of the Q indexes of the n-th user:

\bar{z}_{nq} = \frac{1}{D_n} \sum_{d=1}^{D_n} z_{dq}, \quad (n = 1, \ldots, N;\ q = 1, \ldots, Q)

After preprocessing each user retains only one record g_n; let G_min be the data set formed by the N₁ minority-class fault-reporting users and G_maj the data set formed by the N₂ majority-class non-fault-reporting users; the data set of all users is G = G_min ∪ G_maj;

Step 4: use the BSMOTE algorithm based on the Mahalanobis distance to add the synthetic fault-reporting user sample set Y_bsmote and determine the balance value α of the BSMOTE algorithm; then use the ODR algorithm based on the Mahalanobis distance to remove the non-fault-reporting user sample set Y_odr, yielding the balanced data set G_smote+odr;

(4-1-1) Compute the Mahalanobis distance d(g_i, g_j) between each fault-reporting user record g_i ∈ G_min and every other user record g_j ∈ G (g_j ≠ g_i);

d(g_i, g_j) = \sqrt{(g_i - g_j)^{T} \Sigma^{-1} (g_i - g_j)}, \quad (g_i \neq g_j)

where Σ is the covariance matrix of the total user data set G and Σ^{-1} is its inverse;

(4-1-2) Based on d(g_i, g_j), use the K-NN algorithm to determine the nearest-neighbor sample set G_n-KNN of the n-th (n = 1, ..., N₁) fault-reporting user, and determine the sample set to which that user belongs;

Choose an odd value K₁ for the K-NN algorithm and count how many of the fault-reporting user's nearest neighbors are non-fault-reporting users;

If the Border condition is met (its expression is not reproduced here), the fault-reporting user sample is assigned to the Border sample set G_Border;

If G_n-KNN ∩ G_maj = ∅, the fault-reporting user sample is assigned to the Safe sample set G_Safe;

If |G_n-KNN ∩ G_maj| = K₁, the fault-reporting user sample is assigned to the Noise sample set G_Noise;

(4-1-3) For each g_p (g_p ∈ G_Border), find its K₂ random nearest-neighbor samples g_pk in G_min and compute the attribute difference h_pk between g_p and g_pk;

For each fault-reporting user sample g_p in G_Border = {g₁, ..., g_p, ..., g_P}, where P is the total number of samples in the Border sample set, find its K₂ random nearest neighbors g_pk in the sample set G_min; compute the difference h_pk of all attributes between the sample g_pk and the fault-reporting user sample g_p:

h_{pk} = g_p - g_{pk}, \quad (p = 1, \ldots, P;\ k = 1, \ldots, K_2)

(4-1-4) Generate a synthetic fault-reporting sample set Y_p for every g_p (g_p ∈ G_Border);

If g_pk ∈ G_Noise or g_pk ∈ G_Safe, h_pk is multiplied by a random number r_pk ∈ (0, 0.5); if g_pk ∈ G_Border, h_pk is multiplied by a random number r_pk ∈ (0, 1); the synthetic sample y_pk generated for each g_p is then:

y_{pk} = g_k + |r_{pk} \times h_{pk}|, \quad (p = 1, \ldots, P;\ k = 1, \ldots, K_2)

The synthetic fault-reporting user sample set finally generated is Y_bsmote = {Y₁, ..., Y_P};

(4-1-5) Repeat steps (4-1-3) and (4-1-4) to compute the new sample set Y_p (p = 1, ..., P) of each fault-reporting user in G_Border, and determine the balance value α of the BSMOTE algorithm, until the total number of new fault-reporting samples contained in the generated Y_bsmote = {Y₁, ..., Y_P} is greater than or equal to (1-β)N₂ - N₁;

The balance value α is the smallest integer greater than or equal to a quantity whose expression is not reproduced here;

(4-2-1) Compute the Mahalanobis distance d(g_m, g_l) between each non-fault-reporting user record g_m (g_m ∈ G_maj) and every other user record g_l (g_l ∈ G; g_l ≠ g_m);

(4-2-2) Based on d(g_m, g_l), compute the association set C_m of each sample g_m in G_maj;

The association set C_m is defined as the set of samples in G_maj, other than g_m, whose K₃ nearest neighbors contain g_m;

(4-2-3) Classify g_m according to how its presence or absence affects the accuracy of the K₄-NN algorithm on g_mn (g_mn ∈ C_m);

Choose an odd value K₄; compute Num_p, the number of g_mn (g_mn ∈ C_m) that the K₄-NN algorithm classifies correctly when g_m is present; then compute Num_no-p, the number of g_mn (g_mn ∈ C_m) that K₄-NN classifies correctly when g_m is absent; compare Num_p and Num_no-p and classify g_m according to the following criteria:

When Num_p ≤ Num_no-p, g_m has a negative effect and is assigned to the Noise sample set S_Noise;

When Num_p = Num_no-p, g_m is not essential and is assigned to the Safe sample set S_Safe;

When Num_p ≥ Num_no-p, g_m is useful and is assigned to the Save sample set S_Save;

(4-2-4) Delete samples from S_Noise first and then from S_Safe until the non-fault-reporting sample set meets the condition, and finally output the processed complete data set G_smote+odr;

Define Y_odr as the set of deleted non-fault-reporting sample points; the deleted sample points are taken preferentially from S_Noise and then from S_Safe; the total number of deleted samples in Y_odr is greater than or equal to βN₂, i.e. the total number of samples in the processed non-fault-reporting set {G_maj - Y_odr} is less than or equal to (1-β)N₂;

G_smote+odr = {G_maj - Y_odr} + {G_min + Y_bsmote}

(5-1) Initialize the set G_{smote+odr+tomek};

(5-2) Randomly draw a sample point g_i from G_smote+odr and find its nearest neighbor g_j (g_j ≠ g_i) in G_smote+odr;

(5-3) Find the nearest neighbor g_k (g_k ≠ g_j) of g_j in G_smote+odr;

(5-4) Check whether g_i == g_k holds; if it does, continue with (5-5); otherwise set g_i = g_j, g_j = g_k and jump to step (5-3);

(5-5) Check whether the user classes (fault-reporting or non-fault-reporting) corresponding to g_i and g_k are the same; if they are, save the two sample points to the sample set G_{smote+odr+tomek} and then delete them from G_smote+odr; if the classes differ, delete the two sample points directly from G_smote+odr;

(5-6) Check whether the number of samples in G_smote+odr is an even number greater than 0; if so, repeat from step (5-2); otherwise terminate;

Step 6: train an SVM classifier on the data in G_{smote+odr+tomek}, adaptively adjust the kernel width σ of the SVM classifier by combining coarse and fine step sizes, find the approximately optimal global point, and determine the corresponding σ_optimal;

(6-1) Choose the Gaussian kernel as the kernel function of the SVM classifier:

K(g_x, \bar{g}_x) = \exp\left(- \| g_x - \bar{g}_x \|^{2} / (2 \sigma^{2})\right)

where g_x ∈ G_{smote+odr+tomek}, \bar{g}_x is the mean of g_x, and σ is the Gaussian kernel width;

Based on the confusion matrix of the classified sample set, the fault-reporting recall Recall_Min, the fault-reporting precision Precision_Min, the non-fault-reporting recall Recall_Maj, the geometric mean G-mean and the F-measure are expressed as follows:

Recall_{Min} = \frac{TP}{TP + FN}, \quad Precision_{Min} = \frac{TP}{FP + TP}, \quad Recall_{Maj} = \frac{TN}{TN + FP}

G\text{-}mean = \sqrt{Recall_{Min} \times Recall_{Maj}}

F\text{-}measure = \frac{2 \times Recall_{Min} \times Precision_{Min}}{Recall_{Min} + Precision_{Min}}

(6-3) Initialize the SVM classifier penalty factor, the kernel width σ, the maximum kernel width σ_max and the coarse step size, then run the SVM classifier to obtain the best local point of G-mean and F-measure;

Vary σ with the coarse step size; each time a better SVM classification result is obtained, update the best local point, until σ > σ_max, at which point the search ends and the final best local point is selected;

(6-4) Starting from the left side of the best local point, vary the kernel width σ with the fine adaptive step size; when G-mean and F-measure reach the approximately optimal global point, obtain the corresponding approximately optimal kernel width σ_optimal and output the classification result;

Step 7: input the IPTV user data to be predicted into the trained SVM detector to predict whether the user will report a fault, thereby providing early warning of IPTV fault-reporting users.

2. The user fault-reporting prediction method in an unbalanced IPTV data set according to claim 1, characterized in that, in step 1, the numerical indexes are the numerical indexes extracted from records containing the user id, index and fault-reporting time information.

3. The user fault-reporting prediction method in an unbalanced IPTV data set according to claim 1, characterized in that, in step 3, the value range of the balance value β is 0.2 ≤ β ≤ 0.5.

4. The user fault-reporting prediction method in an unbalanced IPTV data set according to claim 1, characterized in that the confusion matrix of the classified sample set in step 6-2 is: