CN106056160B

CN106056160B - User fault reporting prediction method under unbalanced IPTV data set

Info

Publication number: CN106056160B
Application number: CN201610392603.XA
Authority: CN
Inventors: 周亮; 吴志峰; 黄若尘; 魏昕
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2016-06-06
Filing date: 2016-06-06
Publication date: 2022-05-17
Anticipated expiration: 2036-06-06
Also published as: CN106056160A

Abstract

The invention discloses a method for predicting user fault reports from an imbalanced IPTV data set, comprising the following steps: (1) importing IPTV user viewing records and extracting the numerical indices; (2) averaging the viewing records of each user; (3) initializing a balance value β; (4) deleting non-fault-reporting samples with an ODR algorithm and adding artificial fault-reporting samples with a BSMOTE algorithm, both based on the Mahalanobis distance; (5) deleting newly added samples that negatively affect classification using the TOMEK algorithm; (6) training an SVM classifier with adaptively varying kernel width on the reconstructed sample data set; and (7) inputting the IPTV user data to be predicted into the trained SVM detector. Because the improved BSMOTE and ODR algorithms are based on the Mahalanobis distance, the information overlap caused by multiple correlations among variables is avoided, as is the influence of differing scales (dimensions) among sample attributes. A better sample data transformation is obtained, the interference of noise points and redundant points with fault-report prediction is weakened, and the prediction accuracy of the classifier is greatly improved.

Description

User fault reporting prediction method under unbalanced IPTV data set

Technical Field

The invention belongs to the technical field of IPTV data analysis and processing, and particularly relates to a user fault report prediction method under an unbalanced IPTV data set.

Background

With the rapid development of multimedia communication technology, Internet Protocol Television (IPTV) delivered over broadband internet, i.e. interactive network television, makes it much easier for ordinary residents to enjoy interactive, personalized, freely customizable video services and value-added application services at home. However, during video transmission, when the Quality of Service (QoS) of the underlying network deteriorates in terms of bandwidth, packet loss, delay, or jitter, the user's viewing experience is affected to some extent, which leads to user complaints and fault reports. Fault-reporting users make up a small proportion of all users, so the user data inevitably form an imbalanced data set, and the imbalance ratio keeps increasing as IPTV technology matures.

Predicting whether a user will report a fault is a typical binary classification problem. Mature algorithms for this problem include Support Vector Machines (SVMs), but the classification performance of an SVM decreases as the degree of data imbalance increases. Therefore, the imbalanced data are usually first converted into a balanced data set by a data-level algorithm, and the balanced data are then classified with an SVM classifier. Conventional data-level algorithms typically process the data with the Euclidean-distance-based oversampling algorithm BSMOTE (Borderline Synthetic Minority Oversampling Technique) or the Euclidean-distance-based undersampling algorithm ODR (Optimization of Decreasing Reduction). Although these algorithms can improve prediction accuracy, they cannot avoid the information overlap caused by multiple correlations among variables, nor can they guarantee that the generated artificial sample points are not noise points.

Chinese granted patent CN102254177B, entitled 'an unbalanced data SVM bearing fault detection method', provides an SVM bearing fault detection method for imbalanced data. Its defects are: (1) the Euclidean distance used in the algorithm is easily influenced by differing scales (dimensions) among sample attributes; (2) there is no effective method for removing the impure artificial sample points generated by the BSMOTE algorithm; (3) the potential of the kernel width in the SVM algorithm for improving classification accuracy is not fully exploited.

Disclosure of Invention

The invention aims to overcome the defects that the algorithm in the prior art is easily influenced by different dimensions among sample point attributes, an effective removing method for impurity artificial sample points is lacked, the advantage of kernel width in an SVM algorithm on improving classification accuracy can not be fully developed, and the like.

Therefore, the invention provides a user fault reporting prediction method under an unbalanced IPTV data set. The method comprises the following steps:

Step 1: Import the IPTV user viewing records, which contain information such as user id, indices, and fault-reporting time; extract only the numerical indices from the records and denote the variables as z;

Let the total number of imported IPTV users be N and the total number of records be D, of which N₁ users report faults and N₂ users do not. The nth user has D_n records (n = 1, ..., N). The numerical indices have dimension Q; z denotes the numerical index variables z₁, z₂, ..., z_Q, and each index z_q takes the value [formula not reproduced].

Step 2: Obtain the averaged record g_n (n = 1, ..., N) for each user, as follows:

Calculate the mean of each of the Q indices of the nth user [formula not reproduced],

so that each user retains only one record after preprocessing [formula not reproduced].

The data set composed of the N₁ minority (fault-reporting) users is denoted G_min [formula not reproduced].

The data set composed of the N₂ majority (non-fault-reporting) users is denoted G_maj [formula not reproduced].

The data set of all users is G = G_min ∪ G_maj.
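The averaging in Step 2 can be sketched as follows. This is a minimal illustration, assuming each viewing record arrives as a `(user_id, [z_1, ..., z_Q])` pair; the function name `average_records` and the toy data are hypothetical and not taken from the patent.

```python
from collections import defaultdict


def average_records(records):
    """Collapse all viewing records of a user into one mean vector g_n.

    records: iterable of (user_id, [z_1, ..., z_Q]) pairs (assumed layout).
    Returns {user_id: [mean of z_1, ..., mean of z_Q]}.
    """
    sums = {}
    counts = defaultdict(int)
    for user_id, z in records:
        if user_id not in sums:
            sums[user_id] = [0.0] * len(z)
        for q, value in enumerate(z):
            sums[user_id][q] += value
        counts[user_id] += 1  # D_n, the number of records of this user
    return {u: [s / counts[u] for s in sums[u]] for u in sums}


# Toy data: user "u1" has two records, user "u2" has one.
example = [("u1", [1.0, 10.0]), ("u1", [3.0, 30.0]), ("u2", [5.0, 50.0])]
averaged = average_records(example)
```

After this step every user is represented by a single Q-dimensional vector, which is what the later distance computations operate on.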

Step 3: Initialize the balance value β of the Mahalanobis-distance-based ODR algorithm;

Determine the balance value β. If β is too small, the reduction of the majority class is not significant; if β is too large, valuable majority-class samples are likely to be deleted by mistake. The value range of β is 0.2 ≤ β ≤ 0.5.

Step 4: Use the Mahalanobis-distance-based BSMOTE algorithm to add the artificial fault-reporting user sample set Y_bsmote, and determine the balance value α of the BSMOTE algorithm. Then use the Mahalanobis-distance-based ODR algorithm to remove the non-fault-reporting user sample set Y_odr, so as to obtain the balanced data set G_smote+odr;

(4-1) Determine the added artificial fault-reporting user sample set Y_bsmote using Mahalanobis-distance-based BSMOTE:

(4-1-1) Calculate the Mahalanobis distance d(g_i, g_j) between each fault-reporting user's data g_i ∈ G_min and every other user's data g_j ∈ G (g_j ≠ g_i):

d(g_i, g_j) = sqrt((g_i − g_j)^T Σ⁻¹ (g_i − g_j))

where Σ⁻¹ is the inverse of the covariance matrix of the total user data set G.
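As a hedged illustration of step (4-1-1), the standard Mahalanobis distance can be computed as below. The sketch assumes NumPy is available, uses a pseudo-inverse in case the covariance matrix is singular (an assumption, not something the patent states), and the toy matrix `G` is invented for the example.

```python
import numpy as np


def mahalanobis(g_i, g_j, cov_inv):
    """Mahalanobis distance between two sample vectors given an inverse covariance."""
    diff = np.asarray(g_i, dtype=float) - np.asarray(g_j, dtype=float)
    return float(np.sqrt(diff @ cov_inv @ diff))


# Toy data set G: four users, two numerical indices each (rows are averaged records).
G = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])

# Inverse covariance estimated once from the whole data set G.
cov_inv = np.linalg.pinv(np.cov(G, rowvar=False))

d = mahalanobis(G[0], G[1], cov_inv)
```

With an identity matrix in place of `cov_inv` the function reduces to the Euclidean distance, which makes the difference between the two metrics easy to check on small data.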

(4-1-2) According to d(g_i, g_j), use the K-Nearest Neighbor (K-NN) algorithm to determine the nearest-neighbor sample set G_n-KNN of the nth fault-reporting user (n = 1, ..., N₁), and determine the sample set to which that user belongs.

Choose an odd number K₁ for the K-NN algorithm and count how many samples in the fault-reporting user's nearest-neighbor set belong to non-fault-reporting users.

If the Border condition is satisfied [formula not reproduced], the fault-reporting user sample is placed in the Border sample set G_Border.

If G_n-KNN ∩ G_maj = ∅, the fault-reporting user sample is placed in the Safe sample set G_Safe.

If |G_n-KNN ∩ G_maj| = K₁, the fault-reporting user sample is placed in the Noise sample set G_Noise.

Here G_n-KNN is the set of the K₁ sample points nearest to the sample point of the nth fault-reporting user [formula not reproduced].

(4-1-3) For each g_p (g_p ∈ G_Border), find a set of K₂ random nearest neighbors in G_min and calculate the attribute difference h_pk between g_p and each neighbor g_pk.

For each fault-reporting user sample g_p in G_Border = {g₁, ..., g_p, ..., g_P}, find K₂ random nearest neighbors in the sample set G_min [formula not reproduced], where P is the total number of samples in the Border sample set. Calculate all attribute differences h_pk between each neighbor sample g_pk and the fault-reporting user sample g_p:

h_pk = g_p − g_pk, (p = 1, ..., P; k = 1, ..., K₂)

(4-1-4) For each g_p (g_p ∈ G_Border), generate an artificial fault-reporting sample set Y_p.

If g_pk ∈ G_Noise or g_pk ∈ G_Safe, h_pk is multiplied by a random number r_pk ∈ (0, 0.5). If g_pk ∈ G_Border, h_pk is multiplied by a random number r_pk ∈ (0, 1). The artificial sample y_pk generated for each g_p is then:

y_pk = g_p + |r_pk × h_pk|, (p = 1, ..., P; k = 1, ..., K₂)

The finally generated artificial fault-reporting user sample set is: [formula not reproduced]
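The generation rule of step (4-1-4) can be sketched as follows, taking g_p as the base point as written in claim 1. The function name `synthesize`, the string labels for the neighbor's set, and the toy vectors are illustrative assumptions; note also that `uniform` may return the interval endpoints, whereas the text specifies open intervals.

```python
import random


def synthesize(g_p, g_pk, neighbor_set, rng=random):
    """Create one artificial sample y_pk = g_p + |r_pk * h_pk| with h_pk = g_p - g_pk.

    neighbor_set: "border", "safe" or "noise", the set the neighbor g_pk belongs to.
    """
    # r_pk is drawn from (0, 1) for Border neighbors and (0, 0.5) otherwise.
    upper = 1.0 if neighbor_set == "border" else 0.5
    r = rng.uniform(0.0, upper)
    return [a + abs(r * (a - b)) for a, b in zip(g_p, g_pk)]


rng = random.Random(0)  # seeded only to make the example repeatable
y = synthesize([1.0, 4.0], [3.0, 2.0], "safe", rng)
```

Because of the absolute value, every component of the artificial sample is at least as large as the corresponding component of g_p, and for Safe or Noise neighbors it moves by at most half of the attribute difference.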

(4-1-5) Repeat steps (4-1-3) and (4-1-4) to compute the newly added sample set Y_p (p = 1, ..., P) of each fault-reporting user in G_Border, and determine the balance value α of the BSMOTE algorithm, until the total number of newly added fault-reporting samples contained in Y_bsmote = {Y₁, ..., Y_P} is greater than or equal to (1−β)N₂ − N₁.

Here the balance value α is the smallest integer greater than or equal to [formula not reproduced].

(4-2) Determine the removed non-fault-reporting user sample set Y_odr using Mahalanobis-distance-based ODR:

(4-2-1) Calculate the Mahalanobis distance d(g_m, g_l) between each non-fault-reporting user's data g_m (g_m ∈ G_maj) and every other user's data g_l (g_l ∈ G; g_l ≠ g_m).

(4-2-2) According to d(g_m, g_l), compute the association set of each sample g_m in G_maj.

The association set C_m is defined as the set of samples in G_maj, other than g_m, whose K₃ nearest neighbors include g_m. Let g_mn denote a sample point g_n (g_n ∈ G_maj) whose K₃ nearest neighbors include g_m; the set C_m of all such g_mn is the association set of the sample point g_m.
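The association set is a reverse nearest-neighbor relation, which the following sketch makes concrete. For brevity it uses the Euclidean distance from `math.dist` instead of the Mahalanobis distance and treats all given points as majority samples; both are simplifying assumptions, and the name `association_sets` is hypothetical.

```python
import math


def association_sets(points, k):
    """Return {m: [indices n whose k nearest neighbors include point m]}."""
    n_points = len(points)
    assoc = {m: [] for m in range(n_points)}
    for n in range(n_points):
        # k nearest neighbors of point n, ties broken by index.
        neighbors = sorted(
            (j for j in range(n_points) if j != n),
            key=lambda j: (math.dist(points[n], points[j]), j),
        )[:k]
        for m in neighbors:
            assoc[m].append(n)  # n "points at" m, so n belongs to C_m
    return assoc


# Three points on a line; with k = 1 both outer points have point 1 as neighbor.
C = association_sets([(0.0,), (1.0,), (10.0,)], k=1)
```

A point far from all others, such as the third one here, ends up with an empty association set because no other point counts it among its nearest neighbors.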

(4-2-3) Classify g_m according to how its presence or absence affects the accuracy with which the K₄-NN algorithm classifies g_mn (g_mn ∈ C_m).

Choose an odd number K₄. Compute the number Num_p of samples g_mn (g_mn ∈ C_m) that the K₄-NN algorithm classifies correctly when g_m is present, then the number Num_no-p that it classifies correctly when g_m is absent. Compare Num_p and Num_no-p and classify g_m by the following criteria:

If Num_p < Num_no-p, g_m has a negative effect and is placed in the Noise sample set S_Noise.

If Num_p = Num_no-p, g_m is dispensable and is placed in the Safe sample set S_Safe.

If Num_p > Num_no-p, g_m is useful and is placed in the Save sample set S_Save.

(4-2-4) Delete samples from S_Noise first, then from S_Safe, until the non-fault-reporting sample set meets the condition; finally output the fully processed data set G_smote+odr.

Define Y_odr as the set of deleted non-fault-reporting sample points; the deleted points are taken from S_Noise first, then from S_Safe. The total number of samples in Y_odr is greater than or equal to βN₂, i.e. the processed non-fault-reporting sample set {G_maj − Y_odr} contains at most (1−β)N₂ samples.

After the Mahalanobis-distance-based ODR and BSMOTE algorithms, the entire data set G_smote+odr is:

G_smote+odr＝{G_maj-Y_odr}+{G_min+Y_bsmote}
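The ODR bookkeeping of steps (4-2-3) and (4-2-4) can be sketched as below, using the strict comparisons given in claim 1. The helper names, the dictionary layout, and the toy counts are hypothetical; computing Num_p and Num_no-p themselves (the K₄-NN runs) is outside this sketch.

```python
def odr_category(num_with, num_without):
    """Classify a majority sample from Num_p (with it) and Num_no-p (without it)."""
    if num_with < num_without:
        return "noise"  # its presence hurts classification of its association set
    if num_with == num_without:
        return "safe"   # its presence makes no difference
    return "save"       # its presence helps, so it is kept


def choose_deletions(counts, quota):
    """Pick up to `quota` sample ids to delete: Noise samples first, then Safe ones.

    counts: {sample_id: (Num_p, Num_no_p)} (assumed layout).
    """
    noise = [s for s, (a, b) in counts.items() if odr_category(a, b) == "noise"]
    safe = [s for s, (a, b) in counts.items() if odr_category(a, b) == "safe"]
    return (noise + safe)[:quota]


counts = {"a": (3, 5), "b": (4, 4), "c": (5, 2), "d": (1, 2)}
deleted = choose_deletions(counts, quota=3)
```

Samples in the Save set are never candidates for deletion here, so if the quota exceeds the number of Noise and Safe samples the sketch simply returns fewer ids than requested.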

Step 5: Clean the data set G_smote+odr with the TOMEK algorithm to obtain the cleaned data G_{smote+odr+tomek};

(5-1) Initialize the set G_{smote+odr+tomek}.

(5-2) Randomly draw a sample point g_i from G_smote+odr and find its nearest neighbor g_j (g_j ≠ g_i) in G_smote+odr.

(5-3) Find the nearest neighbor g_k (g_k ≠ g_j) of g_j in G_smote+odr.

(5-4) Determine whether g_i == g_k. If so, continue with (5-5); otherwise set g_i = g_j, g_j = g_k and jump to step (5-3).

(5-5) Determine whether the user categories (fault-reporting or non-fault-reporting) corresponding to g_i and g_k are consistent. If they are consistent, save the two sample points to the sample set G_{smote+odr+tomek} and then delete them from G_smote+odr. If the categories are not consistent, delete the two sample points from G_smote+odr directly.

(5-6) Determine whether the number of samples in G_smote+odr is an even number greater than 0. If so, repeat from step (5-2); otherwise exit.
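Steps (5-1) to (5-6) can be sketched as a loop over mutually nearest pairs. This is one possible reading, not a definitive implementation: it uses the Euclidean distance from `math.dist` (the text does not say which distance TOMEK uses here), compares the labels of the mutually nearest pair (g_i, g_j) in step (5-5) because g_i and g_k coincide at that point, and checks the even-count condition before each pass rather than after it.

```python
import math
import random


def nearest(points, i, alive):
    """Index of the nearest remaining point to point i (ties broken by index)."""
    return min((j for j in alive if j != i),
               key=lambda j: (math.dist(points[i], points[j]), j))


def tomek_clean(points, labels, seed=0):
    """Return sorted indices of the samples kept after pairwise cleaning."""
    rng = random.Random(seed)
    alive = set(range(len(points)))   # remaining points of G_smote+odr
    kept = []                         # (5-1) the cleaned set starts empty
    while len(alive) > 0 and len(alive) % 2 == 0:  # (5-6)
        i = rng.choice(sorted(alive))              # (5-2)
        j = nearest(points, i, alive)
        k = nearest(points, j, alive)              # (5-3)
        while k != i:                              # (5-4) walk the neighbor chain
            i, j = j, k
            k = nearest(points, j, alive)
        if labels[i] == labels[j]:                 # (5-5) same class: keep both
            kept.extend([i, j])
        alive -= {i, j}                            # either way, remove the pair
    return sorted(kept)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 0]
kept = tomek_clean(points, labels)
```

In the toy data the first two points form a same-class pair and are kept, while the last two are mutual nearest neighbors of different classes and are both dropped.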

Step 6: g is to be_{smote+odr+tomek}The data in the step (a) is brought into an SVM classifier for training, the kernel width sigma of the SVM classifier is adaptively adjusted by combining the step length of the thickness and the step length, an approximate optimal global point is searched, and the corresponding sigma is determined_optimal；

(6-1) determining the Kernel function of the SVM classifier as a Gaussian Kernel function

Wherein g is_x∈G_{smote+odr+tomek}，

Is g_xσ is the gaussian kernel width.

(6-2) Determine the accuracy evaluation criteria of the model, the geometric mean G-mean and the F-measure:

The confusion matrix of the classified sample set is [table not reproduced].

The mathematical expressions for the fault-reporting recall Recall_Min, the fault-reporting precision Precision_Min, the non-fault-reporting recall Recall_Maj, the geometric mean G-mean, and the F-measure are, respectively: [formulas not reproduced]

G-mean maximizes the classification accuracy while keeping the accuracy on fault-reporting and non-fault-reporting users balanced; that is, G-mean reaches its maximum only when Recall_Min and Recall_Maj are both high. The F-measure is a classification evaluation index that jointly considers recall and precision. It reflects the classifier's performance on both fault-reporting and non-fault-reporting users, but places more weight on the classification of fault-reporting samples.
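For orientation, the commonly used definitions of these metrics can be computed from the confusion-matrix counts as follows. The patent's own formulas are not reproduced in this text, so treating them as identical to the standard ones (in particular the balanced F1 form of the F-measure) is an assumption; the function name and the toy counts are illustrative.

```python
import math


def evaluate(tp, fn, fp, tn):
    """Standard imbalanced-classification metrics from confusion-matrix counts.

    tp/fn: fault-reporting users predicted as reporting / not reporting.
    fp/tn: non-fault-reporting users predicted as reporting / not reporting.
    """
    recall_min = tp / (tp + fn)        # recall on the minority (fault-reporting) class
    precision_min = tp / (tp + fp)     # precision on the minority class
    recall_maj = tn / (tn + fp)        # recall on the majority class
    g_mean = math.sqrt(recall_min * recall_maj)
    f_measure = 2 * recall_min * precision_min / (recall_min + precision_min)
    return {
        "recall_min": recall_min,
        "precision_min": precision_min,
        "recall_maj": recall_maj,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }


metrics = evaluate(tp=80, fn=20, fp=10, tn=90)
```

Since G-mean is a product of the two recalls under a square root, it collapses toward zero whenever either class is poorly recalled, which is why it suits imbalanced data better than plain accuracy.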

(6-3) Initialize the penalty factor C, the kernel width σ, the maximum kernel width σ_max, and the coarse step size of the SVM classifier, then run the SVM classifier to obtain the optimal local point of G-mean and F-measure.

Change σ by the coarse step size, updating the optimal local point each time a better SVM classification result is obtained, until σ > σ_max. The best local point found is then selected.

(6-4) Starting from the left side of the optimal local point, adaptively change the kernel width σ in fine steps; when G-mean and F-measure reach the approximately optimal global point, obtain the corresponding approximately optimal kernel width σ_optimal and output the classification result.
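The coarse-then-fine search over σ can be sketched generically as below. The `score` callable stands in for "train the SVM with this σ and return a combined G-mean/F-measure value"; the fine-pass window of one coarse step on either side of the local optimum is an assumption, since the text only says the fine search starts from the left side of that point. All names and the toy score function are hypothetical.

```python
def coarse_to_fine(score, sigma_start, sigma_max, coarse, fine):
    """Two-pass grid search for the kernel width; returns (best_sigma, best_score)."""

    def scan(lo, hi, step):
        best_sigma, best_score = None, float("-inf")
        steps = int(round((hi - lo) / step))
        for n in range(steps + 1):
            sigma = round(lo + n * step, 10)  # avoid accumulated float drift
            s = score(sigma)
            if s > best_score:                # update the best point found so far
                best_sigma, best_score = sigma, s
        return best_sigma, best_score

    # Coarse pass over [sigma_start, sigma_max].
    local_sigma, _ = scan(sigma_start, sigma_max, coarse)
    # Fine pass starting one coarse step to the left of the local optimum.
    lo = max(local_sigma - coarse, fine)
    hi = min(local_sigma + coarse, sigma_max)
    return scan(lo, hi, fine)


# Toy stand-in for the SVM evaluation: a smooth curve peaking at sigma = 0.47.
def toy_score(sigma):
    return -(sigma - 0.47) ** 2


best_sigma, best_score = coarse_to_fine(toy_score, 0.1, 2.0, 0.1, 0.01)
```

The coarse pass costs about (σ_max − σ_start) / coarse evaluations and the fine pass about 2 × coarse / fine, far fewer than scanning the whole range at the fine step size.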

Step 7: Input the IPTV user data to be predicted into the trained SVM detector, predict whether each user will report a fault, and thereby provide early warning of fault-reporting IPTV users.

Further, in the step 1, the numerical indicator is extracted from the record including the user id, the indicator, and the failure report time information.

Furthermore, in step 3, the value range of the balance value β is 0.2 ≤ β ≤ 0.5.

Further, the confusion matrix of the classified sample set in step (6-2) is: [table not reproduced]

Compared with the prior art, the invention has the following beneficial effects:

The improved BSMOTE and ODR algorithms adopted in the invention are both based on the Mahalanobis distance, which avoids both the information overlap caused by multiple correlations among variables and the influence of differing scales (dimensions) among sample attributes, so a better sample data transformation is obtained.

The BSMOTE and ODR algorithms and the data-cleaning TOMEK algorithm adopted in the invention weaken the interference of noise points and redundant points with fault-report prediction on the one hand, and strengthen the contribution of the effective minority sample points to correct classification on the other. In addition, impure points generated by the BSMOTE algorithm that are hard to separate along the SVM classification boundary are eliminated, which greatly improves the prediction accuracy of the classifier.

The method combines coarse and fine step sizes to adaptively adjust the kernel width σ of the SVM classifier, which can significantly improve prediction accuracy at the cost of a small loss in σ precision while keeping the algorithm computationally efficient.

Drawings

Fig. 1 is a flowchart of the user fault-report prediction method for an imbalanced IPTV data set according to the invention.

Fig. 2 is a flowchart of the adaptive variable-kernel-width SVM according to the invention.

Fig. 3 shows the fault-report prediction results of the standard SVM and the conventional ODR-BSMOTE-SVM in an embodiment of the invention.

Fig. 4 shows the fault-report prediction results of the improved algorithm in an embodiment of the invention.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings.

To better illustrate the user fault-report prediction method for an imbalanced IPTV data set, the method is applied to early warning of IPTV fault reports. The training and test data used in the invention come from IPTV users of the telecommunications operator across Jiangsu province and comprise 439050 users and 4723101 viewing records. Of these, 4871 are fault-reporting users with 48172 viewing records, and 434179 are non-fault-reporting users with 4674929 viewing records, so the imbalance ratio between the minority and majority classes is about 1:89. The numerical index dimension of each viewing record is 10. In this example K₁ = K₃ = K₄ = 5, K₂ = 3, the initial balance value β = 0.3, and the penalty factor C = 1000.

Fault-reporting user prediction then proceeds according to the flow of the invention (shown in Fig. 1).

Step 1: Import the IPTV user viewing records, which contain information such as user id, indices, and fault-reporting time; extract only the numerical indices and denote the variables as z;

In this example, the total number of imported IPTV users is N = 439050 and the total number of records is D = 4723101, of which N₁ = 4871 users report faults and N₂ = 434179 users do not. The nth user has D_n records (n = 1, ..., 439050). The numerical indices have dimension Q = 10; z denotes the numerical index variables z₁, z₂, ..., z₁₀, and each index z_q takes the value [formula not reproduced].

Calculate the mean of each of the Q indices of the nth user [formula not reproduced].

After preprocessing, each user retains only one record [formula not reproduced]. The data set composed of the N₁ = 4871 minority (fault-reporting) users is G_min [formula not reproduced].

The data set composed of the N₂ = 434179 majority (non-fault-reporting) users is G_maj [formula not reproduced].

The data set of all users is G = G_min ∪ G_maj.

Step 3: Initialize the balance value β = 0.3 of the Mahalanobis-distance-based ODR algorithm;

Step 4: Use the Mahalanobis-distance-based BSMOTE algorithm to add the artificial fault-reporting user sample set Y_bsmote, and determine the balance value α of the BSMOTE algorithm. Then use the Mahalanobis-distance-based ODR algorithm to remove the non-fault-reporting user sample set Y_odr, so as to obtain the balanced data set G_smote+odr;

Here Σ⁻¹ is the inverse of the covariance matrix of the total user data set G.

(4-1-2) According to d(g_i, g_j), use the K-Nearest Neighbor (K-NN) algorithm to determine the nearest-neighbor sample set G_n-KNN of the nth fault-reporting user (n = 1, ..., N₁), and determine the sample set to which that user belongs.

Choose the odd number K₁ = 5 for the K-NN algorithm and count how many samples in the fault-reporting user's nearest-neighbor set belong to non-fault-reporting users.

If the Border condition is satisfied [formula not reproduced], the fault-reporting user sample is placed in the Border sample set.

If G_n-KNN ∩ G_maj = ∅, the fault-reporting user sample is placed in the Safe sample set G_Safe.

If |G_n-KNN ∩ G_maj| = 5, the fault-reporting user sample is placed in the Noise sample set G_Noise.

(4-1-3) For each g_p (g_p ∈ G_Border), find a set of K₂ = 3 random nearest neighbors in G_min and calculate the attribute difference h_pk between g_p and each neighbor g_pk.

Calculate all attribute differences h_pk between each neighbor sample g_pk and the fault-reporting user sample g_p:

h_pk = g_p − g_pk, (p = 1, ..., P; k = 1, ..., K₂)

(4-1-4) For each g_p (g_p ∈ G_Border), generate an artificial fault-reporting sample set Y_p.

y_pk = g_p + |r_pk × h_pk|, (p = 1, ..., P; k = 1, ..., K₂)

(4-1-5) Repeat steps (4-1-3) and (4-1-4) to compute the newly added sample set Y_p (p = 1, ..., P) of each fault-reporting user in G_Border, and determine the balance value α of the BSMOTE algorithm, until the total number of newly added fault-reporting samples contained in Y_bsmote = {Y₁, ..., Y_P} is greater than or equal to (1−β)N₂ − N₁ = 299054.

Here the balance value α is the smallest integer greater than or equal to [formula not reproduced].

(4-2-1) Calculate the Mahalanobis distance d(g_m, g_l) between each non-fault-reporting user's data g_m (g_m ∈ G_maj) and every other user's data g_l (g_l ∈ G; g_l ≠ g_m).

(4-2-3) Classify g_m according to how its presence or absence affects the accuracy with which the K₄-NN algorithm classifies g_mn (g_mn ∈ C_m).

Choose the odd number K₄ = 5. Compute the number Num_p of samples g_mn (g_mn ∈ C_m) that the K₄-NN algorithm classifies correctly when g_m is present, then the number Num_no-p that it classifies correctly when g_m is absent. Compare Num_p and Num_no-p and classify g_m accordingly.

Define Y_odr as the set of deleted non-fault-reporting sample points; the deleted points are taken from S_Noise first, then from S_Safe. The total number of samples in Y_odr is greater than or equal to βN₂ = 130254, i.e. the processed non-fault-reporting sample set {G_maj − Y_odr} contains at most (1−β)N₂ = 303925 samples.

G_smote+odr＝{G_maj-Y_odr}+{G_min+Y_bsmote}
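The sample counts quoted in this embodiment can be checked with a few lines of arithmetic. The sketch below reproduces the stated figures from N₁, N₂ and β; rounding the deletion quota up to a whole sample is an assumption about how the fractional value βN₂ = 130253.7 is handled.

```python
import math

N1, N2 = 4871, 434179   # fault-reporting and non-fault-reporting users
beta = 0.3              # balance value of the ODR step

# At least beta * N2 majority samples are deleted (rounded up to whole samples).
min_deleted = math.ceil(beta * N2)

# The majority class therefore keeps at most this many samples.
max_majority_left = N2 - min_deleted

# Artificial minority samples needed to match the reduced majority class.
min_added = max_majority_left - N1

# Imbalance ratio of the raw data, majority users per minority user.
ratio = round(N2 / N1)
```

With these values the minority class grows from 4871 to roughly the size of the reduced majority class, which is what makes the resulting data set approximately balanced.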

(5-1) Initialize the set G_{smote+odr+tomek}.

(5-2) Randomly draw a sample point g_i from G_smote+odr and find its nearest neighbor g_j (g_j ≠ g_i) in G_smote+odr.

(6-2) Determine the accuracy evaluation criteria of the model, the geometric mean G-mean and the F-measure.

Initialize the SVM classifier with penalty factor C = 1000, kernel width σ = 0.1, σ_max = 2, and coarse step size 0.1. Change σ by the coarse step size, updating the optimal local point each time a better SVM classification result is obtained, until σ > 2, to obtain the optimal local points of G-mean and F-measure.

As shown in the flowchart of Fig. 2, with the fine step size set to 0.01, σ is varied starting from the left side of the optimal local point σ = 0.2 obtained by the method of the invention, finally yielding the approximately optimal global point and the corresponding σ_optimal = 0.21.

Evaluation of Performance

The effectiveness and accuracy of the proposed method can be evaluated by comparing its predictions with the correct classification results. Fig. 3(a) and (b) show that the optimum obtained by the standard SVM algorithm lies around a kernel width of σ = 0.3. Its recall rates for fault reporting and non-fault reporting are about 65%, but the G-mean and F-measure values are generally low, below 0.1. Fig. 3(c) and (d) show that the conventional ODR-BSMOTE-SVM algorithm classifies better than the standard SVM, and that better G-mean and F-measure values are obtained when the Gaussian kernel width σ is 0.2 or less. Fig. 4(a) and (b) show that the method of the invention classifies significantly better than the other two algorithms, and that good G-mean and F-measure values are obtained when σ is 0.2 or less; after the fine-step search, the method determines that a kernel width of σ = 0.21 gives the near-optimal classification result.

The user fault-reporting recall rates measured for the standard SVM, the conventional ODR-BSMOTE-SVM, and the method of the invention are 64.0%, 71.7%, and 92.6%, respectively, and the non-fault-reporting recall rates are 69.04%, 71.78%, and 93.08%, respectively. The method of the invention therefore achieves better prediction performance.

It should be understood that the above description is not intended to limit the present invention, and any modification, equivalent replacement, or improvement made without departing from the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A user failure reporting prediction method under an unbalanced IPTV data set is characterized by comprising the following steps:

Step 1: importing IPTV user viewing records and extracting the numerical indices, the variables being denoted z:

letting the total number of imported IPTV users be N and the total number of records be D, of which N₁ users report faults and N₂ users do not; the nth user has D_n records, n = 1, ..., N; the recorded numerical indices have dimension Q, z denotes the numerical index variables z₁, z₂, …, z_q, …, z_Q, and each index z_q takes the value [formula not reproduced];

Step 2: obtaining the averaged record g_n of each user, n = 1, …, N, as follows:

calculating the mean of each of the Q indices of the nth user [formula not reproduced],

so that each user retains only one record after preprocessing [formula not reproduced];

the data set composed of the N₁ minority fault-reporting users being G_min [formula not reproduced];

the data set composed of the N₂ majority non-fault-reporting users being G_maj [formula not reproduced];

the data set of all users being G = G_min ∪ G_maj;

Step 4: using the Mahalanobis-distance-based BSMOTE algorithm to add the artificial fault-reporting user sample set Y_bsmote and determining the balance value α of the BSMOTE algorithm; then using the Mahalanobis-distance-based ODR algorithm to remove the non-fault-reporting user sample set Y_odr, so as to obtain the balanced data set G_smote+odr:

(4-1-1) calculating the Mahalanobis distance d(g_i, g_j) between each fault-reporting user's data g_i ∈ G_min and other user data g_j ∈ G, wherein g_i ≠ g_j [formula not reproduced];

wherein Σ⁻¹ is the inverse of the covariance matrix of the total user data set G;

(4-1-2) according to d(g_i, g_j), determining the nearest-neighbor sample set G_n-KNN of the nth fault-reporting user using the K-NN algorithm, and determining the sample set to which that user belongs, wherein n = 1, …, N₁:

choosing an odd value K₁ for the K-NN algorithm, and counting how many samples in the fault-reporting user's nearest-neighbor sample set belong to non-fault-reporting users;

if the Border condition is satisfied [formula not reproduced], placing the fault-reporting user sample in the Border sample set G_Border;

if G_n-KNN ∩ G_maj = ∅, placing the fault-reporting user sample in the Safe sample set G_Safe;

if |G_n-KNN ∩ G_maj| = K₁, placing the fault-reporting user sample in the Noise sample set G_Noise;

(4-1-3) for each g_p, finding the set G_min-K2NN of K₂ random nearest neighbors in G_min and calculating the attribute difference h_pk between g_p and g_pk, wherein g_p ∈ G_Border [formula not reproduced],

wherein P is the total number of samples in the Border sample set; calculating all attribute differences h_pk between each sample g_pk and the fault-reporting user sample g_p:

h_pk = g_p − g_pk, wherein p = 1, …, P; k = 1, …, K₂

(4-1-4) for each g_p, generating an artificial fault-reporting sample set Y_p, g_p ∈ G_Border:

if g_pk ∈ G_Noise or g_pk ∈ G_Safe, h_pk is multiplied by a random number r_pk ∈ (0, 0.5); if g_pk ∈ G_Border, h_pk is multiplied by a random number r_pk ∈ (0, 1); the artificial sample y_pk generated for each g_p is then:

y_pk = g_p + |r_pk × h_pk|, p = 1, …, P; k = 1, …, K₂

(4-1-5) repeating steps (4-1-3) and (4-1-4) to compute the newly added sample set Y_p of each fault-reporting user in G_Border, p = 1, …, P, and determining the balance value α of the BSMOTE algorithm, until the total number of newly added fault-reporting samples contained in Y_bsmote = {Y₁, ..., Y_P} is greater than or equal to (1−β)N₂ − N₁;

wherein the balance value α is the smallest integer greater than or equal to [formula not reproduced];

(4-2-1) calculating the Mahalanobis distance d(g_m, g_l) between each non-fault-reporting user's data g_m and other non-fault-reporting user data g_l, wherein g_m ∈ G_maj, g_l ∈ G_maj, and g_l ≠ g_m;

defining the association set C_m as the set of samples in G_maj, other than g_m, whose K₃ nearest neighbors include g_m;

(4-2-3) classifying g_m according to how its presence or absence affects the accuracy with which the K-NN algorithm classifies g_mn, wherein g_mn ∈ C_m:

choosing the K of the K-NN algorithm in this step to be an odd number K₄; computing the number Num_p of samples g_mn that the K-NN algorithm classifies correctly when g_m is present; then computing the number Num_no-p of samples g_mn that the K-NN algorithm classifies correctly when g_m is absent; comparing Num_p and Num_no-p and classifying g_m by the following criteria:

if Num_p < Num_no-p, g_m has a negative effect and is placed in the Noise sample set S_Noise;

if Num_p = Num_no-p, g_m is dispensable and is placed in the Safe sample set S_Safe;

if Num_p > Num_no-p, g_m is useful and is placed in the Save sample set S_Save;

(4-2-4) deleting samples from S_Noise first, then from S_Safe, until the non-fault-reporting sample set meets the condition, and finally outputting the fully processed data set G_smote+odr:

defining Y_odr as the set of deleted non-fault-reporting sample points, the deleted points being taken from S_Noise first, then from S_Safe; the total number of samples in Y_odr is greater than or equal to βN₂, i.e. the processed non-fault-reporting sample set {G_maj − Y_odr} contains at most (1−β)N₂ samples;

G_smote+odr＝{G_maj-Y_odr}+{G_min+Y_bsmote}

Step 5: cleaning the data set G_smote+odr with the TOMEK algorithm to obtain the cleaned data G_{smote+odr+tomek}:

(5-1) initializing the set G_{smote+odr+tomek};

(5-2) randomly drawing a sample point g_i from G_smote+odr and finding its nearest neighbor g_j in G_smote+odr, g_j ≠ g_i;

(5-3) finding the nearest neighbor g_k of g_j in G_smote+odr, g_k ≠ g_j;

(5-4) determining whether g_i == g_k; if so, continuing with (5-5); otherwise setting g_i = g_j, g_j = g_k and jumping to step (5-3);

(5-5) determining whether the user fault-reporting categories corresponding to g_i and g_k are consistent; if they are consistent, saving the two sample points to the sample set G_{smote+odr+tomek} and then deleting them from G_smote+odr; if the categories are not consistent, deleting the two sample points from G_smote+odr directly;

(5-6) determining whether the number of samples in G_smote+odr is an even number greater than 0; if so, returning to step (5-2); otherwise exiting;

step 6: g is to be_{smote+odr+tomek}The data in the step (a) is brought into an SVM classifier for training, the kernel width sigma of the SVM classifier is adaptively adjusted by combining the step length of the thickness and the step length, an approximate optimal global point is searched, and the corresponding sigma is determined_optimal：

Wherein g is_x∈G_{smote+odr+tomek，}

Is g_xσ is the gaussian kernel width;

wherein TN is a value of classifying as not reporting fault and predicting as not reporting fault for the user, FN is a value of classifying as reporting fault and predicting as not reporting fault for the user, FP is a value of classifying as not reporting fault and predicting as reporting fault for the user, and TP is a value of classifying as reporting fault and predicting as reporting fault for the user;

(6-3) initializing penalty factor C, kernel width sigma and kernel width maximum sigma of SVM classifier_maxCoarse step length, then entering SVM classifier operation to obtain the optimal local points of G-mean and F-measure;

changing sigma by coarse step length, updating the best local point after obtaining better SVM classification result each time until sigma > sigma_maxThen ending; at this time, the best local point is selected;

(6-4) adaptively changing the kernel width sigma in fine steps from the left side of the optimal local point, and when G-mean and F-measure become the approximate optimal global point, obtaining the corresponding approximate optimal kernel width sigma_optimalAnd outputting the classification result;

2. The method as claimed in claim 1, wherein in step 1 the numerical indices are extracted from records containing user id, index, and fault-reporting time information.

3. The method as claimed in claim 1, wherein in step 3 the value range of the balance value β is 0.2 ≤ β ≤ 0.5.