CN107590263B - Distributed big data classification method based on multivariate decision tree model - Google Patents

Distributed big data classification method based on multivariate decision tree model

Info

Publication number
CN107590263B
Authority
CN
China
Prior art keywords
decision tree
tree model
data
projection
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710864745.6A
Other languages
Chinese (zh)
Other versions
CN107590263A (en)
Inventor
Zhang Yu (张宇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Original Assignee
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Technical University filed Critical Liaoning Technical University
Priority to CN201710864745.6A priority Critical patent/CN107590263B/en
Publication of CN107590263A publication Critical patent/CN107590263A/en
Application granted granted Critical
Publication of CN107590263B publication Critical patent/CN107590263B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed big data classification method based on a multivariate decision tree model comprises the following steps: a local node classifies unknown-class-label samples that arrive online at random using an integrated classifier shared by the central node, and stores labeled samples whose confidence exceeds a preset threshold in a data set; when the capacity of the data set exceeds a preset threshold, the data set is sent to the central node and then emptied; the central node merges the data sets sent by the local nodes into a training sample set, trains a multivariate decision tree model based on geometric contour similarity, adds it to the integrated classifier as a base classifier, and periodically updates the integrated classifier; the integrated classifier is shared with the local nodes, which use it to classify the streaming big data arriving online. Applying the multivariate decision tree based on geometric contour similarity in an integrated classifier effectively solves the problem of classifying distributed streaming big data in normalized form.

Description

Distributed big data classification method based on multivariate decision tree model
Technical Field
The invention relates to the technical field of big data classification, and in particular to a distributed big data classification method based on a multivariate decision tree model.
Background
Classification is one of the important tasks of data mining and a problem studied extensively in related fields such as machine learning, pattern recognition and artificial intelligence. Classification has a wide range of practical applications, including medical diagnosis, credit assessment, selective shopping and face recognition.
The rapid development of emerging information technologies and application modes such as cloud computing, the Internet of Things, mobile Internet and social media has driven explosive growth in global data volume and pushed human society into the big data era. Big data contains big information, big information refines into big knowledge, and big knowledge can help users improve insight and decision-making at a higher level, from a wider perspective and across a broader scope, creating unprecedented value for human society. At the same time, however, this great value is hidden inside big data, which exhibits extremely low value density, highly irregular distribution, deeply hidden information, and useful value that is hard to discover. Compared with traditional data classification, classification mining of such big data faces many challenges. First, traditional classification mining methods are based on a single learning sample set, whereas the distributed collection of big data requires classification learning to be performed in a distributed manner, so corresponding distributed learning strategies and methods must be studied. Second, dynamically flowing streaming big data differs markedly from the static data stored in traditional databases: it is impossible to store all the data first and then mine it offline, so online real-time collection techniques and time-varying incremental mining methods must be explored. Finally, traditional classification mining places high demands on the learning sample set, while classification mining of distributed, streaming big data requires multi-node, multi-step cooperative processing in which the purity of the learning sample set is hard to guarantee, so classification techniques with good robustness must be explored for the mining characteristics of big data.
For the classification of data streams, classification models based on ensemble learning are a good solution because of their strong resistance to concept drift and high classification accuracy. However, facing distributed data stream big data classification, existing decision-tree-based ensemble classification models have an urgent problem: most adopt univariate decision trees as base classifiers, and a univariate decision tree can only generate decision boundaries parallel to the coordinate axes, so a large number of base classifiers is needed to approximate the class boundary correctly. This reduces the learning performance and prediction efficiency of the ensemble classification model and makes it hard to adapt to applications such as intrusion detection that require fast prediction.
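To make the contrast concrete, the following minimal sketch (not from the patent; the feature index, weights and thresholds are illustrative) compares the two kinds of split test:

```python
import numpy as np

def univariate_split(x, j=0, threshold=0.5):
    """Axis-parallel test: compares a single feature x[j] to a threshold,
    so the decision boundary is parallel to a coordinate axis."""
    return x[j] <= threshold

def multivariate_split(x, w=np.array([0.8, -0.6]), threshold=0.1):
    """Oblique test: compares a weighted combination of features to a
    threshold, so the decision boundary can take any angle."""
    return float(np.dot(w, x)) <= threshold
```

Approximating an oblique class boundary with axis-parallel tests alone requires many staircase-like splits, which is why univariate base classifiers must be numerous.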
Disclosure of Invention
In order to solve the above problems, the invention provides a distributed big data classification method based on a multivariate decision tree model.
In order to achieve the purpose, the invention adopts the technical scheme that:
a distributed big data classification method based on a multivariate decision tree model comprises the following steps:
the local node classifies unknown-class-label samples that arrive online at random using the integrated classifier shared by the central node, and stores samples with known class labels whose confidence exceeds a preset threshold in a data set; when the capacity of the data set exceeds a preset threshold, the data set is sent to the central node and then emptied;
the central node combines the data sets sent by the local nodes to generate a training sample set, and trains a multivariate decision tree model based on geometric contour similarity using the training sample set;
the central node adds the multivariate decision tree model to the integrated classifier as a base classifier, and periodically updates the integrated classifier;
the central node shares the integrated classifier with the local nodes, and the local nodes classify the streaming big data arriving online using the integrated classifier.
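As an illustration of the four steps above, the following is a minimal sketch under assumed names (LocalNode, CentralNode, predict_with_confidence and train_godt are hypothetical stand-ins, not identifiers from the patent):

```python
class LocalNode:
    """Classifies arriving samples and buffers high-confidence ones (D_z)."""
    def __init__(self, ensemble, conf_threshold=0.9, capacity=1000):
        self.ensemble = ensemble           # integrated classifier shared by the central node
        self.conf_threshold = conf_threshold
        self.capacity = capacity
        self.dataset = []                  # the local data set D_z

    def on_sample(self, x, central):
        label, conf = self.ensemble.predict_with_confidence(x)
        if conf > self.conf_threshold:
            self.dataset.append((x, label))  # keep only high-confidence labeled samples
        if len(self.dataset) > self.capacity:
            central.receive(self.dataset)    # send D_z to the central node...
            self.dataset = []                # ...then empty it
        return label

class CentralNode:
    """Merges local data sets, trains a new GODT, and updates the ensemble."""
    def __init__(self, ensemble):
        self.ensemble = ensemble
        self.buffers = []

    def receive(self, dataset):
        self.buffers.append(dataset)

    def periodic_update(self):
        T = [s for d in self.buffers for s in d]  # merged training sample set
        self.ensemble.add(train_godt(T))          # train_godt: hypothetical GODT trainer
        self.buffers = []
        return self.ensemble                      # shared back to the local nodes
```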
The training of the multivariate decision tree model based on the similarity of the geometric outlines by using the training sample set specifically comprises the following steps:
projecting different types of sample points in the m-dimensional space to a numerical axis of a one-dimensional space by using a geometric contour similarity function, wherein the upper and lower boundaries of projection point sets of different types are category projection boundaries of different types of sample data;
sorting and grouping the projection points on the axes of the one-dimensional space by utilizing the category projection boundary to obtain a group of ordered projection point sets, dividing the ordered projection sets into a plurality of subsets, and marking the projection points in the difference set as leaf nodes; marking the projection points in the intersection as intermediate nodes;
and under the guidance of the optimal reference vector, determining a multivariate decision tree model by adopting a recursive projection splitting method.
Under the guidance of the optimal reference vector, determining a multivariate decision tree model by adopting a recursive projection splitting method, comprising the following steps of:
calculating the reference vector for which the intersection of the projection point subsets is minimum, i.e. the geometric contour similarity deviation between the projection point subsets is maximum; this reference vector is the optimal reference vector;
re-projecting the leaf nodes to form a new ordered projection point set in which the different classes of sample points in the parent node are separated, and re-calculating the reference vector for which the intersection of all subsets in the current projection point sets is minimum, i.e. the geometric contour similarity deviation between the projection point subsets is maximum;
and if the intersection between the projection point subsets is empty or the number of sample points in the intersection between the projection point subsets is less than a preset threshold, stopping splitting to obtain the final multivariate decision tree model.
The geometric contour similarity function measures the similarity between multi-dimensional objects by using the similarity between samples and between characteristic variables in the samples.
Beneficial effects:
the effectiveness of the technology is proved through theoretical analysis and numerical verification, and the remarkable effects are highlighted in the following aspects:
(1) The optimal reference vector is solved by maximizing the geometric contour similarity deviation among the classes, so that under its guidance the intersection of the projection point sets of different classes is minimal; this minimizes the set of samples with uncertain class attribution, so the class boundary can be learned accurately. In addition, because the class boundary is determined by the positional relation among the sets, the learning process is less affected by abnormal projection points and is more robust, and the class boundary can be identified accurately in a distributed streaming big data environment with low data purity.
(2) For the projection overlap region in the parent node, the optimal reference vector is recalculated at the child node, so that the sample points in the overlap region are correctly separated after re-projection. This solves the splitting problem of the projection overlap region and reduces the overall number of splits, which in turn reduces the training time of the decision tree and makes it feasible to build a decision-tree-based ensemble classification model for distributed data stream big data.
(3) The multivariate decision tree based on geometric contour similarity can generate decision boundaries of any angle and therefore has stronger representation capability than a univariate decision tree. To represent the same decision boundary it needs fewer base classifiers, which effectively alleviates the degradation of learning and prediction performance caused by adding base classifiers in distributed big data classification.
(4) The numerical experiment results show that the multivariate decision tree based on geometric contour similarity achieves high classification accuracy with low training time, effectively combining the high learning efficiency of the univariate decision tree with the strong representation capability of the multivariate decision tree. Applying it in the integrated classifier effectively solves the problem of classifying distributed streaming big data in normalized form.
Drawings
Fig. 1 shows the variation of classification accuracy with the size of the sliding window in the embodiment of the present invention, where (a), (b) and (c) show the variation for the four classifiers on the KDDCUP99, Heterogeneity Activity and Record Linkage datasets, respectively;
Fig. 2 shows the variation of classification accuracy over the mining sequence of the entire KDDCUP99 dataset when wt = 5 (wt denotes the sliding window size) in the embodiment of the present invention, where (a), (b), (c) and (d) show the variation for EGODT, EC4.5, ECart-LC and EHoeffdingTree, respectively;
Fig. 3 shows the variation of classification accuracy over the mining sequence of the entire Heterogeneity Activity dataset when wt = 5 in the embodiment of the present invention, where (a), (b), (c) and (d) show the variation for EGODT, EC4.5, ECart-LC and EHoeffdingTree, respectively;
Fig. 4 shows the variation of classification accuracy over the mining sequence of the Record Linkage dataset when wt = 5 in the embodiment of the present invention, where (a), (b), (c) and (d) show the variation for EGODT, EC4.5, ECart-LC and EHoeffdingTree, respectively;
Fig. 5 shows the variation of training time with the size of the sliding window in the embodiment of the present invention, where (a), (b) and (c) show the variation on the KDDCUP99, Heterogeneity Activity and Record Linkage datasets, respectively;
Fig. 6 shows the variation of classification accuracy with the number of base classifiers in the embodiment of the present invention, where (a), (b) and (c) show the variation on the KDDCUP99, Heterogeneity Activity and Record Linkage datasets, respectively;
FIG. 7 is a flowchart of a distributed big data classification method based on a multivariate decision tree model according to an embodiment of the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described in detail below with reference to examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A distributed big data classification method based on multivariate decision tree model, as shown in fig. 7, includes:
step 1, local node Lz( z 1, 2.. R.) samples of unknown class labels that arrive randomly online are classified and labeled with class labels using an integrated classifier shared by the central node G, and samples of known class labels with confidence levels exceeding a preset threshold are stored in a data set Dz. When the data set DzIs sent to the central node G when the capacity of (D) exceeds a preset threshold, and then the data set D is emptiedzAnd R is the number of local nodes. (initially, the training sample set with known classification labels is sampled for R times in a layering way, and the sampled data is respectively stored in a data set D1,D2,......,DR)
Step 2, the central node G merges the data sets D_1, D_2, ..., D_R sent by the local nodes to generate a training sample set, denoted T = (X_1, X_2, ..., X_n), where X_i is the i-th sample of T (i = 1, 2, ..., n), X_i = (x_i1, x_i2, ..., x_im), and x_ij is the j-th feature variable of X_i (j = 1, 2, ..., m); the multivariate decision tree model based on geometric contour similarity is trained using the training sample set T.
The step 2 specifically comprises the following substeps:
step 2-1, projecting different types of sample points in an m-dimensional space to a numerical axis of a one-dimensional space by using a geometric contour similarity function, wherein the upper and lower boundaries of projection point sets of different types are category projection boundaries of different types of sample data;
step 2-2, sorting and grouping the projection points on the number axis of the one-dimensional space by utilizing the category projection boundary to obtain a group of ordered projection point sets P1,P2,......,Pk(k is the number of classes in the training sample), the ordered set of projections is divided into a plurality of subsets, where P1,P2,......,PkThe intersection of (A) is denoted as Tp1,Tp2,......,Tps;P1,P2,......,PkIs recorded as Tq1,Tq2,......,Tqt. Due to difference set TqrThe projection points in ( r 1, 2.... t) belong to the same class, so the projection points in the difference set are labeled as leaf nodes; intersection TplThe projection points in (1, 2...., s) are not in the same class, and thus the projection points in the intersection are labeled as intermediate nodes.
Step 2-3, under the guidance of the optimal reference vector S_b, determining the multivariate decision tree model by a recursive projection splitting method;
the step 2-3 specifically comprises the following substeps:
step 2-3-1, calculating a reference vector, namely an optimal reference vector, when the intersection of all the projection point subsets is minimum, namely the similarity of the geometric outlines among the projection point subsets is maximum;
define sample XcAnd XdGeometric profile similarity function ρ (X ≠ d) of (c, d ≠ 1, 2.,. n, and c ≠ d)c,Xd) Comprises the following steps:
Figure BDA0001415843770000051
wherein, variable
Figure BDA0001415843770000052
Variables of
Figure BDA0001415843770000053
Variables of
Figure BDA0001415843770000054
The geometric contour similarity function measures the similarity between multi-dimensional objects by using the similarity between samples and between characteristic variables in the samples.
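The exact form of ρ appears only in the image-rendered formulas above; purely as a hypothetical stand-in illustrating a similarity that mixes whole-sample similarity with per-feature-variable closeness, one might write:

```python
import numpy as np

def contour_similarity(xc, xd, eps=1e-12):
    """Hypothetical stand-in for rho(X_c, X_d); NOT the patent's formula.
    Combines a sample-level term (cosine similarity) with a feature-level
    term (mean relative closeness of corresponding variables)."""
    sample_sim = np.dot(xc, xd) / (np.linalg.norm(xc) * np.linalg.norm(xd) + eps)
    feature_sim = 1.0 - np.mean(np.abs(xc - xd) / (np.abs(xc) + np.abs(xd) + eps))
    return 0.5 * (sample_sim + feature_sim)
```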
For the sample points of any two classes (C_f, C_g) projected onto subsets of projection points (P_f and P_g) in the one-dimensional space, calculate the optimal reference vector S_b for which the intersection of P_f and P_g is minimum. Let the mean vector of the training samples in C_f be X̄_f, the mean vector of the training samples in C_g be X̄_g, and their deviation vector be a = (a_1, a_2, ..., a_m). Let v = (w_1 a_1, w_2 a_2, ..., w_m a_m)^T, where w_1, w_2, ..., w_m are weight coefficients, and form the reference vector S from v; the geometric contour similarity deviation between S and the mean vectors is denoted f. Take the set of values of w_1, w_2, ..., w_m that maximizes f and calculate the optimal reference vector S_b from them. (The exact formulas for the mean vectors, the reference vector S, the deviation f and S_b are rendered only as images in the original document.)
Although the intersection T_pl is minimized under the guidance of the optimal reference vector S_b, T_pl ≠ ∅. If T_pl were split directly according to the class attribution of the projection points, each subset after splitting would contain very few samples of the same class (in extreme cases, even only one sample), causing splitting imbalance and reduced splitting efficiency; such subsets keep being split until termination, so the final decision tree is very complex and overfits severely.
Step 2-3-2, re-projecting the leaf nodes to form a new ordered projection point set in which the different classes of sample points in the parent node are separated, and re-calculating the reference vector for which the intersection of all subsets in the current projection point sets is minimum, i.e. the geometric contour similarity deviation between the projection point subsets is maximum; this is the new optimal reference vector S'_b.
Step 2-3-3, if the intersection between the projection point subsets is empty or the number of sample points in the intersection between the projection point subsets is less than a preset threshold, stopping splitting to obtain the final multivariate decision tree model.
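Putting steps 2-3-1 to 2-3-3 together, a minimal recursive sketch (find_optimal_reference_vector and project stand in for the image-only formulas; group_projections is the helper sketched under step 2-2; min_overlap is the preset threshold):

```python
def build_godt(X, y, min_overlap=5):
    """Recursive projection splitting; returns a nested dict as the tree."""
    S_b = find_optimal_reference_vector(X, y)  # maximizes inter-class similarity deviation
    p = project(X, S_b)                        # 1-D projection of all samples
    is_leaf, is_mid = group_projections(p, y)
    node = {"ref_vector": S_b,
            "leaf_points": (p[is_leaf], y[is_leaf]),
            "child": None}
    # Step 2-3-3: stop when the intersection is empty or below the threshold;
    # otherwise re-project the overlapping samples with a new optimal vector.
    if is_mid.sum() >= min_overlap:
        node["child"] = build_godt(X[is_mid], y[is_mid], min_overlap)
    return node
```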
Step 3, the central node adds the multivariate decision tree model established in step 2 to the integrated classifier as a base classifier and periodically updates the integrated classifier. Let the number of base classifiers of the integrated classifier be N. Suppose the current mining time point is t_h, the integrated classifier generated at the previous mining point is EC_{h-1}, containing M base classifiers (the symbols for EC_{h-1} and its u-th base classifier are rendered as images in the original document), and the training sample set under the sliding window wd_h is T_i. If M < N, use T_i to train the (M+1)-th base classifier; if M ≥ N, use T_i to retrain the base classifier in EC_{h-1} with the highest classification error rate. This ensures the difference among the base classifiers and realizes training of the base classifiers under the same conditions.
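A sketch of this update rule under assumed helper names (train_godt and error_rate are hypothetical):

```python
def update_ensemble(ensemble, T_i, N):
    """Grow the ensemble until it holds N base classifiers, then replace
    the one with the highest classification error rate on T_i."""
    new_clf = train_godt(T_i)
    if len(ensemble) < N:                       # M < N: add the (M+1)-th base classifier
        ensemble.append(new_clf)
    else:                                       # M >= N: replace the worst base classifier
        worst = max(range(len(ensemble)),
                    key=lambda u: error_rate(ensemble[u], T_i))
        ensemble[worst] = new_clf
    return ensemble
```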
Step 4, the central node shares the integrated classifier with the local nodes, and the local nodes classify the streaming big data arriving online using the integrated classifier.
Three experiments are designed in the distributed data stream big data environment to verify the method of the invention:
Experiment 1: test the classification accuracy and training time of EGODT, the integrated classifier whose base classifier is the multivariate decision tree based on geometric contour similarity (GODT); base classifiers from related work, C4.5, HoeffdingTree and Cart-LC, are selected to construct the comparison integrated classifiers EC4.5, EHoeffdingTree and ECart-LC;
Experiment 2: test the classification accuracy of the 4 integrated classifiers under different numbers of base classifiers;
Experiment 3: test the diversity of the base classifier GODT and compare it with C4.5, HoeffdingTree and Cart-LC.
Setting an experimental environment:
the experimental Distributed environment consists of 4 local nodes and 1 central Node, 5 virtual machines are constructed on 1 Workstation (CPU: E5-2620, memory: 40GB) by utilizing VM work, each virtual machine allocates 4G memories, 4 of the virtual machines are utilized as local nodes (Distributed nodes), 1 of the virtual machines is utilized as a central Node (Master Node), KDDCUP99 training Data sets and testing Data sets are Distributed and stored in the 4 local nodes, and a Stream Generator is deployed at each local Node to simulate the online arrival process of a Data Stream so as to form window Data, in order to better simulate the random arrival condition of the Data, the Data flow rate of each local Node can change along with the time, the change range is set to [1000, 3000]. a Data Collector and a global classifier are deployed at the central Node, the Data Collector collects the training Data for the global classifier, and the global classifier is responsible for training the classifier and distributes the generated classifier to the local nodes, and the local nodes replace the existing classifier by using the latest classifier and continue to classify the online arrived test data.
The real data sets KDDCUP99, Record Linkage and Heterogeneity Activity were selected as test data sets; the details of the data are shown in table 1.
TABLE 1 data set
(Table 1 is rendered only as an image in the original document and is not reproduced here.)
Comparative analysis of experimental results
1) Precision testing of integrated classifiers
As can be seen from figs. 1(a)-(c), on the KDDCUP99 and Heterogeneity Activity datasets the accuracy of the 4 classifiers gradually increases as wt increases, because the amount of training data is proportional to wt: the larger wt, the more training data and the better the learning quality of the 4 classifiers. This learning characteristic is not obvious on the Record Linkage dataset, because that dataset is binary and its class boundary is relatively simple; the training data available at wt = 5 already suffices for generalized learning of most of the class boundary, so classification accuracy improves only slightly as wt increases. In addition, the classification accuracy of EGODT and ECart-LC is better than that of EC4.5 and EHoeffdingTree, and EGODT's accuracy is highest at wt = 5, when the training data is smallest. This shows the learning advantage of EGODT with small training sets, because GODT is a decision tree constructed from class boundaries: each branch arc of a non-leaf node corresponds to a split boundary, and the split boundaries are determined by the class projection boundaries of the projection points.
To further illustrate this learning characteristic of GODT, figs. 2(a)-(d), 3(a)-(d) and 4(a)-(d) show the learning process of the 4 ensemble classifiers over the entire mining sequence at wt = 5. In the initial stage of the mining sequence of the KDDCUP99 dataset, the classification accuracy of all 4 classifiers fluctuates greatly; this is caused by the classifiers' update learning strategy, since in the initial stage the ensembles contain few base classifiers and the differences between them are large. However, the fluctuation range of EGODT is much smaller than that of the other 3 classifiers, and as the number of base classifiers increases it quickly converges to a small range and stays there for the rest of the mining sequence; the fluctuation of ECart-LC also converges quickly, but to a larger range; the fluctuation range of EC4.5 stabilizes after 700 points. Over the whole mining sequence of the Heterogeneity Activity dataset, the fluctuation ranges of EGODT and ECart-LC are small, while that of EC4.5 widens after 1500 points and that of EHoeffdingTree stabilizes after 750 points. Figs. 2(a)-(d) verify that EGODT can achieve higher classification performance with a smaller sliding window; this characteristic makes it possible to further reduce the sliding window size to adapt to abrupt or gradual concept drift and to reduce communication cost in the distributed data stream big data classification environment.
2) Training time testing of integrated classifiers
From figs. 5(a)-(c) it can be seen that as wt increases from 5 to 40, the time for one update learning of the 4 integrated classifiers increases, because a larger wt means more training data is collected during the wt period and thus a longer learning time. Moreover, on the KDDCUP99 dataset the training time of EGODT is less than that of the other 3 classifiers; on the Heterogeneity Activity dataset EGODT is comparable to EC4.5 and lower than ECart-LC; and on the Record Linkage dataset EGODT is close to EC4.5 and ECart-LC and lower than EHoeffdingTree.
In summary, the average training time of GODT is close to that of the univariate decision tree C4.5 and lower than that of the multivariate decision tree Cart-LC, because GODT's recursive projection strategy causes each split of a non-leaf node to generate at least one leaf node; the number of instances to be split therefore shrinks from the root node down, and the overall number of splits is reduced, which reduces the generation time of the decision tree.
3) Testing at different numbers of basis classifiers
As can be seen from fig. 6(a), for the KDDCUP99 dataset the classification accuracy of EGODT and ECart-LC at n = 5 is already close to that of EC4.5 at n = 30, and higher than that of EHoeffdingTree at n = 30. In summary, figs. 6(a)-(c) verify that the representation capability of GODT and Cart-LC is stronger: fewer base classifiers are needed to represent the same decision boundary.
4) Diversity testing of base classifiers
Tables 2-5 show the average disagreement measure results of the 4 base classifiers over the 3 datasets with wt = 10 and n = 10.
TABLE 2 Disagreement measure between the base classifiers of EGODT
TABLE 3 Disagreement measure between the base classifiers of EC4.5
TABLE 4 Disagreement measure between the base classifiers of ECart-LC
TABLE 5 Disagreement measure between the base classifiers of EHoeffdingTree
(Tables 2-5 are rendered only as images in the original document and are not reproduced here.)
As can be seen from tables 2-5, the mean and variance of the disagreement measure of GODT (the exact values are rendered only as an image in the original document) compare with those of C4.5, Cart-LC and HoeffdingTree, which are (0.36, 0.038), (0.40, 0.025) and (0.38, 0.028), respectively. The mean disagreement measure of GODT is higher than that of the other 3 classifiers, and the minimum disagreement measure between neighboring classifiers of GODT is 0.12, while those of C4.5, Cart-LC and HoeffdingTree are 0.02, 0 and 0.08, respectively, which indicates that the diversity of GODT is stronger. In addition, the variance of GODT is the same as that of Cart-LC and lower than those of the other two classifiers, which indicates that its diversity is relatively stable.

Claims (3)

1. A distributed big data classification method based on a multivariate decision tree model is characterized by comprising the following steps:
the local node classifies unknown-class-label samples that arrive online at random using the integrated classifier shared by the central node and marks them with class labels, and stores samples with known class labels whose confidence exceeds a preset threshold in a data set; when the capacity of the data set exceeds a preset threshold, the data set is sent to the central node and then emptied;
the central node combines the data sets sent by the local nodes to generate a training sample set, and the training sample set is utilized to train a multivariable decision tree model based on geometric contour similarity;
the central node adds the multivariable decision tree model as a base classifier into the integrated classifier, and periodically updates the integrated classifier;
the central node shares the integrated classifier with the local node, and the local node classifies the streaming big data arriving online using the integrated classifier;
the training of the multivariate decision tree model based on the similarity of the geometric outlines by using the training sample set specifically comprises the following steps:
projecting different types of sample points in the m-dimensional space to a numerical axis of a one-dimensional space by using a geometric contour similarity function, wherein the upper and lower boundaries of projection point sets of different types are category projection boundaries of different types of sample data;
sorting and grouping the projection points on the number axis of the one-dimensional space by utilizing the category projection boundary to obtain a group of ordered projection point sets, dividing the ordered projection point sets into a plurality of subsets, and marking the projection points in the difference set as leaf nodes; marking the projection points in the intersection as intermediate nodes;
and under the guidance of the optimal reference vector, determining a multivariate decision tree model by adopting a recursive projection splitting method.
2. The distributed big data classification method based on the multivariate decision tree model as claimed in claim 1, wherein the determining the multivariate decision tree model by using the recursive projection splitting method under the guidance of the optimal reference vector comprises:
calculating a reference vector when the intersection of the current projection point subsets is minimum, namely an optimal reference vector;
re-projecting the leaf nodes to form a new ordered projection point set, separating different types of sample points in a father node of the new ordered projection point set, and re-calculating a reference vector when the intersection of subsets in the current projection point sets is minimum, namely an optimal reference vector;
and if the intersection set between the projection point subsets is empty or the number of the sample points in the intersection set between the projection point subsets is less than a preset threshold value, stopping splitting to obtain the final multivariable decision tree model.
3. The multivariate decision tree model-based distributed big data classification method of claim 1, wherein the geometric contour similarity function measures similarity between multidimensional objects by using similarity between samples and between feature variables within samples.
CN201710864745.6A 2017-09-22 2017-09-22 Distributed big data classification method based on multivariate decision tree model Active CN107590263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710864745.6A CN107590263B (en) 2017-09-22 2017-09-22 Distributed big data classification method based on multivariate decision tree model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710864745.6A CN107590263B (en) 2017-09-22 2017-09-22 Distributed big data classification method based on multivariate decision tree model

Publications (2)

Publication Number Publication Date
CN107590263A CN107590263A (en) 2018-01-16
CN107590263B true CN107590263B (en) 2020-07-07

Family

ID=61047067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710864745.6A Active CN107590263B (en) 2017-09-22 2017-09-22 Distributed big data classification method based on multivariate decision tree model

Country Status (1)

Country Link
CN (1) CN107590263B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109068349B (en) * 2018-07-12 2021-08-06 重庆邮电大学 Indoor intrusion detection method based on small sample iterative migration
CN110889103B (en) * 2018-09-07 2024-04-05 京东科技控股股份有限公司 Method and system for verifying sliding block and model training method thereof
CN110309587B (en) * 2019-06-28 2024-01-16 京东城市(北京)数字科技有限公司 Decision model construction method, decision method and decision model
CN111447278B (en) * 2020-03-27 2021-06-08 第四范式(北京)技术有限公司 Distributed system for acquiring continuous features and method thereof
CN111754313B (en) * 2020-07-03 2023-09-26 南京大学 Efficient communication online classification method for distributed data without projection
CN112365352B (en) * 2020-11-30 2023-07-04 西安四叶草信息技术有限公司 Anti-cash-out method and device based on graph neural network
CN112637084B (en) * 2020-12-10 2022-09-23 中山职业技术学院 Distributed network flow novelty detection method and classifier

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737126A (en) * 2012-06-19 2012-10-17 合肥工业大学 Classification rule mining method under cloud computing environment
CN103838617A (en) * 2014-02-18 2014-06-04 河海大学 Method for constructing data mining platform in big data environment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7539664B2 (en) * 2001-03-26 2009-05-26 International Business Machines Corporation Method and system for operating a rating server based on usage and download patterns within a peer-to-peer network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737126A (en) * 2012-06-19 2012-10-17 合肥工业大学 Classification rule mining method under cloud computing environment
CN103838617A (en) * 2014-02-18 2014-06-04 河海大学 Method for constructing data mining platform in big data environment

Also Published As

Publication number Publication date
CN107590263A (en) 2018-01-16

Similar Documents

Publication Publication Date Title
CN107590263B (en) Distributed big data classification method based on multivariate decision tree model
Cano et al. Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
Wu et al. On quantitative evaluation of clustering systems
Das et al. Metaheuristic clustering
Gabrys et al. Combining labelled and unlabelled data in the design of pattern classification systems
Tang et al. Clustering big IoT data by metaheuristic optimized mini-batch and parallel partition-based DGC in Hadoop
Barddal et al. A survey on feature drift adaptation
CN113344128B (en) Industrial Internet of things self-adaptive stream clustering method and device based on micro clusters
Zemmal et al. A new hybrid system combining active learning and particle swarm optimisation for medical data classification
CN102831432A (en) Redundant data reducing method suitable for training of support vector machine
Prachuabsupakij CLUS: A new hybrid sampling classification for imbalanced data
CN110543913A (en) Genetic algorithm-based neighbor propagation clustering method
Ahsani et al. Improvement of CluStream algorithm using sliding window for the clustering of data streams
Karim et al. An adaptive ensemble classifier for mining complex noisy instances in data streams
Perez et al. Mahalanobis distance metric learning algorithm for instance-based data stream classification
Devi et al. A proficient method for text clustering using harmony search method
Thangam et al. Exponential kernelized feature map Theil-Sen regression-based deep belief neural learning classifier for drift detection with data stream
Angbera et al. An adaptive XGBoost-based optimized sliding window for concept drift handling in non-stationary spatiotemporal data streams classifications
Devi et al. Hybridized harmony search method for text clustering using concept factorization
Rong et al. Location bagging-based undersampling for imbalanced classification problems
Wattanakitrungroj et al. Versatile hyper-elliptic clustering approach for streaming data based on one-pass-thrown-away learning
Orliński et al. O (m log m) instance selection algorithms—RR-DROPs
Jankowski et al. Fast Encoding length-based prototype selection algorithms.
Thenmozhi et al. Weighed Quantum Particle Swarm Optimization Technique to Measure the Student Performance

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant