CN107590263B - Distributed big data classification method based on multivariate decision tree model - Google Patents

Distributed big data classification method based on multivariate decision tree model

Info

Publication number
CN107590263B
Authority
CN
China
Prior art keywords
decision tree
tree model
data
projection
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710864745.6A
Other languages
Chinese (zh)
Other versions
CN107590263A (en)
Inventor
Zhang Yu (张宇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Original Assignee
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Technical University filed Critical Liaoning Technical University
Priority to CN201710864745.6A priority Critical patent/CN107590263B/en
Publication of CN107590263A publication Critical patent/CN107590263A/en
Application granted granted Critical
Publication of CN107590263B publication Critical patent/CN107590263B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed big data classification method based on a multivariate decision tree model comprises the following steps: a local node classifies unknown-class-label samples that arrive online at random using an integrated classifier shared by the central node, and stores labeled samples whose confidence exceeds a preset threshold in a data set; when the capacity of the data set exceeds a preset threshold, the data set is sent to the central node and then emptied; the central node merges the data sets sent by the local nodes into a training sample set, trains a multivariate decision tree model based on geometric contour similarity, adds it to the integrated classifier as a base classifier, and periodically updates the integrated classifier; the integrated classifier is shared with the local nodes, which use it to classify the streaming big data arriving online. Applying the multivariate decision tree based on geometric contour similarity in an integrated classifier effectively solves the problem of classifying distributed streaming big data in normalized form.

Description

Distributed big data classification method based on multivariate decision tree model
Technical Field
The invention relates to the technical field of big data classification, and in particular to a distributed big data classification method based on a multivariate decision tree model.
Background
Classification is one of the important tasks of data mining and a problem studied extensively in related fields such as machine learning, pattern recognition and artificial intelligence. Classification has a wide range of practical applications, including medical diagnosis, credit assessment, selective shopping and face recognition.
The rapid development of emerging information technologies and application modes such as cloud computing, the Internet of Things, mobile Internet and social media has driven explosive growth in global data volume and pushed human society into the big data era. Big data contains big information, big information refines into big knowledge, and big knowledge can help users improve insight and decision-making at a higher level, from a wider perspective and across a broader scope, creating unprecedented value for human society. At the same time, however, this great value is hidden inside big data, which exhibits extremely low value density, highly irregular distribution, deeply hidden information, and useful value that is hard to discover. Compared with traditional data classification, classification mining of such big data faces many challenges. First, traditional classification mining methods are based on a single learning sample set, whereas the distributed collection of big data requires classification learning to be performed in a distributed manner, so corresponding distributed learning strategies and methods must be studied. Second, dynamically flowing streaming big data differs markedly from the static data stored in traditional databases: it is impossible to store all the data first and then mine it offline, so online real-time collection techniques and time-varying incremental mining methods must be explored. Finally, traditional classification mining places high demands on the learning sample set, while classification mining of distributed, streaming big data requires multi-node, multi-step cooperative processing in which the purity of the learning sample set is hard to guarantee, so classification techniques with good robustness must be explored for the mining characteristics of big data.
For the classification of data streams, classification models based on ensemble learning are a good solution because of their strong resistance to concept drift and high classification accuracy. However, facing distributed data stream big data classification, existing decision-tree-based ensemble classification models have an urgent problem: most adopt univariate decision trees as base classifiers, and a univariate decision tree can only generate decision boundaries parallel to the coordinate axes, so a large number of base classifiers is needed to approximate the class boundary correctly. This reduces the learning performance and prediction efficiency of the ensemble classification model and makes it hard to adapt to applications such as intrusion detection that require fast prediction.
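To make the contrast concrete, the following minimal sketch (not from the patent; the feature index, weights and thresholds are illustrative) compares the two kinds of split test:

```python
import numpy as np

def univariate_split(x, j=0, threshold=0.5):
    """Axis-parallel test: compares a single feature x[j] to a threshold,
    so the decision boundary is parallel to a coordinate axis."""
    return x[j] <= threshold

def multivariate_split(x, w=np.array([0.8, -0.6]), threshold=0.1):
    """Oblique test: compares a weighted combination of features to a
    threshold, so the decision boundary can take any angle."""
    return float(np.dot(w, x)) <= threshold
```

Approximating an oblique class boundary with axis-parallel tests alone requires many staircase-like splits, which is why univariate base classifiers must be numerous.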
Disclosure of Invention
In order to solve the above problems, the invention provides a distributed big data classification method based on a multivariate decision tree model.
In order to achieve the purpose, the invention adopts the technical scheme that:
a distributed big data classification method based on a multivariate decision tree model comprises the following steps:
the local node classifies unknown-class-label samples that arrive online at random using the integrated classifier shared by the central node, and stores samples with known class labels whose confidence exceeds a preset threshold in a data set; when the capacity of the data set exceeds a preset threshold, the data set is sent to the central node and then emptied;
the central node combines the data sets sent by the local nodes to generate a training sample set, and trains a multivariate decision tree model based on geometric contour similarity using the training sample set;
the central node adds the multivariate decision tree model to the integrated classifier as a base classifier, and periodically updates the integrated classifier;
the central node shares the integrated classifier with the local nodes, and the local nodes classify the streaming big data arriving online using the integrated classifier.
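As an illustration of the four steps above, the following is a minimal sketch under assumed names (LocalNode, CentralNode, predict_with_confidence and train_godt are hypothetical stand-ins, not identifiers from the patent):

```python
class LocalNode:
    """Classifies arriving samples and buffers high-confidence ones (D_z)."""
    def __init__(self, ensemble, conf_threshold=0.9, capacity=1000):
        self.ensemble = ensemble           # integrated classifier shared by the central node
        self.conf_threshold = conf_threshold
        self.capacity = capacity
        self.dataset = []                  # the local data set D_z

    def on_sample(self, x, central):
        label, conf = self.ensemble.predict_with_confidence(x)
        if conf > self.conf_threshold:
            self.dataset.append((x, label))  # keep only high-confidence labeled samples
        if len(self.dataset) > self.capacity:
            central.receive(self.dataset)    # send D_z to the central node...
            self.dataset = []                # ...then empty it
        return label

class CentralNode:
    """Merges local data sets, trains a new GODT, and updates the ensemble."""
    def __init__(self, ensemble):
        self.ensemble = ensemble
        self.buffers = []

    def receive(self, dataset):
        self.buffers.append(dataset)

    def periodic_update(self):
        T = [s for d in self.buffers for s in d]  # merged training sample set
        self.ensemble.add(train_godt(T))          # train_godt: hypothetical GODT trainer
        self.buffers = []
        return self.ensemble                      # shared back to the local nodes
```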
The training of the multivariate decision tree model based on the similarity of the geometric outlines by using the training sample set specifically comprises the following steps:
projecting different types of sample points in the m-dimensional space to a numerical axis of a one-dimensional space by using a geometric contour similarity function, wherein the upper and lower boundaries of projection point sets of different types are category projection boundaries of different types of sample data;
sorting and grouping the projection points on the axes of the one-dimensional space by utilizing the category projection boundary to obtain a group of ordered projection point sets, dividing the ordered projection sets into a plurality of subsets, and marking the projection points in the difference set as leaf nodes; marking the projection points in the intersection as intermediate nodes;
and under the guidance of the optimal reference vector, determining a multivariate decision tree model by adopting a recursive projection splitting method.
Under the guidance of the optimal reference vector, determining a multivariate decision tree model by adopting a recursive projection splitting method, comprising the following steps of:
calculating the reference vector for which the intersection of the projection point subsets is minimum, i.e. the geometric contour similarity deviation between the projection point subsets is maximum; this reference vector is the optimal reference vector;
re-projecting the leaf nodes to form a new ordered projection point set in which the different classes of sample points in the parent node are separated, and re-calculating the reference vector for which the intersection of all subsets in the current projection point sets is minimum, i.e. the geometric contour similarity deviation between the projection point subsets is maximum;
and if the intersection between the projection point subsets is empty or the number of sample points in the intersection between the projection point subsets is less than a preset threshold, stopping splitting to obtain the final multivariate decision tree model.
The geometric contour similarity function measures the similarity between multi-dimensional objects by using the similarity between samples and between characteristic variables in the samples.
Beneficial effects:
the effectiveness of the technology is proved through theoretical analysis and numerical verification, and the remarkable effects are highlighted in the following aspects:
(1) The optimal reference vector is solved by maximizing the geometric contour similarity deviation among the classes, so that under its guidance the intersection of the projection point sets of different classes is minimal; this minimizes the set of samples with uncertain class attribution, so the class boundary can be learned accurately. In addition, because the class boundary is determined by the positional relation among the sets, the learning process is less affected by abnormal projection points and is more robust, and the class boundary can be identified accurately in a distributed streaming big data environment with low data purity.
(2) For the projection overlap region in the parent node, the optimal reference vector is recalculated at the child node, so that the sample points in the overlap region are correctly separated after re-projection. This solves the splitting problem of the projection overlap region and reduces the overall number of splits, which in turn reduces the training time of the decision tree and makes it feasible to build a decision-tree-based ensemble classification model for distributed data stream big data.
(3) The multivariate decision tree based on geometric contour similarity can generate decision boundaries of any angle and therefore has stronger representation capability than a univariate decision tree. To represent the same decision boundary it needs fewer base classifiers, which effectively alleviates the degradation of learning and prediction performance caused by adding base classifiers in distributed big data classification.
(4) The numerical experiment results show that the multivariate decision tree based on geometric contour similarity achieves high classification accuracy with low training time, effectively combining the high learning efficiency of the univariate decision tree with the strong representation capability of the multivariate decision tree. Applying it in the integrated classifier effectively solves the problem of classifying distributed streaming big data in normalized form.
Drawings
Fig. 1 shows the variation of classification accuracy with the size of the sliding window in the embodiment of the present invention, where (a), (b) and (c) show the variation for the four classifiers on the KDDCUP99, Heterogeneity Activity and Record Linkage datasets, respectively;
Fig. 2 shows the variation of classification accuracy over the mining sequence of the entire KDDCUP99 dataset when wt = 5 (wt denotes the sliding window size) in the embodiment of the present invention, where (a), (b), (c) and (d) show the variation for EGODT, EC4.5, ECart-LC and EHoeffdingTree, respectively;
Fig. 3 shows the variation of classification accuracy over the mining sequence of the entire Heterogeneity Activity dataset when wt = 5 in the embodiment of the present invention, where (a), (b), (c) and (d) show the variation for EGODT, EC4.5, ECart-LC and EHoeffdingTree, respectively;
Fig. 4 shows the variation of classification accuracy over the mining sequence of the Record Linkage dataset when wt = 5 in the embodiment of the present invention, where (a), (b), (c) and (d) show the variation for EGODT, EC4.5, ECart-LC and EHoeffdingTree, respectively;
Fig. 5 shows the variation of training time with the size of the sliding window in the embodiment of the present invention, where (a), (b) and (c) show the variation on the KDDCUP99, Heterogeneity Activity and Record Linkage datasets, respectively;
Fig. 6 shows the variation of classification accuracy with the number of base classifiers in the embodiment of the present invention, where (a), (b) and (c) show the variation on the KDDCUP99, Heterogeneity Activity and Record Linkage datasets, respectively;
FIG. 7 is a flowchart of a distributed big data classification method based on a multivariate decision tree model according to an embodiment of the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described in detail below with reference to examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A distributed big data classification method based on multivariate decision tree model, as shown in fig. 7, includes:
step 1, local node Lz( z 1, 2.. R.) samples of unknown class labels that arrive randomly online are classified and labeled with class labels using an integrated classifier shared by the central node G, and samples of known class labels with confidence levels exceeding a preset threshold are stored in a data set Dz. When the data set DzIs sent to the central node G when the capacity of (D) exceeds a preset threshold, and then the data set D is emptiedzAnd R is the number of local nodes. (initially, the training sample set with known classification labels is sampled for R times in a layering way, and the sampled data is respectively stored in a data set D1,D2,......,DR)
Step 2, the central node G merges the data sets D_1, D_2, ..., D_R sent by the local nodes to generate a training sample set, denoted T = (X_1, X_2, ..., X_n), where X_i is the i-th sample of T (i = 1, 2, ..., n), X_i = (x_i1, x_i2, ..., x_im), and x_ij is the j-th feature variable of X_i (j = 1, 2, ..., m); the multivariate decision tree model based on geometric contour similarity is trained using the training sample set T.
The step 2 specifically comprises the following substeps:
step 2-1, projecting different types of sample points in an m-dimensional space to a numerical axis of a one-dimensional space by using a geometric contour similarity function, wherein the upper and lower boundaries of projection point sets of different types are category projection boundaries of different types of sample data;
step 2-2, sorting and grouping the projection points on the number axis of the one-dimensional space by utilizing the category projection boundary to obtain a group of ordered projection point sets P1,P2,......,Pk(k is the number of classes in the training sample), the ordered set of projections is divided into a plurality of subsets, where P1,P2,......,PkThe intersection of (A) is denoted as Tp1,Tp2,......,Tps;P1,P2,......,PkIs recorded as Tq1,Tq2,......,Tqt. Due to difference set TqrThe projection points in ( r 1, 2.... t) belong to the same class, so the projection points in the difference set are labeled as leaf nodes; intersection TplThe projection points in (1, 2...., s) are not in the same class, and thus the projection points in the intersection are labeled as intermediate nodes.
Step 2-3, under the guidance of the optimal reference vector S_b, determining the multivariate decision tree model by a recursive projection splitting method;
the step 2-3 specifically comprises the following substeps:
step 2-3-1, calculating a reference vector, namely an optimal reference vector, when the intersection of all the projection point subsets is minimum, namely the similarity of the geometric outlines among the projection point subsets is maximum;
define sample XcAnd XdGeometric profile similarity function ρ (X ≠ d) of (c, d ≠ 1, 2.,. n, and c ≠ d)c,Xd) Comprises the following steps:
Figure BDA0001415843770000051
wherein, variable
Figure BDA0001415843770000052
Variables of
Figure BDA0001415843770000053
Variables of
Figure BDA0001415843770000054
The geometric contour similarity function measures the similarity between multi-dimensional objects by using the similarity between samples and between characteristic variables in the samples.
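The exact form of ρ appears only in the image-rendered formulas above; purely as a hypothetical stand-in illustrating a similarity that mixes whole-sample similarity with per-feature-variable closeness, one might write:

```python
import numpy as np

def contour_similarity(xc, xd, eps=1e-12):
    """Hypothetical stand-in for rho(X_c, X_d); NOT the patent's formula.
    Combines a sample-level term (cosine similarity) with a feature-level
    term (mean relative closeness of corresponding variables)."""
    sample_sim = np.dot(xc, xd) / (np.linalg.norm(xc) * np.linalg.norm(xd) + eps)
    feature_sim = 1.0 - np.mean(np.abs(xc - xd) / (np.abs(xc) + np.abs(xd) + eps))
    return 0.5 * (sample_sim + feature_sim)
```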
For the sample points of any two classes (C_f, C_g) projected onto subsets of projection points (P_f and P_g) in the one-dimensional space, calculate the optimal reference vector S_b for which the intersection of P_f and P_g is minimum. Let the mean vector of the training samples in C_f be X̄_f, the mean vector of the training samples in C_g be X̄_g, and their deviation vector be a = (a_1, a_2, ..., a_m). Let v = (w_1 a_1, w_2 a_2, ..., w_m a_m)^T, where w_1, w_2, ..., w_m are weight coefficients, and form the reference vector S from v; the geometric contour similarity deviation between S and the mean vectors is denoted f. Take the set of values of w_1, w_2, ..., w_m that maximizes f and calculate the optimal reference vector S_b from them. (The exact formulas for the mean vectors, the reference vector S, the deviation f and S_b are rendered only as images in the original document.)
Although the intersection T_pl is minimized under the guidance of the optimal reference vector S_b, T_pl ≠ ∅. If T_pl were split directly according to the class attribution of the projection points, each subset after splitting would contain very few samples of the same class (in extreme cases, even only one sample), causing splitting imbalance and reduced splitting efficiency; such subsets keep being split until termination, so the final decision tree is very complex and overfits severely.
Step 2-3-2, re-projecting the leaf nodes to form a new ordered projection point set in which the different classes of sample points in the parent node are separated, and re-calculating the reference vector for which the intersection of all subsets in the current projection point sets is minimum, i.e. the geometric contour similarity deviation between the projection point subsets is maximum; this is the new optimal reference vector S'_b.
Step 2-3-3, if the intersection between the projection point subsets is empty or the number of sample points in the intersection between the projection point subsets is less than a preset threshold, stopping splitting to obtain the final multivariate decision tree model.
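Putting steps 2-3-1 to 2-3-3 together, a minimal recursive sketch (find_optimal_reference_vector and project stand in for the image-only formulas; group_projections is the helper sketched under step 2-2; min_overlap is the preset threshold):

```python
def build_godt(X, y, min_overlap=5):
    """Recursive projection splitting; returns a nested dict as the tree."""
    S_b = find_optimal_reference_vector(X, y)  # maximizes inter-class similarity deviation
    p = project(X, S_b)                        # 1-D projection of all samples
    is_leaf, is_mid = group_projections(p, y)
    node = {"ref_vector": S_b,
            "leaf_points": (p[is_leaf], y[is_leaf]),
            "child": None}
    # Step 2-3-3: stop when the intersection is empty or below the threshold;
    # otherwise re-project the overlapping samples with a new optimal vector.
    if is_mid.sum() >= min_overlap:
        node["child"] = build_godt(X[is_mid], y[is_mid], min_overlap)
    return node
```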
Step 3, the central node adds the multivariate decision tree model established in step 2 to the integrated classifier as a base classifier and periodically updates the integrated classifier. Let the number of base classifiers of the integrated classifier be N. Suppose the current mining time point is t_h, the integrated classifier generated at the previous mining point is EC_{h-1}, containing M base classifiers (the symbols for EC_{h-1} and its u-th base classifier are rendered as images in the original document), and the training sample set under the sliding window wd_h is T_i. If M < N, use T_i to train the (M+1)-th base classifier; if M ≥ N, use T_i to retrain the base classifier in EC_{h-1} with the highest classification error rate. This ensures the difference among the base classifiers and realizes training of the base classifiers under the same conditions.
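A sketch of this update rule under assumed helper names (train_godt and error_rate are hypothetical):

```python
def update_ensemble(ensemble, T_i, N):
    """Grow the ensemble until it holds N base classifiers, then replace
    the one with the highest classification error rate on T_i."""
    new_clf = train_godt(T_i)
    if len(ensemble) < N:                       # M < N: add the (M+1)-th base classifier
        ensemble.append(new_clf)
    else:                                       # M >= N: replace the worst base classifier
        worst = max(range(len(ensemble)),
                    key=lambda u: error_rate(ensemble[u], T_i))
        ensemble[worst] = new_clf
    return ensemble
```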
Step 4, the central node shares the integrated classifier with the local nodes, and the local nodes classify the streaming big data arriving online using the integrated classifier.
Three experiments are designed in the distributed data stream big data environment to verify the method of the invention:
Experiment 1: test the classification accuracy and training time of EGODT, the integrated classifier whose base classifier is the multivariate decision tree based on geometric contour similarity (GODT); base classifiers from related work, C4.5, HoeffdingTree and Cart-LC, are selected to construct the comparison integrated classifiers EC4.5, EHoeffdingTree and ECart-LC;
Experiment 2: test the classification accuracy of the 4 integrated classifiers under different numbers of base classifiers;
Experiment 3: test the diversity of the base classifier GODT and compare it with C4.5, HoeffdingTree and Cart-LC.
Setting an experimental environment:
the experimental Distributed environment consists of 4 local nodes and 1 central Node, 5 virtual machines are constructed on 1 Workstation (CPU: E5-2620, memory: 40GB) by utilizing VM work, each virtual machine allocates 4G memories, 4 of the virtual machines are utilized as local nodes (Distributed nodes), 1 of the virtual machines is utilized as a central Node (Master Node), KDDCUP99 training Data sets and testing Data sets are Distributed and stored in the 4 local nodes, and a Stream Generator is deployed at each local Node to simulate the online arrival process of a Data Stream so as to form window Data, in order to better simulate the random arrival condition of the Data, the Data flow rate of each local Node can change along with the time, the change range is set to [1000, 3000]. a Data Collector and a global classifier are deployed at the central Node, the Data Collector collects the training Data for the global classifier, and the global classifier is responsible for training the classifier and distributes the generated classifier to the local nodes, and the local nodes replace the existing classifier by using the latest classifier and continue to classify the online arrived test data.
The real data sets KDDCUP99, Record Linkage and Heterogeneity Activity were selected as test data sets; the details of the data are shown in table 1.
TABLE 1 data set
(Table 1 is rendered only as an image in the original document and is not reproduced here.)
Comparative analysis of experimental results
1) Precision testing of integrated classifiers
As can be seen from figs. 1(a)-(c), on the KDDCUP99 and Heterogeneity Activity datasets the accuracy of the 4 classifiers gradually increases as wt increases, because the amount of training data is proportional to wt: the larger wt, the more training data and the better the learning quality of the 4 classifiers. This learning characteristic is not obvious on the Record Linkage dataset, because that dataset is binary and its class boundary is relatively simple; the training data available at wt = 5 already suffices for generalized learning of most of the class boundary, so classification accuracy improves only slightly as wt increases. In addition, the classification accuracy of EGODT and ECart-LC is better than that of EC4.5 and EHoeffdingTree, and EGODT's accuracy is highest at wt = 5, when the training data is smallest. This shows the learning advantage of EGODT with small training sets, because GODT is a decision tree constructed from class boundaries: each branch arc of a non-leaf node corresponds to a split boundary, and the split boundaries are determined by the class projection boundaries of the projection points.
To further illustrate this learning characteristic of GODT, figs. 2(a)-(d), 3(a)-(d) and 4(a)-(d) show the learning process of the 4 ensemble classifiers over the entire mining sequence at wt = 5. In the initial stage of the mining sequence of the KDDCUP99 dataset, the classification accuracy of all 4 classifiers fluctuates greatly; this is caused by the classifiers' update learning strategy, since in the initial stage the ensembles contain few base classifiers and the differences between them are large. However, the fluctuation range of EGODT is much smaller than that of the other 3 classifiers, and as the number of base classifiers increases it quickly converges to a small range and stays there for the rest of the mining sequence; the fluctuation of ECart-LC also converges quickly, but to a larger range; the fluctuation range of EC4.5 stabilizes after 700 points. Over the whole mining sequence of the Heterogeneity Activity dataset, the fluctuation ranges of EGODT and ECart-LC are small, while that of EC4.5 widens after 1500 points and that of EHoeffdingTree stabilizes after 750 points. Figs. 2(a)-(d) verify that EGODT can achieve higher classification performance with a smaller sliding window; this characteristic makes it possible to further reduce the sliding window size to adapt to abrupt or gradual concept drift and to reduce communication cost in the distributed data stream big data classification environment.
2) Training time testing of integrated classifiers
From figs. 5(a)-(c) it can be seen that as wt increases from 5 to 40, the time for one update learning of the 4 integrated classifiers increases, because a larger wt means more training data is collected during the wt period and thus a longer learning time. Moreover, on the KDDCUP99 dataset the training time of EGODT is less than that of the other 3 classifiers; on the Heterogeneity Activity dataset EGODT is comparable to EC4.5 and lower than ECart-LC; and on the Record Linkage dataset EGODT is close to EC4.5 and ECart-LC and lower than EHoeffdingTree.
In summary, the average training time of GODT is close to that of the univariate decision tree C4.5 and lower than that of the multivariate decision tree Cart-LC, because GODT's recursive projection strategy causes each split of a non-leaf node to generate at least one leaf node; the number of instances to be split therefore shrinks from the root node down, and the overall number of splits is reduced, which reduces the generation time of the decision tree.
3) Testing at different numbers of basis classifiers
As can be seen from fig. 6(a), for the KDDCUP99 dataset the classification accuracy of EGODT and ECart-LC at n = 5 is already close to that of EC4.5 at n = 30, and higher than that of EHoeffdingTree at n = 30. In summary, figs. 6(a)-(c) verify that the representation capability of GODT and Cart-LC is stronger: fewer base classifiers are needed to represent the same decision boundary.
4) Diversity testing of base classifiers
Tables 2-5 show the average disagreement measure results of the 4 base classifiers over the 3 datasets with wt = 10 and n = 10.
TABLE 2 Disagreement measure between the base classifiers of EGODT
TABLE 3 Disagreement measure between the base classifiers of EC4.5
TABLE 4 Disagreement measure between the base classifiers of ECart-LC
TABLE 5 Disagreement measure between the base classifiers of EHoeffdingTree
(Tables 2-5 are rendered only as images in the original document and are not reproduced here.)
As can be seen from tables 2-5, the mean and variance of the disagreement measure of GODT (the exact values are rendered only as an image in the original document) compare with those of C4.5, Cart-LC and HoeffdingTree, which are (0.36, 0.038), (0.40, 0.025) and (0.38, 0.028), respectively. The mean disagreement measure of GODT is higher than that of the other 3 classifiers, and the minimum disagreement measure between neighboring classifiers of GODT is 0.12, while those of C4.5, Cart-LC and HoeffdingTree are 0.02, 0 and 0.08, respectively, which indicates that the diversity of GODT is stronger. In addition, the variance of GODT is the same as that of Cart-LC and lower than those of the other two classifiers, which indicates that its diversity is relatively stable.

Claims (3)

1. A distributed big data classification method based on a multivariate decision tree model is characterized by comprising the following steps:
the local node classifies unknown-class-label samples that arrive online at random using the integrated classifier shared by the central node and marks them with class labels, and stores samples with known class labels whose confidence exceeds a preset threshold in a data set; when the capacity of the data set exceeds a preset threshold, the data set is sent to the central node and then emptied;
the central node combines the data sets sent by the local nodes to generate a training sample set, and the training sample set is utilized to train a multivariable decision tree model based on geometric contour similarity;
the central node adds the multivariable decision tree model as a base classifier into the integrated classifier, and periodically updates the integrated classifier;
the central node shares the integrated classifier with the local node, and the local node classifies the streaming big data arriving online using the integrated classifier;
the training of the multivariate decision tree model based on the similarity of the geometric outlines by using the training sample set specifically comprises the following steps:
projecting different types of sample points in the m-dimensional space to a numerical axis of a one-dimensional space by using a geometric contour similarity function, wherein the upper and lower boundaries of projection point sets of different types are category projection boundaries of different types of sample data;
sorting and grouping the projection points on the number axis of the one-dimensional space by utilizing the category projection boundary to obtain a group of ordered projection point sets, dividing the ordered projection point sets into a plurality of subsets, and marking the projection points in the difference set as leaf nodes; marking the projection points in the intersection as intermediate nodes;
and under the guidance of the optimal reference vector, determining a multivariate decision tree model by adopting a recursive projection splitting method.
2. The distributed big data classification method based on the multivariate decision tree model as claimed in claim 1, wherein the determining the multivariate decision tree model by using the recursive projection splitting method under the guidance of the optimal reference vector comprises:
calculating a reference vector when the intersection of the current projection point subsets is minimum, namely an optimal reference vector;
re-projecting the leaf nodes to form a new ordered projection point set, separating different types of sample points in a father node of the new ordered projection point set, and re-calculating a reference vector when the intersection of subsets in the current projection point sets is minimum, namely an optimal reference vector;
and if the intersection set between the projection point subsets is empty or the number of the sample points in the intersection set between the projection point subsets is less than a preset threshold value, stopping splitting to obtain the final multivariable decision tree model.
3. The multivariate decision tree model-based distributed big data classification method of claim 1, wherein the geometric contour similarity function measures similarity between multidimensional objects by using similarity between samples and between feature variables within samples.
CN201710864745.6A 2017-09-22 2017-09-22 Distributed big data classification method based on multivariate decision tree model Active CN107590263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710864745.6A CN107590263B (en) 2017-09-22 2017-09-22 Distributed big data classification method based on multivariate decision tree model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710864745.6A CN107590263B (en) 2017-09-22 2017-09-22 Distributed big data classification method based on multivariate decision tree model

Publications (2)

Publication Number Publication Date
CN107590263A CN107590263A (en) 2018-01-16
CN107590263B true CN107590263B (en) 2020-07-07

Family

ID=61047067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710864745.6A Active CN107590263B (en) 2017-09-22 2017-09-22 Distributed big data classification method based on multivariate decision tree model

Country Status (1)

Country Link
CN (1) CN107590263B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109068349B (en) * 2018-07-12 2021-08-06 重庆邮电大学 Indoor intrusion detection method based on small sample iterative migration
CN110889103B (en) * 2018-09-07 2024-04-05 京东科技控股股份有限公司 Method and system for verifying sliding block and model training method thereof
CN110309587B (en) * 2019-06-28 2024-01-16 京东城市(北京)数字科技有限公司 Decision model construction method, decision method and decision model
CN111447278B (en) * 2020-03-27 2021-06-08 第四范式(北京)技术有限公司 Distributed system for acquiring continuous features and method thereof
CN111754313B (en) * 2020-07-03 2023-09-26 南京大学 Efficient communication online classification method for distributed data without projection
CN112365352B (en) * 2020-11-30 2023-07-04 西安四叶草信息技术有限公司 Anti-cash-out method and device based on graph neural network
CN112637084B (en) * 2020-12-10 2022-09-23 中山职业技术学院 Distributed network flow novelty detection method and classifier

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737126A (en) * 2012-06-19 2012-10-17 合肥工业大学 Classification rule mining method under cloud computing environment
CN103838617A (en) * 2014-02-18 2014-06-04 河海大学 Method for constructing data mining platform in big data environment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7539664B2 (en) * 2001-03-26 2009-05-26 International Business Machines Corporation Method and system for operating a rating server based on usage and download patterns within a peer-to-peer network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737126A (en) * 2012-06-19 2012-10-17 合肥工业大学 Classification rule mining method under cloud computing environment
CN103838617A (en) * 2014-02-18 2014-06-04 河海大学 Method for constructing data mining platform in big data environment

Also Published As

Publication number Publication date
CN107590263A (en) 2018-01-16

Similar Documents

Publication Publication Date Title
CN107590263B (en) Distributed big data classification method based on multivariate decision tree model
Cano et al. Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
Wu et al. On quantitative evaluation of clustering systems
Das et al. Metaheuristic clustering
Gabrys et al. Combining labelled and unlabelled data in the design of pattern classification systems
Tang et al. Clustering big IoT data by metaheuristic optimized mini-batch and parallel partition-based DGC in Hadoop
Barddal et al. A survey on feature drift adaptation
CN113344128B (en) Industrial Internet of things self-adaptive stream clustering method and device based on micro clusters
Zemmal et al. A new hybrid system combining active learning and particle swarm optimisation for medical data classification
CN102831432A (en) Redundant data reducing method suitable for training of support vector machine
Prachuabsupakij CLUS: A new hybrid sampling classification for imbalanced data
CN110543913A (en) Genetic algorithm-based neighbor propagation clustering method
Ahsani et al. Improvement of CluStream algorithm using sliding window for the clustering of data streams
Karim et al. An adaptive ensemble classifier for mining complex noisy instances in data streams
Perez et al. Mahalanobis distance metric learning algorithm for instance-based data stream classification
Devi et al. A proficient method for text clustering using harmony search method
Thangam et al. Exponential kernelized feature map Theil-Sen regression-based deep belief neural learning classifier for drift detection with data stream
Angbera et al. An adaptive XGBoost-based optimized sliding window for concept drift handling in non-stationary spatiotemporal data streams classifications
Devi et al. Hybridized harmony search method for text clustering using concept factorization
Rong et al. Location bagging-based undersampling for imbalanced classification problems
Wattanakitrungroj et al. Versatile hyper-elliptic clustering approach for streaming data based on one-pass-thrown-away learning
Orliński et al. O (m log m) instance selection algorithms—RR-DROPs
Jankowski et al. Fast Encoding length-based prototype selection algorithms.
Thenmozhi et al. Weighed Quantum Particle Swarm Optimization Technique to Measure the Student Performance

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant