CN106934410A - Data classification method and system - Google Patents

Data classification method and system

Info

Publication number
CN106934410A
Authority
CN
China
Prior art keywords
data
classification algorithm
classification
classifier
set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201511020318.7A
Other languages
Chinese (zh)
Inventor
赵科科
王晓光
李文鹏
漆远
张柯
杨强鹏
隋宛辰
俞吴杰
杨旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201511020318.7A
Publication of CN106934410A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148: Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In the data classification method and system provided by this application, the parameters corresponding to a first classification algorithm are optimized using a primary training set to obtain primary parameters that meet expectations; a first-level classifier defined by the first classification algorithm and the primary parameters is constructed; a primary test set with labeled classification results is selected; the primary test set is classified using the first-level classifier, generating a secondary training set composed of the test classification results and the labeled classification results; a second classification algorithm is selected from the classification algorithm set; the parameters corresponding to the second classification algorithm are optimized using the secondary training set to obtain secondary parameters that meet expectations; a secondary classifier defined by the second classification algorithm and the secondary parameters is constructed; and the first-level classifier and the secondary classifier are combined into a combined classifier that is used to classify data. Combining accurate classifiers in this way can improve classification accuracy.

Description

Data classification method and system
Technical field
This application relates to big data technology, and in particular to a method and system that apply machine learning to data classification.
Background
In building a credit reporting system, combining machine learning algorithms with scoring rules can solve the problem of quantifying the credit of enterprises and individuals.
A computer learns from labeled samples according to a machine learning algorithm and can thereby induce the distribution pattern, or distribution rule, of the elements in the sample across different classes. Using the induced distribution pattern or rule, unlabeled samples can be classified; that is, the unlabeled elements are mapped to the classes they belong to.
In the prior art, there are various methods for classifying the credit data of a population. Common classification algorithms include decision trees, Bayesian methods, k-nearest neighbors, support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks.
When a classification algorithm is used to induce the distribution pattern of elements across classes, the labeled samples can serve as a training set from which parameters related to the attributes of the credit data are generated. These parameters influence which class an element is assigned to. The parameters usually correspond to a particular classification algorithm, and the two together are called a classification model, or classifier. The parameters are also called model parameters. To characterize the performance of a classifier, that is, the accuracy with which the classification algorithm and its corresponding parameters classify credit data samples, the classifier can be evaluated on a test set. When a classifier classifies the elements of the test set, the more elements it classifies correctly, the better its performance.
In implementing the prior art, the inventors found at least the following problem:
A population can be divided into several types according to attributes such as age, education, and assets. In general, different classifiers perform differently when classifying different types of credit data. That is, for the same type of population, different classifiers achieve different accuracy. No single classifier has an absolute accuracy advantage over the global sample, that is, over the entire population.
Therefore, a technical solution that classifies the data of the global sample with high accuracy is needed.
Summary of the invention
The embodiments of this application provide a technical solution that classifies the data of the global sample with high accuracy.
Specifically, a data classification method includes:
selecting a primary training set with labeled classification results;
selecting a first classification algorithm from a classification algorithm set;
optimizing, using the primary training set, the parameters corresponding to the first classification algorithm to obtain primary parameters that meet expectations;
constructing a first-level classifier defined by the first classification algorithm and the primary parameters;
selecting a primary test set with labeled classification results;
classifying the primary test set using the first-level classifier, and generating a secondary training set composed of the test classification results and the labeled classification results;
selecting a second classification algorithm from the classification algorithm set;
optimizing, using the secondary training set, the parameters corresponding to the second classification algorithm to obtain secondary parameters that meet expectations;
constructing a secondary classifier defined by the second classification algorithm and the secondary parameters;
combining the first-level classifier and the secondary classifier into a combined classifier for classifying data;
classifying data using the combined classifier;
wherein the data are feature vectors of multi-dimensional attributes.
The embodiments of this application also provide a data classification system, including:
a storage module, configured to store a primary training set with labeled classification results, a primary test set, and a classification algorithm set;
a modeling module, configured to:
select a primary training set with labeled classification results;
select a first classification algorithm from the classification algorithm set;
optimize, using the primary training set, the parameters corresponding to the first classification algorithm to obtain primary parameters that meet expectations;
construct a first-level classifier defined by the first classification algorithm and the primary parameters;
select a primary test set of credit data with labeled classification results;
classify the primary test set using the first-level classifier, and generate a secondary training set composed of the test classification results and the labeled classification results;
select a second classification algorithm from the classification algorithm set;
optimize, using the secondary training set, the parameters corresponding to the second classification algorithm to obtain secondary parameters that meet expectations;
construct a secondary classifier defined by the second classification algorithm and the secondary parameters;
combine the first-level classifier and the secondary classifier into a combined classifier;
a classification module, configured to classify data using the combined classifier.
The data classification method and system provided by the embodiments of this application have at least the following beneficial effect:
Combining accurate classifiers can improve classification accuracy.
Brief description of the drawings
The drawings described here provide a further understanding of this application and form part of it. The illustrative embodiments of this application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the data classification process provided by an embodiment of this application.
Fig. 2 shows the relationship between the primary training set and the secondary training set provided by an embodiment of this application.
Fig. 3 is a flowchart of the data classification method provided by an embodiment of this application.
Fig. 4 is a schematic structural diagram of the data classification system provided by an embodiment of this application.
Detailed description
To make the purpose, technical solutions, and advantages of this application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments of the application and the corresponding drawings. The described embodiments are only some of the embodiments of this application, not all of them. All other embodiments that a person of ordinary skill in the art obtains from the embodiments in this application without creative effort fall within the scope of protection of this application.
Building a credit reporting system necessarily involves big data technology, in which machine learning and data mining algorithms play an important part. These algorithms and models are used to evaluate and predict the credit of enterprises and individuals quantitatively, which can guide how resources such as assets and cash flow are put into production at lower risk and thereby improve production efficiency.
The data in a credit reporting system are feature vectors of multi-dimensional attributes. For example, the data include, but are not limited to, attributes in dimensions such as name, gender, age, occupation, real estate, vehicles, securities, monthly income, monthly spending, credit limit, number of overdue payments, and maximum number of days overdue. Once the attribute values, or feature values, of these dimensions are quantified, they can be used to express quantitatively the credit level of an enterprise user or individual user.
The distribution of the data exhibits clustering. If the population is divided into several classes, the average credit value of the samples in each class can be used to estimate the credit value of each element in the sample. An element here can be one person. Therefore, the accuracy of each element's credit value depends on how accurately the elements in the sample are classified.
Common classification algorithms include decision trees, Bayesian methods, k-nearest neighbors, support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks.
Decision trees are one of the common methods for classification and prediction. The decision tree method induces classification rules from a training set with labeled classification results; that is, it builds an attribute-class relationship tree for the attributes of the sample. The tree selects different attributes as nodes according to certain rules, thereby building the relationship between attributes and classes. The tree can be built recursively from the top down. The leaf nodes of the tree are the classes, the non-leaf nodes are attributes, and the edges between nodes are the different value ranges of a node's attribute. Once the decision tree is built, an element to be labeled is compared against attribute values from the root node downward until it reaches a leaf node. The class of that leaf node is the class of the element. Commonly used decision tree algorithms include ID3, C4.5/C5.0, and CART. They differ mainly in the attribute selection strategy, the structure of the tree, whether and how pruning is used, and whether they can handle large data sets. When the attribute value ranges are chosen well, classification accuracy is high. Parameters of the decision tree, such as the attribute value ranges, can be optimized using the training set. One feasible approach is to choose the specific parameters that give the decision tree the highest classification accuracy on the data elements in the training set. In general, a classification algorithm together with its corresponding parameters is called a classification model or classifier. Choosing more suitable parameters amounts to optimizing the classification model, or classifier.
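The top-down traversal described above can be illustrated with a minimal sketch. The tree layout, the attribute names (`overdue_count`, `monthly_income`), the value ranges, and the class labels are all hypothetical and chosen only for illustration; the source does not specify a tree representation.

```python
# Hypothetical representation: a leaf is a class label (str); an internal
# node is a dict naming one attribute and listing half-open value ranges
# [low, high), each leading to a child node.
def classify(tree, element):
    """Walk from the root to a leaf, comparing attribute values on the way."""
    node = tree
    while isinstance(node, dict):
        value = element[node["attribute"]]
        for low, high, child in node["branches"]:
            if low <= value < high:
                node = child
                break
        else:
            raise ValueError(f"no branch covers {node['attribute']}={value}")
    return node  # the leaf's class is the element's class


# Illustrative tree; thresholds and labels are assumptions, not source data.
example_tree = {
    "attribute": "overdue_count",
    "branches": [
        (0, 1, "low_risk"),
        (1, float("inf"), {
            "attribute": "monthly_income",
            "branches": [
                (0, 5000, "high_risk"),
                (5000, float("inf"), "medium_risk"),
            ],
        }),
    ],
}
```

In this sketch the value ranges on the branches play the role of the parameters that the training set would be used to optimize.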
Bayesian classification algorithms classify elements based on Bayes' formula from probability theory. The algorithm uses Bayes' formula to compute the conditional probability that an element belongs to each class and selects the class with the largest conditional probability. Common Bayesian classification algorithms include naive Bayes and Bayesian networks. They differ in whether the attributes are assumed to be conditionally independent: naive Bayes assumes that the attributes are conditionally independent, whereas a Bayesian network assumes that some attributes are related. As with the decision tree method, the relationships between attributes can also be regarded as parameters corresponding to the classification algorithm.
The k-nearest-neighbor algorithm is an instance-based classification algorithm. It first defines a neighborhood, that is, sets the number of neighbors. It then decides the class of an element by voting, following the majority rule: the element is assigned the class held by most of its neighbors. Euclidean distance is generally used, that is, the K labeled samples nearest in Euclidean distance are chosen as the element's neighbors. The neighbors can vote with equal weight or with different weights. In weighted voting, the opinions of different neighbors carry different weights; in general, the closer the neighbor, the larger its weight. Likewise, the number of neighbors can be regarded as a parameter corresponding to the classification algorithm.
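A minimal sketch of the voting scheme just described, covering both equal and distance-weighted votes. The function name `knn_classify`, the data layout, and the inverse-distance weighting are assumptions made for illustration; the source only states that closer neighbors generally weigh more.

```python
import math
from collections import defaultdict


def knn_classify(train, query, k, weighted=False):
    """Classify `query` by a vote among its k nearest labeled samples.

    train: list of (feature_vector, label) pairs (assumed layout).
    weighted: if True, each neighbor votes with weight 1/distance
              (one possible weighting; the source does not fix a formula).
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        distance = math.dist(features, query)
        # The small constant avoids division by zero for exact matches.
        votes[label] += 1.0 / (distance + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)
```

With equal votes, a large K can let a distant but numerous class outvote the nearby one; weighting by distance reduces that effect.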
For classification algorithms such as support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks, the training sample error, the classification error, the weights of the attributes, and so on can be regarded as parameters corresponding to the classification algorithm.
Optimizing the parameters corresponding to a classification algorithm using a training set can improve the accuracy of data classification.
Referring to Fig. 1, the data classification method provided by an embodiment of this application includes the following steps:
S01: Select a primary training set with labeled classification results.
Table 1
Table 1 is an illustrative list of a data set with labeled classification results. In the list, the data of all users form one sample, and correspondingly each user is one sample element. Each element can have attributes in multiple dimensions, such as age and position. The classification result of each element in the sample can be labeled, for example as C1, C2, and C3. Specifically, C1, C2, and C3 can each take the value 0 or 1.
The primary training set selected here can be a randomly drawn part of the data set.
S02: Select a first classification algorithm from the classification algorithm set.
The classification algorithm set is a set of algorithms suitable for classification. It can include multiple algorithms such as decision trees, Bayesian classifiers, k-nearest neighbors, support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks. The decision tree, Bayesian classifier, and k-nearest-neighbor algorithms were briefly explained above; the remaining algorithms, such as support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks, are covered in dedicated works in the machine learning field and are not described again here. In this step, the embodiment selects one algorithm from the set. The step can also be repeated to select multiple classification algorithms.
S03: Using the primary training set, optimize the parameters corresponding to the first classification algorithm to obtain primary parameters that meet expectations.
For the decision tree algorithm, the value ranges of the attribute values, or feature values, affect the classification results of data elements. Further, different attributes can affect the classification results to different degrees. When the decision tree algorithm is used to induce classification rules, a series of candidate value ranges for the attribute values can be assumed, as can a series of candidate weights for the different attributes. The optimal value ranges of the attribute values, or the optimal weights of the different attributes, are then computed iteratively so that the decision tree's classification accuracy on the data elements of the primary training set reaches the expected value, or is the highest.
For the Bayesian classifier algorithm, the relationships between different attributes affect the classification results of data elements. Further, different attributes can affect the classification results to different degrees. When the Bayesian classifier algorithm is used to induce classification rules, a series of candidate degrees of correlation between certain attributes can be assumed, as can a series of candidate weights for the different attributes. The optimal correlation coefficients between attributes, or the optimal weights of the different attributes, are then computed iteratively so that the Bayesian classifier's classification accuracy on the data elements of the primary training set reaches the expected value, or is the highest.
For the k-nearest-neighbor algorithm, the number of neighbors of a data element affects its classification result. Further, different attributes can affect the classification results to different degrees. When the k-nearest-neighbor algorithm is used to induce classification rules, a series of candidate values for the number of neighbors can be assumed, as can a series of candidate weights for the different attributes. The optimal number of neighbors, or the optimal weights of the different attributes, are then computed iteratively so that the k-nearest-neighbor algorithm's classification accuracy on the data elements of the primary training set reaches the expected value, or is the highest.
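The iterative search over candidate parameter values can be sketched for the number of neighbors as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and accuracy on the training set is measured leave-one-out (each element is classified by the remaining elements), which is a choice made here because scoring a k-nearest-neighbor classifier on its own training points would trivially favor K = 1. The source does not prescribe this evaluation scheme.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Equal-vote k-nearest-neighbor prediction with Euclidean distance."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(train, k):
    """Fraction of training elements classified correctly by the others."""
    hits = 0
    for i, (features, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        hits += knn_predict(rest, features, k) == label
    return hits / len(train)


def select_k(train, candidates, target=None):
    """Try each candidate K in order.

    If `target` is given, return the first K whose accuracy reaches it
    (the "meets the expected value" case); otherwise return the K with
    the highest accuracy (the "highest accuracy" case).
    """
    best_k, best_acc = None, -1.0
    for k in candidates:
        acc = leave_one_out_accuracy(train, k)
        if target is not None and acc >= target:
            return k, acc
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

The same loop shape applies to other parameters mentioned above, such as attribute weights, by swapping the candidate list and the classifier being scored.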
Of course, for other classification algorithms, such as support vector machines, association-rule-based classifiers, ensemble learning, and artificial neural networks, the corresponding parameters may be the same as or different from those of the algorithms described above. Ultimately, applying the primary training set to the classification algorithm yields primary parameters that correspond to the first classification algorithm and meet expectations. For support vector machines, the primary parameters can include the training sample error, the classification error, and so on. For association-rule-based classifiers, ensemble learning, artificial neural networks, and the like, the primary parameters can include the weights of the attributes.
S04: Construct a first-level classifier defined by the first classification algorithm and the primary parameters.
As listed above, different classification algorithms often have primary parameters of their own particular kinds, and some classification algorithms also share common primary parameters. A classification algorithm and its corresponding primary parameters together form a classifier, or classification model, that classifies data samples. These primary parameters can also be regarded as the model parameters of the classification model.
S05: Select a primary test set with labeled classification results.
In this embodiment, a data set with labeled classification results can be selected to test the classification accuracy of the classifier.
Further, another embodiment of this application also provides a method for selecting the primary test set. Specifically, the data set is randomly divided into N sub-data sets of equal sample size;
one of the sub-data sets is taken as the primary test set;
the remaining N-1 sub-data sets are set as the primary training set corresponding to the primary test set.
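The three-part selection above can be sketched as follows. The function names and the fixed random seed are assumptions for illustration. Sub-data sets come out exactly equal in size only when the data set size is divisible by N; otherwise they differ by at most one element.

```python
import random


def split_folds(data, n, seed=0):
    """Randomly divide `data` into n sub-data sets of (near-)equal size."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return [shuffled[i::n] for i in range(n)]


def train_test_for_fold(folds, j):
    """Use sub-data set j as the primary test set and the rest for training."""
    test = folds[j]
    train = [item for i, fold in enumerate(folds) if i != j for item in fold]
    return train, test
```

Calling `train_test_for_fold` once for each index gives every sub-data set a turn as the primary test set.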
S06: Classify the primary test set using the first-level classifier, and generate a secondary training set composed of the test classification results and the labeled classification results.
In one feasible implementation provided by the embodiments of this application, the data set $S$ is randomly divided into $J$ sub-data sets of roughly equal size. One sub-data set $S_j$ is selected as the primary test set, and the remaining $J-1$ sub-data sets serve as the primary training set corresponding to that test set. From the classification algorithm set $\{z_1, z_2, \ldots, z_K\}$, the $k$-th algorithm, $k \in \{1, \ldots, K\}$, is selected in turn and trained on the primary training set, yielding a classifier, or classification model, whose index $-j$ indicates that the $j$-th sub-data set $S_j$ serves as the primary test set and the $J-1$ sub-data sets other than $S_j$ serve as the training set. Testing this classifier on the primary test set then yields a classification result $Z_{k,j}$, which is the classification of the $j$-th sub-data set $S_j$ by the first-level classifier corresponding to the $k$-th algorithm. The test classification results and the labeled classification results together form the secondary training set, $\{Z_{1,j}, Z_{2,j}, \ldots, Z_{K,j}, Y_j\}$, where $Y_j$ denotes the labeled classification results of the $j$-th sub-data set $S_j$.
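The construction of the secondary training set for one held-out sub-data set can be sketched as follows. The two toy learners (`fit_majority`, `fit_1nn`) stand in for the algorithms of the classification algorithm set and are assumptions for illustration, as is the convention that an algorithm is a function taking a training set and returning a prediction function.

```python
import math
from collections import Counter


def fit_majority(train):
    """Toy algorithm: always predict the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: majority


def fit_1nn(train):
    """Toy algorithm: predict the label of the nearest training sample."""
    return lambda x: min(train, key=lambda item: math.dist(item[0], x))[1]


def secondary_training_set(folds, j, algorithms):
    """Build rows (Z_1j, ..., Z_Kj, Y_j) for the held-out sub-data set j.

    Each of the K algorithms is trained on the other J-1 sub-data sets,
    then applied to every element of sub-data set j; the labeled result
    is appended as the last column.
    """
    test = folds[j]
    train = [item for i, fold in enumerate(folds) if i != j for item in fold]
    models = [fit(train) for fit in algorithms]  # one classifier per algorithm
    return [tuple(model(x) for model in models) + (y,) for x, y in test]
```

Each row pairs the K first-level predictions for one element with that element's labeled class, which is the input a second classification algorithm would be trained on.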
S07: Select a second classification algorithm from the classification algorithm set.
Similar to step S02, another classification algorithm can be selected here. The classification algorithm here can be any one of decision tree, Bayesian classifier, k-nearest neighbors, support vector machine, association-rule-based classifier, ensemble learning, and artificial neural network.
S08: Using the secondary training set, optimize the parameters corresponding to the second classification algorithm to obtain secondary parameters that meet expectations.
Similar to step S03, secondary parameters that meet expectations can be obtained here. The secondary parameters can include at least one of the value ranges of the test classification results, the relationships between test classification results, the number of neighbors of the test classification results, the training sample error of the test classification results, the classification error of the test classification results, and the weights of the test classification results.
S09: Construct a secondary classifier defined by the second classification algorithm and the secondary parameters.
Similar to step S04, a secondary classifier defined by the second classification algorithm and the secondary parameters can be obtained here.
S10: Combine the first-level classifier and the secondary classifier into a combined classifier for classifying data.
Further, in another embodiment provided by this application, combining the first-level classifier and the secondary classifier into a combined classifier for classifying data specifically includes: repeatedly drawing two classification algorithms from the classification algorithm set and building a different candidate combined classifier for each draw;
selecting a secondary test set with labeled classification results;
measuring the accuracy with which the different candidate combined classifiers classify the secondary test set;
selecting the candidate combined classifier with the highest accuracy;
classifying data using the selected combined classifier.
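The selection among candidate combined classifiers can be sketched as follows, assuming the candidates have already been built and are available as callables keyed by a name. The names `accuracy` and `select_best` and the dictionary layout are hypothetical.

```python
def accuracy(classifier, test_set):
    """Fraction of (features, label) pairs the classifier gets right."""
    return sum(classifier(x) == y for x, y in test_set) / len(test_set)


def select_best(candidates, secondary_test_set):
    """Score every candidate combined classifier and return the best one.

    candidates: dict mapping a name (for example the pair of algorithm
    names used to build it) to a callable classifier.
    Returns the winning name and the full score table.
    """
    scores = {
        name: accuracy(classifier, secondary_test_set)
        for name, classifier in candidates.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores
```

Keeping the full score table makes it easy to inspect how close the runner-up candidates were.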
The primary parameters are parameters that constrain the attributes of the data. The secondary parameters are parameters that constrain the test classification results and the labeled classification results, and so ultimately also constrain the attributes of the data. Therefore, the parameters of the combined classifier are still parameters that constrain the attributes of the data, and the combined classifier can classify data.
S11: Classify data using the combined classifier.
In the embodiments provided by this application, the parameters corresponding to the first classification algorithm are optimized using the primary training set to obtain primary parameters that meet expectations, and a first-level classifier defined by the first classification algorithm and the primary parameters is constructed. This step yields an optimized first-level classifier for each classification algorithm. Further, the primary test set is classified using the first-level classifiers to generate test classification results, so that the first-level classifier with the highest accuracy in the test classification results can be selected; that is, this step yields the best first-level classifier among the various classification algorithms. Further, the parameters corresponding to the second classification algorithm are optimized using the secondary training set to obtain secondary parameters that meet expectations, and a secondary classifier defined by the second classification algorithm and the secondary parameters is constructed. This step yields the secondary classifier that is most complementary to the first-level classifier; that is, combining these steps ultimately yields the combined classifier with the highest classification accuracy.
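How a first-level classifier and a secondary classifier chain into one combined classifier can be sketched end to end. Both learners are stand-ins chosen for brevity and are assumptions: a 1-nearest-neighbor first level, and a second level that maps each first-level output to the label most often seen with it. Parameter optimization (S03, S08) is omitted because these toy learners have no tunable parameters.

```python
import math
from collections import Counter, defaultdict


def fit_1nn(train):
    """First-level algorithm (assumed): 1-nearest neighbor on features."""
    return lambda x: min(train, key=lambda item: math.dist(item[0], x))[1]


def fit_lookup(rows):
    """Second-level algorithm (assumed): for each first-level output z,
    predict the labeled class that most often accompanied z."""
    table = defaultdict(Counter)
    for z, y in rows:
        table[z][y] += 1
    fallback = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda z: table[z].most_common(1)[0][0] if z in table else fallback


def build_combined(primary_train, primary_test):
    first = fit_1nn(primary_train)                              # S04
    secondary_train = [(first(x), y) for x, y in primary_test]  # S06
    second = fit_lookup(secondary_train)                        # S09
    return lambda x: second(first(x))                           # S10
```

In this arrangement the second level sees only the first level's outputs, so it can correct systematic errors of the first level but cannot recover information the first level discards.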
The above is the data classification method provided by the embodiments of this application. Based on the same idea, and referring to Fig. 4, this application also provides a data classification system 1, including:
a storage module 11, configured to store a primary training set with labeled classification results, a primary test set, and a classification algorithm set;
a modeling module 12, configured to:
select a primary training set with labeled classification results;
select a first classification algorithm from the classification algorithm set;
optimize, using the primary training set, the parameters corresponding to the first classification algorithm to obtain primary parameters that meet expectations;
construct a first-level classifier defined by the first classification algorithm and the primary parameters;
select a primary test set with labeled classification results;
classify the primary test set using the first-level classifier, and generate a secondary training set composed of the test classification results and the labeled classification results;
select a second classification algorithm from the classification algorithm set;
optimize, using the secondary training set, the parameters corresponding to the second classification algorithm to obtain secondary parameters that meet expectations;
construct a secondary classifier defined by the second classification algorithm and the secondary parameters;
combine the first-level classifier and the secondary classifier into a combined classifier;
a classification module 13, configured to classify data using the combined classifier.
Further, in another embodiment provided by this application, the storage module 11 stores a data set with labeled classification results;
the modeling module 12 is configured to:
randomly divide the data set into J sub-data sets of equal sample size;
take one of the sub-data sets as the primary test set;
set the remaining J-1 sub-data sets as the primary training set corresponding to the primary test set.
Further, in another embodiment provided by the present application, the first classification algorithm includes at least one of a decision tree, a Bayesian classifier, k-nearest neighbors, a support vector machine, an association-rule-based classifier, ensemble learning, and an artificial neural network;
The primary parameters include at least one of the value range of an attribute, the correlation between attributes, the number of neighbors, the training sample error, the classification error, and the weight of an attribute.
Further, in another embodiment provided by the present application, the second classification algorithm includes at least one of a decision tree, a Bayesian classifier, k-nearest neighbors, a support vector machine, an association-rule-based classifier, ensemble learning, and an artificial neural network;
The secondary parameters include at least one of the value range of the test classification results, the correlation between test classification results, the number of neighbors of a test classification result, the training sample error of the test classification results, the classification error of the test classification results, and the weight of a test classification result.
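As one concrete reading of "optimizing a parameter until it meets the requirements", the sketch below tunes the number of neighbors of a k-nearest-neighbors classifier. It is a hedged example: the leave-one-out error estimate, the candidate list, the `max_error` requirement and all function names are assumptions introduced for illustration, and the application does not prescribe this particular search.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training samples."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_error(train, k):
    """Leave-one-out classification error of k-NN on the training set."""
    errors = 0
    for i, (x, y) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        errors += knn_predict(rest, x, k) != y
    return errors / len(train)

def optimize_k(train, candidates, max_error):
    """Return the candidate k with the lowest error, or None if even the
    best candidate does not meet the (hypothetical) error requirement."""
    best_k = min(candidates, key=lambda k: loo_error(train, k))
    return best_k if loo_error(train, best_k) <= max_error else None

# Toy one-dimensional training set, invented for illustration.
train = [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 0),
         (0.7, 1), (0.8, 1), (0.9, 1), (1.0, 1)]
best_k = optimize_k(train, [1, 3, 7], max_error=0.1)
```

The same search pattern applies to the secondary parameters, with the secondary training set in place of the primary training set.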
Further, in another embodiment provided by the present application, the modeling module 12, when combining the primary classifier and the secondary classifier to form a combined classifier for classifying data, is specifically configured to:
Repeatedly draw two classification algorithms from the classification algorithm set, and build a different candidate combined classifier for each draw;
Select a secondary test set of data labeled with classification results;
Measure the accuracy with which each candidate combined classifier classifies the secondary test set;
Select the candidate combined classifier with the highest accuracy;
Classify data using the selected combined classifier.
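The selection among candidate combined classifiers can be sketched as below. The three toy algorithms, the data, and the use of ordered pairs of distinct algorithms (`itertools.permutations`) are assumptions made for this illustration; the application only states that two algorithms are drawn repeatedly from the classification algorithm set.

```python
from collections import Counter
from itertools import permutations

def fit_threshold(train):
    """Toy algorithm: predict 1 when the input is at least the mean input."""
    cut = sum(x for x, _ in train) / len(train)
    return lambda x: 1 if x >= cut else 0

def fit_centroid(train):
    """Toy algorithm: predict the class whose mean input is nearest."""
    sums = {}
    for x, y in train:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    means = {y: total / count for y, (total, count) in sums.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

def fit_majority(train):
    """Toy algorithm: always predict the most frequent training label."""
    top = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: top

def build_combined(fit_first, fit_second, primary_train, primary_test):
    """Train the primary classifier, derive the secondary training set from
    its outputs on the primary test set, then chain the two classifiers."""
    primary = fit_first(primary_train)
    secondary = fit_second([(primary(x), y) for x, y in primary_test])
    return lambda x: secondary(primary(x))

def accuracy(clf, data):
    return sum(clf(x) == y for x, y in data) / len(data)

def select_best(algorithms, primary_train, primary_test, secondary_test):
    """Score every ordered pair of distinct algorithms on the secondary test
    set and return (accuracy, (first name, second name), classifier)."""
    scored = []
    for a, b in permutations(algorithms, 2):
        clf = build_combined(algorithms[a], algorithms[b], primary_train, primary_test)
        scored.append((accuracy(clf, secondary_test), (a, b), clf))
    return max(scored, key=lambda t: t[0])

algorithms = {"threshold": fit_threshold, "centroid": fit_centroid, "majority": fit_majority}
primary_train = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1)]
primary_test = [(0.15, 0), (0.35, 0), (0.7, 1), (0.95, 1)]
secondary_test = [(0.05, 0), (0.4, 0), (0.6, 1), (0.85, 1)]
best_acc, best_pair, best_clf = select_best(algorithms, primary_train, primary_test, secondary_test)
```

Keeping the secondary test set separate from the data used to train either level means the reported accuracy is not measured on samples the candidates have already seen.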
In the embodiments provided by the present application, the parameters corresponding to the first classification algorithm are optimized using the primary training set to obtain primary parameters that meet the requirements, and a primary classifier defined by the first classification algorithm and the primary parameters is constructed; this step yields an optimized primary classifier for each classification algorithm. Further, the primary test set is classified using the primary classifier to generate test classification results, so that the primary classifier with the highest accuracy on the test classification results can be selected; that is, this step yields the best primary classifier among the various classification algorithms. Further, the parameters corresponding to the second classification algorithm are optimized using the secondary training set to obtain secondary parameters that meet the requirements, and a secondary classifier defined by the second classification algorithm and the secondary parameters is constructed; this step yields the secondary classifier that is most complementary to the primary classifier. In other words, by combining these steps, the combined classifier with the highest classification accuracy is finally obtained.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The foregoing describes only embodiments of the present application and is not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (10)

1. A data classification method, characterized by comprising:
Selecting a primary training set of data labeled with classification results;
Selecting a first classification algorithm from a classification algorithm set;
Using the primary training set, optimizing the parameters corresponding to the first classification algorithm to obtain primary parameters that meet the requirements;
Constructing a primary classifier defined by the first classification algorithm and the primary parameters;
Selecting a primary test set of data labeled with classification results;
Classifying the primary test set using the primary classifier, and generating a secondary training set composed of the test classification results and the labeled classification results;
Selecting a second classification algorithm from the classification algorithm set;
Using the secondary training set, optimizing the parameters corresponding to the second classification algorithm to obtain secondary parameters that meet the requirements;
Constructing a secondary classifier defined by the second classification algorithm and the secondary parameters;
Combining the primary classifier and the secondary classifier to form a combined classifier for classifying data;
Classifying data using the combined classifier;
Wherein the data are feature vectors of multi-dimensional attributes.
2. The method of claim 1, characterized in that the primary training set and the primary test set satisfy the following relationship:
A data set is randomly divided into J sub-data sets of equal sample size;
One of the sub-data sets is taken as the primary test set;
The remaining J-1 sub-data sets are taken as the primary training set corresponding to that primary test set.
3. The method of claim 1, characterized in that the first classification algorithm includes at least one of a decision tree, a Bayesian classifier, k-nearest neighbors, a support vector machine, an association-rule-based classifier, ensemble learning, and an artificial neural network;
The primary parameters include at least one of the value range of an attribute, the correlation between attributes, the number of neighbors, the training sample error, the classification error, and the weight of an attribute.
4. The method of claim 1, characterized in that the second classification algorithm includes at least one of a decision tree, a Bayesian classifier, k-nearest neighbors, a support vector machine, an association-rule-based classifier, ensemble learning, and an artificial neural network;
The secondary parameters include at least one of the value range of the test classification results, the correlation between test classification results, the number of neighbors of a test classification result, the training sample error of the test classification results, the classification error of the test classification results, and the weight of a test classification result.
5. The method of claim 1, characterized in that combining the primary classifier and the secondary classifier to form a combined classifier for classifying data specifically comprises:
Repeatedly drawing two classification algorithms from the classification algorithm set, and building a different candidate combined classifier for each draw;
Selecting a secondary test set of data labeled with classification results;
Measuring the accuracy with which each candidate combined classifier classifies the secondary test set;
Selecting the candidate combined classifier with the highest accuracy;
Classifying credit data using the selected combined classifier.
6. A data classification system, characterized by comprising:
A storage module, configured to store a primary training set of data labeled with classification results, a primary test set, and a classification algorithm set;
A modeling module, configured to:
Select a primary training set of data labeled with classification results;
Select a first classification algorithm from the classification algorithm set;
Using the primary training set, optimize the parameters corresponding to the first classification algorithm to obtain primary parameters that meet the requirements;
Construct a primary classifier defined by the first classification algorithm and the primary parameters;
Select a primary test set of credit data labeled with classification results;
Classify the primary test set using the primary classifier, and generate a secondary training set composed of the test classification results and the labeled classification results;
Select a second classification algorithm from the classification algorithm set;
Using the secondary training set, optimize the parameters corresponding to the second classification algorithm to obtain secondary parameters that meet the requirements;
Construct a secondary classifier defined by the second classification algorithm and the secondary parameters;
Combine the primary classifier and the secondary classifier to form a combined classifier;
A classification module, configured to classify data using the combined classifier.
7. The classification system of claim 6, characterized in that the storage module stores a data set of data labeled with classification results;
The modeling module is configured to:
Randomly divide the data set into J sub-data sets of equal sample size;
Take one of the sub-data sets as the primary test set;
Take the remaining J-1 sub-data sets as the primary training set corresponding to that primary test set.
8. The classification system of claim 6, characterized in that the first classification algorithm includes at least one of a decision tree, a Bayesian classifier, k-nearest neighbors, a support vector machine, an association-rule-based classifier, ensemble learning, and an artificial neural network;
The primary parameters include at least one of the value range of an attribute, the correlation between attributes, the number of neighbors, the training sample error, the classification error, and the weight of an attribute.
9. The classification system of claim 6, characterized in that the second classification algorithm includes at least one of a decision tree, a Bayesian classifier, k-nearest neighbors, a support vector machine, an association-rule-based classifier, ensemble learning, and an artificial neural network;
The secondary parameters include at least one of the value range of the test classification results, the correlation between test classification results, the number of neighbors of a test classification result, the training sample error of the test classification results, the classification error of the test classification results, and the weight of a test classification result.
10. The classification system of claim 6, characterized in that the modeling module, when combining the primary classifier and the secondary classifier to form a combined classifier for classifying data, is specifically configured to:
Repeatedly draw two classification algorithms from the classification algorithm set, and build a different candidate combined classifier for each draw;
Select a secondary test set of data labeled with classification results;
Measure the accuracy with which each candidate combined classifier classifies the secondary test set;
Select the candidate combined classifier with the highest accuracy;
Classify data using the selected combined classifier.
CN201511020318.7A 2015-12-30 2015-12-30 The sorting technique and system of data Pending CN106934410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511020318.7A CN106934410A (en) 2015-12-30 2015-12-30 The sorting technique and system of data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511020318.7A CN106934410A (en) 2015-12-30 2015-12-30 The sorting technique and system of data

Publications (1)

Publication Number Publication Date
CN106934410A true CN106934410A (en) 2017-07-07

Family

ID=59441495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511020318.7A Pending CN106934410A (en) 2015-12-30 2015-12-30 The sorting technique and system of data

Country Status (1)

Country Link
CN (1) CN106934410A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107560845A (en) * 2017-09-18 2018-01-09 华北电力大学 A kind of Fault Diagnosis of Gear Case method for building up and device
CN109087145A (en) * 2018-08-13 2018-12-25 阿里巴巴集团控股有限公司 Target group's method for digging, device, server and readable storage medium storing program for executing
CN109324604A (en) * 2018-11-29 2019-02-12 中南大学 A kind of intelligent train resultant fault analysis method based on source signal
CN110134646A (en) * 2019-05-24 2019-08-16 安徽芃睿科技有限公司 The storage of knowledge platform service data and integrated approach and system
CN110134646B (en) * 2019-05-24 2021-09-07 安徽芃睿科技有限公司 Knowledge platform service data storage and integration method and system
CN112396114A (en) * 2020-11-20 2021-02-23 中国科学院深圳先进技术研究院 Evaluation system, evaluation method and related product
CN112507170A (en) * 2020-12-01 2021-03-16 平安医疗健康管理股份有限公司 Data asset directory construction method based on intelligent decision and related equipment thereof
CN112801233A (en) * 2021-04-07 2021-05-14 杭州海康威视数字技术股份有限公司 Internet of things equipment honeypot system attack classification method, device and equipment

Similar Documents

Publication Publication Date Title
US11797838B2 (en) Efficient convolutional network for recommender systems
CN106934410A (en) The sorting technique and system of data
CN102567464B (en) Based on the knowledge resource method for organizing of expansion thematic map
CN107292350A (en) The method for detecting abnormality of large-scale data
Utari et al. Implementation of data mining for drop-out prediction using random forest method
CN106528874A (en) Spark memory computing big data platform-based CLR multi-label data classification method
CN101807254A (en) Implementation method for data characteristic-oriented synthetic kernel support vector machine
US20120109865A1 (en) Using affinity measures with supervised classifiers
Yu et al. Decision tree modeling for ranking data
AlMana et al. An overview of inductive learning algorithms
CN113326377A (en) Name disambiguation method and system based on enterprise incidence relation
CN110737805B (en) Method and device for processing graph model data and terminal equipment
CN105808582A (en) Parallel generation method and device of decision tree on the basis of layered strategy
CN107066328A (en) The construction method of large-scale data processing platform
Dahiya et al. A rank aggregation algorithm for ensemble of multiple feature selection techniques in credit risk evaluation
CN107193940A (en) Big data method for optimization analysis
Jha et al. Criminal behaviour analysis and segmentation using k-means clustering
Zhang et al. Research on borrower's credit classification of P2P network loan based on LightGBM algorithm
Bakhtyar et al. Freight transport prediction using electronic waybills and machine learning
Zeng et al. Decision tree classification model for popularity forecast of Chinese colleges
CN109784354A (en) Based on the non-parametric clustering method and electronic equipment for improving classification effectiveness
Woma et al. Comparisons of community detection algorithms in the YouTube network
Ma The Research of Stock Predictive Model based on the Combination of CART and DBSCAN
CN107103095A (en) Method for computing data based on high performance network framework
Rajkumar et al. A critical study and analysis of journal metric ‘CiteScore’cluster and regression analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170707

RJ01 Rejection of invention patent application after publication