CN109086453A

CN109086453A - Method and system for extracting label correlation from neighbor instances

Info

Publication number: CN109086453A
Application number: CN201810991693.3A
Authority: CN
Inventors: 施展; 冯丹; 杨蕾; 戴凯航; 方交凤; 刘上; 曹孟媛; 杨文鑫; 陈硕; 陈静
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2018-08-29
Filing date: 2018-08-29
Publication date: 2018-12-25

Abstract

The invention discloses a method and system for extracting label correlation from neighbor instances. The method comprises: clustering instance samples according to their features; finding K neighbor samples with similar features; obtaining the label sets of the K neighbor samples; calculating the label percentage set C of the neighbor samples; predicting label correlation with a decision tree; and combining the distribution information of the label set with the decision tree to predict the confidence score that label l_k is present. The invention extracts label correlation from neighbor instances, considers the relationships among all labels, treats the pairwise co-occurrence of labels in a subset of samples as a feature, and obtains the confidence score of a label from these label-correlation features. The invention can effectively improve the accuracy of multi-label classification.

Description

Method and system for extracting label correlation from neighbor instances

Technical field

The invention belongs to the fields of data mining and machine learning, and more particularly relates to a method and system for extracting label correlation from neighbor instances.

Background art

With the arrival of the big data era, every field of real life produces large amounts of multi-label data. Accurately obtaining sample labels helps improve the hit rate of text retrieval, image retrieval, and object recognition. Faced with explosively growing data, manually extracting valuable information is increasingly difficult, so automatically extracting sample labels with machine learning methods has become the main direction.

Data classification is an important branch of data mining research and an important means of solving practical problems, and it has received wide attention and study. Traditional classification methods assign each sample one and only one label. However, real-world objects often do not have a single unique semantic meaning but are ambiguous. In fields such as text mining and bioinformatics, the objects of study are multi-label. For example, in text classification a document may belong to multiple categories; yeast gene function classification is also a multi-label classification problem, where the yeast data set consists of 1500 genes and each gene has multiple function labels; in medical diagnosis, a disease may belong to multiple categories. Traditional single-label classification cannot meet these needs, and multi-label classification has become a research focus.

When classifying multi-label samples, traditional multi-label classification methods learn the mapping between sample features and labels and use this mapping to predict the class labels of unseen instances, without considering the relationships between labels. Labels, however, often occur in pairs, and statistics show that they are correlated. From the perspective of learning and prediction, these relationships provide useful information beyond the basic information, so taking label correlation into account helps improve the accuracy of an algorithm. The more label correlations are considered, the higher the complexity of the model: if only part of the label correlations are considered, the true dependencies cannot be captured; if all correlations are considered, the complex relationships among labels become intractable.

Summary of the invention

In view of the above defects or improvement needs of the prior art, the present invention provides a method and system for extracting label correlation from neighbor instances, thereby solving the technical problem that existing multi-label sample classification approaches have certain limitations.

To achieve the above object, according to one aspect of the present invention, a method for extracting label correlation from neighbor instances is provided, comprising:

(1) measuring the similarity between instance samples by the Euclidean distance between their feature vectors, and clustering the instance samples according to how near or far apart they are, wherein the labels of the instance samples within a cluster are consistent or associated;

(2) for an arbitrary target instance, finding, among the clustered instance samples, the k neighbor instance samples whose features are similar to those of the target instance, so as to obtain the label sets of the k neighbor instance samples of the target instance;

(3) calculating the label percentage set $C=\{c_{l_1},c_{l_2},c_{l_3},\ldots,c_{l_m}\}$ of the k neighbor instance samples, wherein $c_{l_m}$ is the label percentage with which the target instance contains the m-th label, obtained from the label sets of the k neighbor instance samples, and m denotes the number of labels;

(4) establishing a classifier with the label percentages as the features of the instance samples, constructing a topological graph of label importance, and forming a decision tree from the top down;

(5) combining the distribution information of the label set with the decision tree to predict the confidence score that the target instance has a label in the label set.

Preferably, in step (2), the k neighbor instance samples with the smallest Euclidean distance to the target instance are found among the clustered instance samples, so as to obtain the label sets of the k neighbor instance samples of the target instance.

Preferably, in step (3), the label percentage $c_{l_m}$ with which the target instance contains the m-th label is: $c_{l_m}=\frac{1}{k}\sum_{j=1}^{k} I_{Y_j}(l_m)$, wherein $Y_j$ ($j=1,2,\ldots,k$) is the label set of the j-th neighbor among the k neighbors of the target instance, and $I_{Y_j}(l_m)$ indicates whether the label set of the j-th neighbor of the target instance contains label $l_m$: if $l_m\in Y_j$, then $I_{Y_j}(l_m)=1$; otherwise, $I_{Y_j}(l_m)=0$.
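As a worked illustration of the label percentage, assume it is the fraction of the k neighbors whose label set contains the label in question. The neighbor label sets below are hypothetical values chosen only for the arithmetic, with k = 4 and three labels:

```latex
% Hypothetical neighbor label sets (k = 4):
%   Y_1 = \{l_1, l_2\},\quad Y_2 = \{l_1\},\quad Y_3 = \{l_1, l_2\},\quad Y_4 = \{l_3\}
\begin{aligned}
c_{l_1} &= \tfrac{1}{4}\,(1 + 1 + 1 + 0) = 0.75,\\
c_{l_2} &= \tfrac{1}{4}\,(1 + 0 + 1 + 0) = 0.50,\\
c_{l_3} &= \tfrac{1}{4}\,(0 + 0 + 0 + 1) = 0.25,\\
C &= \{0.75,\ 0.50,\ 0.25\}.
\end{aligned}
```

Under this reading, labels $l_1$ and $l_2$ co-occur in half of the neighbors, which is the kind of pairwise information the method treats as a feature.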

Preferably, in step (4), the input space of the classifier is the label percentage set $C=\{c_{l_1},c_{l_2},c_{l_3},\ldots,c_{l_m}\}$ of the neighbor samples, and the corresponding output space is $t\in\{0,1\}$: the decision tree judges whether the instance sample has a label in the label set; t is 1 if it does and 0 if it does not.

According to another aspect of the present invention, a system for extracting label correlation from neighbor instances is provided, comprising:

a clustering module, for measuring the similarity between instance samples by the Euclidean distance between their feature vectors and clustering the instance samples according to how near or far apart they are, wherein the labels of the instance samples within a cluster are consistent or associated;

a label set acquisition module, for finding, for an arbitrary target instance, the k neighbor instance samples among the clustered instance samples whose features are similar to those of the target instance, so as to obtain the label sets of the k neighbor instance samples of the target instance;

a label percentage calculation module, for calculating the label percentage set $C=\{c_{l_1},c_{l_2},c_{l_3},\ldots,c_{l_m}\}$ of the k neighbor instance samples, wherein $c_{l_m}$ is the label percentage with which the target instance contains the m-th label, obtained from the label sets of the k neighbor instance samples, and m denotes the number of labels;

a decision tree construction module, for establishing a classifier with the label percentages as the features of the instance samples, constructing a topological graph of label importance, and forming a decision tree from the top down;

a prediction module, for combining the distribution information of the label set with the decision tree to predict the confidence score that the target instance has a label in the label set.

In general, compared with the prior art, the above technical solution conceived by the present invention achieves the following beneficial effects: the present invention mainly finds similar neighbor samples, mines pairwise label co-occurrence from the label sets of small clusters of similar samples, uses it as label-correlation features, obtains the confidence score of a label from these features, and thereby predicts multiple labels and improves classification accuracy.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a method for extracting label correlation from neighbor instances provided by an embodiment of the present invention.

Detailed description of the embodiments

In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the present invention described below may be combined with one another as long as they do not conflict.

The present invention considers local label dependencies and proposes a method for obtaining label correlation from neighbor instances. Considering that, in practice, locally similar samples are correlated, it treats the distribution information of labels that occur in pairs in the label sets of a subset of samples as features, and uses a single-label classification method that has low complexity and can be parallelized to compute the probability that a label occurs. The present invention mainly finds similar neighbor samples, mines pairwise label co-occurrence from the label sets of small clusters of similar samples, uses it as label-correlation features, obtains the confidence score of a label from these features, and thereby predicts multiple labels and improves classification accuracy.

Fig. 1 shows a schematic flowchart of a method provided by an embodiment of the present invention, comprising the following steps:

(1) Cluster the instance samples according to their features: compute the Euclidean distance between the feature vectors of the instance samples to measure the similarity between them, and cluster the instance samples accordingly, wherein samples whose features are similar can be regarded as a cluster of feature-similar instances;
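The clustering step can be sketched as follows. The text does not name a clustering algorithm, so this minimal sketch assumes a small Lloyd-style k-means driven by Euclidean distance; the function names `euclidean` and `kmeans`, the seeding rule, and the sample data are illustrative assumptions, not part of the described method.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, n_clusters, n_iter=20):
    """Tiny Lloyd-style k-means (an assumed choice of clustering algorithm).

    The first n_clusters points seed the centroids, so the result is
    deterministic for a given input order.
    """
    centroids = [list(p) for p in points[:n_clusters]]
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assign every sample to its closest centroid.
        assignment = [
            min(range(n_clusters), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its current members.
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical two-dimensional feature vectors forming two obvious groups.
features = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
clusters = kmeans(features, n_clusters=2)
```

With this input, samples 0 and 2 share one cluster and samples 1 and 3 share the other; any other distance-based clustering could be substituted.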

(2) For an arbitrary target instance, obtain the sample label sets of its k neighbors: find the k neighbor samples whose features are similar to those of the target instance, so as to obtain the label sets of the k neighbor samples of the target instance;

Here, the label distribution information in the label set of the cluster is related to the labels of the target instance sample, and labels that co-occur in a high proportion are correlated.

Specifically, the k neighbor instance samples with the smallest Euclidean distance to the target instance are found among the clustered instance samples, so as to obtain the label sets of the k neighbor instance samples of the target instance.
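The neighbor lookup described above can be sketched with a brute-force search. The function name `k_nearest_label_sets` and the sample data are hypothetical, and `math.dist` requires Python 3.8 or later. For brevity the search runs over all supplied samples; restricting it to the samples of one cluster would be a straightforward filter applied beforehand.

```python
import math

def k_nearest_label_sets(target, samples, label_sets, k):
    """Return the label sets of the k samples closest to target.

    Closeness is the Euclidean distance between feature vectors;
    label_sets[i] is the set of labels attached to samples[i].
    """
    order = sorted(range(len(samples)),
                   key=lambda i: math.dist(target, samples[i]))
    return [label_sets[i] for i in order[:k]]

# Hypothetical samples with their label sets.
samples = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [0.0, 1.0]]
label_sets = [{"a", "b"}, {"a"}, {"c"}, {"a", "b"}]

neighbors = k_nearest_label_sets([0.1, 0.2], samples, label_sets, k=3)
```

Here the three nearest samples are indices 0, 3, and 1, in that order, so the distant sample labeled `c` contributes nothing to the target's neighborhood.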

(3) Calculate the label percentage set C of the k neighbor samples. In an embodiment of the present invention, a two-level nested loop may be used to compute the label percentage set C: the outer loop traverses the label set $L=\{l_1,l_2,l_3,\ldots,l_m\}$, and the inner loop traverses the labels possessed by each of the k neighbor instance samples of the target instance.

In an embodiment of the present invention, the label percentage $c_{l_m}$ with which the target instance contains the m-th label is: $c_{l_m}=\frac{1}{k}\sum_{j=1}^{k} I_{Y_j}(l_m)$, wherein $Y_j$ ($j=1,2,\ldots,k$) is the label set of the j-th neighbor among the k neighbors of the target instance, and $I_{Y_j}(l_m)$ indicates whether the label set of the j-th neighbor of the target instance contains label $l_m$: if $l_m\in Y_j$, then $I_{Y_j}(l_m)=1$; otherwise, $I_{Y_j}(l_m)=0$.
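The two-level loop can be sketched as below, assuming the label percentage of a label is the fraction of the k neighbors whose label set contains it. The function name `label_percentages` and the label names are illustrative assumptions.

```python
def label_percentages(labels, neighbor_label_sets):
    """Compute the label percentage of every label over the k neighbors.

    labels              -- the full label set L, as an ordered list
    neighbor_label_sets -- the label sets Y_1..Y_k of the k neighbors
    """
    k = len(neighbor_label_sets)
    if k == 0:
        raise ValueError("at least one neighbor is required")
    percentages = {}
    for label in labels:                    # outer loop: traverse L
        count = 0
        for y_j in neighbor_label_sets:     # inner loop: traverse the neighbors
            if label in y_j:                # indicator: 1 if the label is in Y_j
                count += 1
        percentages[label] = count / k
    return percentages

# Hypothetical neighbor label sets for one target instance (k = 3).
C = label_percentages(["a", "b", "c"], [{"a", "b"}, {"a", "b"}, {"a"}])
```

In this example every neighbor carries `a`, two of three carry `b`, and none carries `c`, so the resulting feature vector records that `a` and `b` tend to appear together around the target.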

(4) Predict label correlation with a decision tree: establish a classifier with the sample label percentages obtained in step (3) as the features of the instance samples, construct a topological graph of label importance, and form a decision tree from the top down; that is, establish a classifier whose features are the percentages of the different labels and whose objective function is whether the target instance possesses a label in the label set;

Here, the input space of the classifier is the label percentage set $C=\{c_{l_1},c_{l_2},c_{l_3},\ldots,c_{l_m}\}$ of the neighbor samples, and the corresponding output space is $t_i\in\{0,1\}$: the decision tree judges whether instance sample $x_i$ has a label in the label set; $t_i$ is 1 if it does and 0 if it does not.
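A minimal sketch of such a classifier follows, assuming one binary target per label and a CART-style tree grown top-down with Gini impurity. The splitting criterion, the depth limit, the function names, and the training data are assumptions made for illustration; the topological graph of label importance is not modeled here.

```python
def gini(targets):
    """Gini impurity of a list of 0/1 targets."""
    if not targets:
        return 0.0
    p = sum(targets) / len(targets)
    return 2.0 * p * (1.0 - p)

def best_split(rows, targets):
    """Return the (feature index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(targets)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(targets)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def build_tree(rows, targets, depth=0, max_depth=3):
    """Grow the tree top-down; a leaf stores the fraction of positive targets."""
    split = best_split(rows, targets) if depth < max_depth else None
    if split is None:
        return sum(targets) / len(targets)
    f, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > threshold]
    return (f, threshold,
            build_tree([r for r, _ in left], [t for _, t in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [t for _, t in right], depth + 1, max_depth))

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's positive fraction."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree

# Hypothetical training data: each row is a label percentage vector
# (c_l1, c_l2, c_l3); each target says whether that instance has the label.
rows = [[1.0, 0.8, 0.0], [0.9, 0.7, 0.1], [0.2, 0.1, 0.9], [0.1, 0.0, 1.0]]
targets = [1, 1, 0, 0]
tree = build_tree(rows, targets)
```

Because each tree depends only on its own target column, one tree per label can be trained independently, which fits the parallel single-label classification the description mentions.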

(5) distributed intelligence of comprehensive tag set and decision tree prediction object instance possess setting for the label in tag set Believe score.

Those skilled in the art will readily understand that the foregoing is merely a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A method for extracting label correlation from neighbor instances, characterized by comprising:

(1) measuring the similarity between instance samples by the Euclidean distance between their feature vectors, and clustering the instance samples according to how near or far apart they are, wherein the labels of the instance samples within a cluster are consistent or associated;

(2) for an arbitrary target instance, finding, among the clustered instance samples, the k neighbor instance samples whose features are similar to those of the target instance, so as to obtain the label sets of the k neighbor instance samples of the target instance;

(3) calculating the label percentage set $C=\{c_{l_1},c_{l_2},c_{l_3},\ldots,c_{l_m}\}$ of the k neighbor instance samples, wherein $c_{l_m}$ is the label percentage with which the target instance contains the m-th label, obtained from the label sets of the k neighbor instance samples, and m denotes the number of labels;

(4) establishing a classifier with the label percentages as the features of the instance samples, constructing a topological graph of label importance, and forming a decision tree from the top down;

(5) combining the distribution information of the label set with the decision tree to predict the confidence score that the target instance contains a label in the label set.

2. The method according to claim 1, characterized in that, in step (2), the k neighbor instance samples with the smallest Euclidean distance to the target instance are found among the clustered instance samples, so as to obtain the label sets of the k neighbor instance samples of the target instance.

3. The method according to claim 1 or 2, characterized in that, in step (3), the label percentage $c_{l_m}$ with which the target instance contains the m-th label is: $c_{l_m}=\frac{1}{k}\sum_{j=1}^{k} I_{Y_j}(l_m)$, wherein $Y_j$ ($j=1,2,\ldots,k$) is the label set of the j-th neighbor among the k neighbors of the target instance, and $I_{Y_j}(l_m)$ indicates whether the label set of the j-th neighbor of the target instance contains label $l_m$: if $l_m\in Y_j$, then $I_{Y_j}(l_m)=1$; otherwise, $I_{Y_j}(l_m)=0$.

4. The method according to claim 3, characterized in that, in step (4), the input space of the classifier is the label percentage set $C=\{c_{l_1},c_{l_2},c_{l_3},\ldots,c_{l_m}\}$ of the neighbor samples, and the corresponding output space is $t\in\{0,1\}$: the decision tree judges whether the instance sample has a label in the label set; t is 1 if it does and 0 if it does not.

5. A system for extracting label correlation from neighbor instances, characterized by comprising:

a clustering module, for measuring the similarity between instance samples by the Euclidean distance between their feature vectors and clustering the instance samples according to how near or far apart they are, wherein the labels of the instance samples within a cluster are consistent or associated;

a label set acquisition module, for finding, for an arbitrary target instance, the k neighbor instance samples among the clustered instance samples whose features are similar to those of the target instance, so as to obtain the label sets of the k neighbor instance samples of the target instance;

a decision tree construction module, for establishing a classifier with the label percentages as the features of the instance samples, constructing a topological graph of label importance, and forming a decision tree from the top down;

a prediction module, for combining the distribution information of the label set with the decision tree to predict the confidence score that the target instance contains a label in the label set.