CN107316063A - Multi-label classification method, apparatus, medium and computing device - Google Patents

Multi-label classification method, apparatus, medium and computing device

Info

Publication number
CN107316063A
CN107316063A (application CN201710493622.6A)
Authority
CN
China
Prior art keywords
sample
label
original
set
positive example
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710493622.6A
Other languages
Chinese (zh)
Inventor
翁伟
朱顺痣
钟瑛
李建敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University of Technology
Original Assignee
Xiamen University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University of Technology filed Critical Xiamen University of Technology
Priority to CN201710493622.6A priority Critical patent/CN107316063A/en
Publication of CN107316063A publication Critical patent/CN107316063A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroids


Abstract

The present application relates to the field of machine learning, and in particular to a multi-label classification method, apparatus, medium and device. In the embodiments of the present application, after the original positive-example set and original negative-example set of each label are obtained, class alignment is performed, specific attributes are determined, and the specific attributes of correlated labels are inserted, so that the correlation between labels is represented by specific attributes and the data and semantics of each label are enriched. Multi-label classification is thereby more accurate than prior-art methods that simply use single labels. For example, "desert" and "camel" are correlated, so a picture that is mostly camel with only a small amount of desert can still be classified as a desert picture. Likewise, for a picture of lake water at dusk containing a reflection of the setting sun, the prior art would classify the picture only as lake water; but because the reflection in the water is related to the setting sun, the scheme of the present application can also classify the picture into the dusk-scenery category.

Description

Multi-label classification method, apparatus, medium and computing device
Technical field
The present application relates to the field of machine learning, and in particular to a multi-label classification method, apparatus, medium and computing device.
Background technology
Multi-label problems are widespread in machine learning. For example, in image annotation, given labels such as "canoe", "water", "mountain peak", "bridge", "pedestrian", "setting sun" and "cloud", a picture of riverside scenery may be annotated with one or more of these labels. Likewise, in gene function classification, a gene may be associated with labels representing functional categories such as "energy" and "metabolism". Because the number of labels is large and manual labeling is slow, annotating by hand is impractical, so it is particularly important to study automatic multi-label classification based on computer technology.
In the related art, an object to be labeled (a multi-label object for short) is commonly described by an attribute vector and a label vector. The attribute vector describes the multi-label object's properties, and the label vector describes which labels it possesses. Specifically, labels are usually represented by a vector of "-1" and "+1" entries, where "-1" means the multi-label object does not have the corresponding label and "+1" means it does.
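As a minimal illustration of this representation (a sketch with invented toy data, not taken from the application), each multi-label object can be stored as an attribute vector plus a label vector of -1/+1 entries:

```python
# Toy multi-label dataset in the attribute-vector / label-vector form
# described above. The feature values and label assignments are
# invented for illustration only.
labels = ["canoe", "water", "mountain peak"]

samples = [
    # (attribute vector, label vector: +1 = has label, -1 = does not)
    ([0.2, 0.7], [+1, +1, -1]),   # riverside picture: canoe on water
    ([0.9, 0.1], [-1, -1, +1]),   # picture of a mountain peak
]

def has_label(label_vector, name):
    """True if the sample possesses the named label (+1 entry)."""
    return label_vector[labels.index(name)] == +1

def positive_set(samples, name):
    """Indices of the original positive-example set of one label."""
    return [i for i, (_, lv) in enumerate(samples) if has_label(lv, name)]

print(positive_set(samples, "water"))  # [0]
```

The same partition gives the original negative-example set as the complement, which is how the positive/negative sets of each label in the method below are formed.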
Although multi-label classification has been studied for some time, how to perform it well remains an extremely challenging problem. By comparison, the traditional single-label problem has been studied more thoroughly and its methods are relatively mature. If a multi-label problem is simply treated as a combination of several single-label problems, the results are often unsatisfactory. An important reason is that this approach ignores the relations between different labels, and those relations are important information that label prediction can exploit. For example, in a picture library containing the two labels "desert" and "camel", a picture that has the label "desert" is likely to also have the label "camel", because "desert" and "camel" often co-occur and are positively correlated. How to use the correlation between multiple labels to improve multi-label classification is therefore of great concern to both academia and industry.
Summary of the invention
The embodiments of the present application provide a multi-label classification method, apparatus, medium and computing device, to solve the prior-art problem that treating a multi-label problem simply as a combination of several single-label problems leads to inaccurate classification results.
A multi-label classification method provided by an embodiment of the present application includes:
for each label in a label set, determining the original positive-example set and original negative-example set of the label; for each sample, if the sample has the label, the sample belongs to the original positive-example set of the label, otherwise the sample belongs to the original negative-example set of the label;
performing class alignment on the original positive-example set and original negative-example set of each label to obtain a class-aligned positive-example set and a class-aligned negative-example set for each label; the class-aligned positive-example sets of all labels contain equal numbers of samples, and so do the class-aligned negative-example sets;
determining, according to a predetermined number of cluster centres and based on a clustering method, the cluster centres of each class-aligned positive-example set and of each class-aligned negative-example set;
for each label, calculating the distance from each sample in the label's original positive-example set and original negative-example set to each cluster centre of the label, arranging the resulting distances in order as the label's specific attributes corresponding to the respective sample, and forming the label's specific-attribute set with the specific attributes of each sample as elements;
for each label, inserting into the label's specific-attribute set the specific attributes of the other labels that are correlated with the label;
performing classification training based on the specific-attribute set of each label.
Another embodiment of the present application also provides a multi-label classification apparatus, including:
a positive/negative-example set determining module, configured to determine, for each label in the label set, the original positive-example set and original negative-example set of the label; for each sample, if the sample has the label, the sample belongs to the original positive-example set of the label, otherwise to the original negative-example set of the label;
a class alignment module, configured to perform class alignment on the original positive-example set and original negative-example set of each label, obtaining a class-aligned positive-example set and a class-aligned negative-example set for each label; the class-aligned positive-example sets of all labels contain equal numbers of samples, and so do the class-aligned negative-example sets;
a cluster centre determining module, configured to determine, according to a predetermined number of cluster centres and based on a clustering method, the cluster centres of each class-aligned positive-example set and of each class-aligned negative-example set;
a specific-attribute determining module, configured to calculate, for each label, the distance from each sample in the label's original positive-example set and original negative-example set to each cluster centre of the label, arrange the resulting distances in order as the label's specific attributes corresponding to the respective sample, and form the label's specific-attribute set with the specific attributes of each sample as elements;
a data optimization module, configured to insert, for each label, the specific attributes of the other labels correlated with the label into the label's specific-attribute set;
a classification training module, configured to perform classification training based on the specific-attribute set of each label.
Another embodiment of the present application further provides a computing device, including a memory and a processor, where the memory is configured to store program instructions, and the processor is configured to call the program instructions stored in the memory and perform, according to the obtained program instructions, any multi-label classification method in the embodiments of the present application.
Another embodiment of the present application further provides a computer storage medium storing computer-executable instructions for causing a computer to perform any multi-label classification method in the embodiments of the present application.
In the embodiments of the present application, after the original positive-example set and original negative-example set of each label are obtained, class alignment is performed, specific attributes are determined, and the specific attributes of correlated labels are inserted, so that the correlation between labels is represented by specific attributes and the data and semantics of each label are enriched. Multi-label classification is thereby more accurate than prior-art methods that simply use single labels. For example, "desert" and "camel" are correlated, so a picture that is mostly camel with only a small amount of desert can still be classified as a desert picture. Likewise, for a picture of lake water at dusk containing a reflection of the setting sun, the prior art would classify the picture only as lake water; but because the reflection in the water is related to the setting sun, the scheme of the present application can also classify the picture into the dusk-scenery category.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the multi-label classification method provided by Embodiment 1 of the present application;
Fig. 2 is a structural schematic diagram of the multi-label classification apparatus provided by Embodiment 3 of the present application;
Fig. 3 is a structural schematic diagram of the computing device provided by Embodiment 4 of the present application.
Detailed description of the embodiments
The embodiments of the present application provide a multi-label classification method, apparatus, medium and computing device. In the multi-label classification method provided by the embodiments of the present application, after the original positive-example set and original negative-example set of each label are obtained, class alignment is performed, specific attributes are determined, and the specific attributes of correlated labels are inserted, so that the correlation between labels is represented by specific attributes and the data and semantics of each label are enriched. Multi-label classification is thereby more accurate than prior-art methods that simply use single labels. For example, "desert" and "camel" are correlated, so a picture that is mostly camel with only a small amount of desert can still be classified as a desert picture. Likewise, for a picture of lake water at dusk containing a reflection of the setting sun, the prior art would classify the picture only as lake water; but because the reflection in the water is related to the setting sun, the scheme of the present application can also classify the picture into the dusk-scenery category.
The embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Embodiment one
Referring to Fig. 1, a schematic flowchart of the multi-label classification method provided by Embodiment 1 of the present application, the method includes the following steps:
Step 101: for each label in the label set, determine the original positive-example set and original negative-example set of the label; for each sample, if the sample has the label, the sample belongs to the original positive-example set of the label, otherwise to the original negative-example set of the label.
Step 102: perform class alignment on the original positive-example set and original negative-example set of each label, obtaining a class-aligned positive-example set and a class-aligned negative-example set for each label; the class-aligned positive-example sets of all labels contain equal numbers of samples, and so do the class-aligned negative-example sets.
For example, suppose there are two labels, label 1 and label 2, where the original positive-example set of label 1 contains 10 samples and that of label 2 contains 8. During class alignment, because label 2's original positive-example set has fewer samples, positive examples must be added to it so that the positive-example sets of label 1 and label 2 contain equal numbers of samples.
The purpose and method of class alignment for the original negative-example sets are the same and are not repeated here. Class alignment is described in detail with an example below.
Step 103: determine, according to a predetermined number of cluster centres and based on a clustering method, the cluster centres of each class-aligned positive-example set and of each class-aligned negative-example set.
The clustering method may be chosen according to actual requirements, for example the k-means clustering analysis method of the prior art; the present application is not limited in this respect.
In one embodiment, the number of cluster centres may be determined by the following steps A1-A2:
Step A1: determine the minimum of the sample number of the class-aligned positive-example set and the sample number of the class-aligned negative-example set.
Step A2: calculate the product of a preset control variable and the determined minimum, and take the floor of the product to obtain the number of cluster centres; the preset control variable is a constant greater than 0 and less than 1.
The above steps A1-A2 can be expressed equivalently by the following formula (1):
c = ⌊r · min(n⁺, n⁻)⌋ (1)
In formula (1), c denotes the number of cluster centres; r denotes the preset control variable; n⁺ denotes the sample number of the class-aligned positive-example set; n⁻ denotes the sample number of the class-aligned negative-example set; and min(n⁺, n⁻) takes the minimum of the two sample numbers.
Because class alignment has already been performed, once the value of r is fixed, the value of c is the same for every label. This way of determining the number of cluster centres is therefore simple and easy to execute, and can improve processing efficiency.
Of course, in a specific implementation the number of cluster centres may also be determined by other prior-art methods; the present application is not limited in this respect.
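Formula (1) above can be sketched directly; the function name and the example numbers are ours, but the rule, the floor of the control variable times the smaller of the two aligned set sizes, follows the formula:

```python
import math

def cluster_centre_count(n_pos, n_neg, r):
    """Formula (1): c = floor(r * min(n_pos, n_neg)), where r is the
    preset control variable (0 < r < 1) and n_pos / n_neg are the
    sample numbers of the class-aligned positive and negative sets."""
    if not 0 < r < 1:
        raise ValueError("control variable must satisfy 0 < r < 1")
    return math.floor(r * min(n_pos, n_neg))

print(cluster_centre_count(10, 8, 0.5))  # 4
```

Because class alignment makes the set sizes identical across labels, a fixed r yields the same c for every label, as the text notes.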
Step 104: for each label, calculate the distance from each sample in the label's original positive-example set and original negative-example set to each cluster centre of the label, arrange the resulting distances in order as the label's specific attributes corresponding to the respective sample, and form the label's specific-attribute set with the specific attributes of each sample as elements.
Suppose the cluster centres obtained for a label l_k are c_1^{k+}, …, c_c^{k+} and c_1^{k-}, …, c_c^{k-}, where c_j^{k+} denotes a cluster centre of the class-aligned positive-example set and c_j^{k-} denotes a cluster centre of the class-aligned negative-example set.
Then the specific attributes of each sample x_i for label l_k can be obtained by the attribute transfer function shown in formula (2):
φ_k(x_i) = [d(x_i, c_1^{k+}), …, d(x_i, c_c^{k+}), d(x_i, c_1^{k-}), …, d(x_i, c_c^{k-})] (2)
In formula (2), φ_k(x_i) denotes the specific attributes of label l_k corresponding to sample x_i.
Here d(·, ·) denotes a distance; d(x_i, c_1^{k+}), for example, denotes the distance between sample x_i and cluster centre c_1^{k+}. In a specific implementation, this distance may be the Euclidean distance.
In this way all samples are converted, and the specific-attribute set of l_k can be expressed by the following formula (3):
D_k = {(φ_k(x_i), y_i) | 1 ≤ i ≤ N} (3)
In formula (3), D denotes the data set, which can be expressed as D = {(x_i, y_i) | 1 ≤ i ≤ N}, where x_i is the i-th sample, represented by its attribute vector, y_i is the label vector of that sample, and N is the number of samples.
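Under the same definitions, the attribute transfer function of formula (2) is just a vector of distances to the positive-set and negative-set cluster centres of one label. A minimal sketch (Euclidean distance, toy centres of our own invention):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def specific_attributes(x, pos_centres, neg_centres):
    """Formula (2): the specific attributes of one label for sample x
    are its distances to that label's positive-set cluster centres,
    followed by its distances to the negative-set cluster centres."""
    return ([euclidean(x, c) for c in pos_centres] +
            [euclidean(x, c) for c in neg_centres])

x = [1.0, 2.0]
pos_centres = [[1.0, 2.0]]   # c = 1 centre per set, as in Embodiment 2
neg_centres = [[4.0, 6.0]]
print(specific_attributes(x, pos_centres, neg_centres))  # [0.0, 5.0]
```

Applying this function to every sample produces the specific-attribute set of formula (3) for the label.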
Step 105: for each label, insert into the label's specific-attribute set the specific attributes of the other labels correlated with the label.
Step 106: perform classification training based on the specific-attribute set of each label.
In a specific implementation, a corresponding binary classifier may be trained for each label. Common binary classifiers include support vector machines, decision trees, and so on, and may be selected according to the particular problem; the present application is not limited in this respect.
In summary, class alignment keeps the numbers of positive-example samples and of negative-example samples equal across labels, providing basic data of uniform quantity for the later determination of specific attributes. Determining specific attributes establishes associations between correlated labels and enriches the data and semantics of each label's specific attributes, so that training on samples with such enriched data and semantics gives more accurate results.
To facilitate further understanding of the technical solution provided by the present application, the relevant steps above are further described in points (1)-(2) below.
(1) Inserting, for each label as described in step 105, the specific attributes of the other labels correlated with the label into the label's specific-attribute set may specifically include the following steps B1-B3:
Step B1: for a given sample, the given sample and multiple neighbour samples of the given sample form a neighbour sample set corresponding to the given sample.
In one embodiment, the neighbour sample set corresponding to a given sample may be determined as follows:
Step B11: calculate the sample difference between each other sample and the given sample.
The sample difference represents the gap between samples, and the method of calculating it may be chosen according to the specific samples in a specific implementation. For example, if a sample is the distance of some location from a reference point, the sample difference may be represented by the deviation between the two samples' distances; if a sample is the color value of a pixel, the sample difference of two pixels may be the color difference of the two samples. The embodiments of the present application therefore do not limit the specific method of calculating the sample difference.
Step B12: choose, in order of sample difference from small to large, a preset number of neighbour samples as the neighbour samples of the given sample.
For example, suppose the sample differences between 5 samples and the given sample are 1, 2, 3, 4 and 5 respectively. If the preset number of neighbour samples is 3, the samples whose differences are 1, 2 and 3 are chosen as the neighbour samples of the given sample.
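Steps B11-B12 amount to picking the preset number of samples with the smallest difference from the given sample. A sketch (the difference values are toy numbers):

```python
def nearest_neighbours(differences, k):
    """Step B12: indices of the k samples with the smallest sample
    difference from the given sample, ordered by increasing difference.
    differences[i] is the difference between candidate i and the
    given sample."""
    order = sorted(range(len(differences)), key=lambda i: differences[i])
    return order[:k]

differences = [4, 1, 5, 2, 3]          # one value per candidate sample
print(nearest_neighbours(differences, 3))  # [1, 3, 4] -> differences 1, 2, 3
```

The given sample itself plus the chosen neighbours then form the neighbour sample set of step B1.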
In addition, in a specific implementation the neighbour samples may also be determined from the positional relations between samples. For example, if the given sample is a pixel in an image, the 4-neighbourhood or 8-neighbourhood of that pixel may be selected as its neighbour samples.
Step B2: in each neighbour sample set of the multiple neighbour sample sets, determine, for each label, the frequency with which the label and another label are simultaneously positive examples of the same sample as the co-occurrence frequency of the two labels; and determine the maximum such co-occurrence frequency of the label in the neighbour sample set.
Step B3: if the maximum co-occurrence frequency is greater than a designated value, insert the specific attributes, corresponding to the given sample of the neighbour sample set, of the other label having the maximum co-occurrence frequency with the label into the label's specific-attribute set.
In one embodiment, the designated value may be set to 0; that is, as long as the maximum co-occurrence frequency in the neighbour sample set is not 0, the two labels are considered correlated.
For example, take the neighbour sample set M' corresponding to a given sample M. In M', suppose that for label l2 the co-occurrence frequency of l2 with l1 is maximal and not 0; then the specific attributes of l2 corresponding to the given sample M are added to the specific-attribute set of l1.
In a specific implementation, to improve the accuracy of the calculation, the co-occurrence frequency may be determined as the count p(i, j, k), where i denotes the given sample, l_j denotes the label whose co-occurrence frequency is to be determined, l_k denotes another label, and p(i, j, k) denotes the co-occurrence frequency of l_j and l_k in the neighbour sample set corresponding to the given sample i.
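The co-occurrence count p(i, j, k) of steps B2-B3 can be sketched as follows; the -1/+1 label vectors are invented toy data:

```python
def co_occurrence(neighbour_label_vectors, j, k):
    """p(i, j, k): within the neighbour sample set of a given sample i,
    the number of samples for which labels j and k are both positive."""
    return sum(1 for lv in neighbour_label_vectors
               if lv[j] == +1 and lv[k] == +1)

def max_co_occurrence(neighbour_label_vectors, j, n_labels):
    """Step B2: the other label with the maximum co-occurrence
    frequency with label j, and that frequency."""
    others = [k for k in range(n_labels) if k != j]
    best = max(others,
               key=lambda k: co_occurrence(neighbour_label_vectors, j, k))
    return best, co_occurrence(neighbour_label_vectors, j, best)

# label vectors (l1, l2, l3) of the samples in one neighbour set
nbrs = [[+1, +1, -1], [+1, -1, -1], [+1, +1, +1], [-1, +1, +1]]
print(max_co_occurrence(nbrs, 0, 3))  # (1, 2): l2 co-occurs with l1 twice
```

With the designated value set to 0 as in the text, a result with a nonzero maximum frequency (here 2) triggers the attribute insertion of step B3.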
(2) The class alignment performed in step 102 on the original positive-example set and original negative-example set of each label may specifically include the following steps C1-C2:
Step C1: determine the maximum sample number among the original positive-example sets; for each label, if the sample number of the label's original positive-example set is less than this maximum, resample the samples in the label's original positive-example set to obtain positive-example samples, and add the positive-example samples to the label's original positive-example set to obtain the class-aligned positive-example set.
Step C2: determine the maximum sample number among the original negative-example sets; for each label, if the sample number of the label's original negative-example set is less than this maximum, resample the samples in the label's original negative-example set to obtain negative-example samples, and add the negative-example samples to the label's original negative-example set to obtain the class-aligned negative-example set.
The execution order of steps C1 and C2 is not restricted in a specific implementation.
Through steps C1 and C2, class alignment is achieved by resampling. The method is simple and applicable to samples of all types.
In a specific implementation, resampling the samples in a label's original positive-example set to obtain a positive-example sample may specifically include: choosing a first specified number of samples from the label's original positive-example set, and taking the mean of the chosen samples as the resampled positive-example sample.
Resampling the samples in a label's original negative-example set to obtain a negative-example sample may specifically include: choosing a second specified number of samples from the label's original negative-example set, and taking the mean of the chosen samples as the resampled negative-example sample.
The first specified number and the second specified number may be the same or different; the present application is not limited in this respect. The mean of samples may be represented by the mean of the samples' attribute vectors.
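Steps C1-C2 can be sketched together: every set is padded up to the largest size with means of cyclically chosen existing samples. The helper name and the choice of two samples per mean are our assumptions; the mean-of-chosen-samples rule and the cyclic selection are from the text:

```python
import itertools

def class_align(sets_by_label):
    """Pad every label's positive (or negative) example set up to the
    largest set size, each time appending the mean of two cyclically
    chosen existing samples (steps C1-C2 with a chosen-sample count
    of 2). Samples are attribute vectors (lists of floats)."""
    target = max(len(s) for s in sets_by_label.values())
    aligned = {}
    for label, samples in sets_by_label.items():
        out = [list(s) for s in samples]
        picker = itertools.cycle(range(len(samples)))  # select in turn
        while len(out) < target:
            a = samples[next(picker)]
            b = samples[next(picker)]
            out.append([(u + v) / 2 for u, v in zip(a, b)])
        aligned[label] = out
    return aligned

# l1 already has the most positives; l2 is padded with a mean sample
pos_sets = {"l1": [[1.0, 1.0], [3.0, 5.0]], "l2": [[2.0, 4.0]]}
print(class_align(pos_sets)["l2"])  # [[2.0, 4.0], [2.0, 4.0]]
```

Running the same function on the negative-example sets implements step C2; the worked example in Embodiment 2 below follows this pattern with the mean (x2 + x3)/2.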
For ease of further understanding, the multi-label classification method provided by the present application is further explained below through Embodiment 2.
Embodiment two
Table 1 shows a multi-label data set containing 6 samples, x1, x2, …, x6, with the label set {l1, l2}.
Table 1: the samples and the labels they possess
Step 1: class alignment
From the statistics of Table 1, the original positive-example set and original negative-example set of l1, and the original positive-example set and original negative-example set of l2, are obtained.
Clearly, l1 has the fewer positive examples (only 2), so to class-align the positive-example sets it suffices to add one positive example to l1's set. The 2 positive examples x2 and x3 of l1 may be selected to generate the positive example (x2 + x3)/2. Similarly, l2 has the fewer negative examples in this example and also needs class alignment; l2's 2 negative examples x3 and x4 may be selected to generate the negative example (x3 + x4)/2. If multiple samples need to be generated, the samples are selected cyclically in turn.
Using the above method, the result after class alignment is as shown in Table 2.
Table 2: the positive-example set and negative-example set of each label after class alignment
Step 2: determine the specific attributes
Let c = 1 (i.e. 1 cluster centre per set) and run the k-means clustering method on the data of Table 2. For label l1, the cluster centres of the class-aligned positive-example set and of the class-aligned negative-example set are obtained.
Similarly, the cluster centres for label l2 are obtained.
According to formulas (2) and (3), the distance from each sample in each label's original example sets to the corresponding cluster centres is calculated, giving the specific-attribute sets of labels l1 and l2 shown in Table 3.
Table 3: the specific-attribute set of each label
Step 3: insert the specific attributes
Assume the given sample is x2 and k is set to 6 (i.e. the neighbour sample set contains 6 elements); the neighbour sample set of x2 is then {x1, x2, x3, x4, x5, x6}. After the co-occurrence frequencies are calculated, R21 = 2 (i.e. the label with the maximum co-occurrence frequency with l1 is l2) and R22 = 1. Thus, for the given sample x2, l1 and l2 have a local relation here, so the specific attributes of the corresponding labels need to be inserted into each other's specific-attribute sets; the result is shown in Table 4.
Table 4: the result after inserting the specific attributes
In this way, the data and semantics of correlated labels are enriched through the correlation between the labels. Sample training can then be carried out based on the specific-attribute sets.
Tests show that the method of the invention can effectively perform multi-label classification, with good overall results on the Hamming Loss, One-error, Coverage, Ranking Loss and Average Precision metrics.
The embodiments of the present application establish a specific-attribute set for each label and, on that basis, combine the label relations to enhance the classification capability of the specific-attribute sets. Moreover, most multi-label classification methods consider class imbalance (i.e. classes not being aligned), label relations and label-specific attributes in isolation, whereas the classification method proposed by the invention considers all three, making the final classification results more accurate.
Embodiment three
Based on the same inventive concept, the embodiment of the present application further provides a multi-label classification apparatus. Fig. 2 is a schematic structural diagram of the apparatus, which includes:
a positive/negative example set determining module 201, configured to determine, for each label in a label set, an original positive example set and an original negative example set of the label; wherein, for each sample, if the sample carries the label, the sample belongs to the original positive example set of the label; otherwise, the sample belongs to the original negative example set of the label;
a class alignment module 202, configured to perform class alignment on the original positive example set and the original negative example set of each label, respectively, to obtain a class-aligned positive example set and a class-aligned negative example set of each label; wherein the class-aligned positive example sets of all labels contain equal numbers of samples, and the class-aligned negative example sets of all labels contain equal numbers of samples;
a cluster center determining module 203, configured to determine, according to a predetermined number of cluster centers and based on a clustering method, the cluster centers of each class-aligned positive example set and the cluster centers of each class-aligned negative example set;
a label-specific feature determining module 204, configured to compute, for each label, the distance of each sample in the original positive example set and the original negative example set of the label to each cluster center of the label, take the distances, arranged in order, as label-specific features of the label corresponding to the respective sample, and form a label-specific feature set of the label with the label-specific features of the samples of the label as elements;
a data optimization module 205, configured to insert, for each label, the label-specific features of other labels correlated with the label into the label-specific feature set of the label; and
a classification training module 206, configured to perform classification training based on the label-specific feature set of each label.
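The core of modules 203 and 204 — clustering each label's positive and negative example sets and re-describing every sample by its distances to all cluster centers — can be sketched as follows. This is a non-authoritative illustration assuming Euclidean distance and a plain k-means (Lloyd's algorithm); the function names, the fixed iteration count and the toy data are assumptions, not from the original disclosure:

```python
import numpy as np

def kmeans_centers(X, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) returning only the cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each sample to its nearest center, then recompute the means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def label_specific_features(pos, neg, k):
    """Distances of every sample of one label to the k positive-set and
    k negative-set cluster centers, arranged in a fixed order."""
    centers = np.vstack([kmeans_centers(pos, k), kmeans_centers(neg, k)])
    X = np.vstack([pos, neg])
    return np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

pos = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
neg = np.array([[9.0, 9.0], [9.1, 9.0], [0.0, 9.0], [0.1, 9.0]])
F = label_specific_features(pos, neg, k=2)
print(F.shape)  # (8, 4): 8 samples x (2 positive + 2 negative) centers
```

Each row of `F` is the label-specific feature vector of one sample for this label; stacking the rows gives the label-specific feature set handled by module 204.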
In one embodiment, the data optimization module specifically includes:
a neighbor sample set determining unit, configured to form, for a given sample, a neighbor sample set corresponding to the given sample from the given sample and a plurality of neighbor samples of the given sample;
a co-occurrence frequency determining unit, configured to determine, in each of the plurality of neighbor sample sets and for each label, the frequency with which the label and another label are simultaneously positive for the same sample as the co-occurrence frequency of the label, and to determine the maximum co-occurrence frequency of the label within the neighbor sample set; and
an optimization unit, configured to insert, if the maximum co-occurrence frequency is greater than a specified value, the label-specific features, for the given sample corresponding to the neighbor sample set, of the other label having the maximum co-occurrence frequency with the label into the label-specific feature set of the label.
In one embodiment, the co-occurrence frequency is determined according to the following equation:
p(i, j, k) = (l_j · l_k) / ||l_k||_1
where i denotes the given sample; l_j denotes the label whose co-occurrence frequency is to be determined; l_k denotes another label, l_j and l_k being the indicator vectors of the two labels over the neighbor sample set; and p(i, j, k) denotes the co-occurrence frequency of l_j and l_k in the neighbor sample set corresponding to the given sample i.
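Read concretely, the equation counts, within the neighbor sample set of sample i, how often labels l_j and l_k are positive together, normalized by how often l_k is positive. A minimal sketch under that reading (the binary label matrix `Y_nbr` is an illustrative assumption):

```python
import numpy as np

def co_occurrence(Y_nbr, j, k):
    """p(i, j, k): dot product of indicator columns l_j and l_k over the
    neighbor sample set of sample i, normalized by the L1 norm of l_k."""
    lj, lk = Y_nbr[:, j], Y_nbr[:, k]
    denom = np.abs(lk).sum()           # ||l_k||_1
    return float(lj @ lk) / denom if denom else 0.0

# rows = neighbor samples of one sample i, columns = labels (1 = positive)
Y_nbr = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])
print(co_occurrence(Y_nbr, j=0, k=1))  # → 2/3: l_0 joins 2 of l_1's 3 positives
```

The asymmetric normalization means p(i, j, k) and p(i, k, j) generally differ: each measures how strongly one label's presence predicts the other within the neighborhood.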
In one embodiment, the apparatus further includes:
a neighbor sample set determining module, configured to determine, for a given sample, the neighbor sample set corresponding to the given sample as follows:
computing the sample difference between each other sample and the given sample; and
selecting, in ascending order of sample difference, a preset number of samples as the neighbor samples of the given sample.
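The neighbor selection just described can be sketched in a few lines, assuming Euclidean distance as the "sample difference" (the original does not fix a particular difference measure, so that choice and the toy data are assumptions):

```python
import numpy as np

def neighbor_set(X, i, n_neighbors):
    """Indices of the n_neighbors samples closest to sample i (excluding i),
    chosen in ascending order of sample difference."""
    diff = np.linalg.norm(X - X[i], axis=1)   # difference to every sample
    order = np.argsort(diff, kind="stable")   # ascending; order[0] is i itself
    return order[1:n_neighbors + 1]

X = np.array([[0.0, 0.0], [0.2, 0.0], [0.9, 0.9], [5.0, 5.0]])
print(neighbor_set(X, i=0, n_neighbors=2))  # → [1 2]
```

The given sample together with the returned neighbors forms the neighbor sample set over which the co-occurrence frequencies are evaluated.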
In one embodiment, the class alignment module is specifically configured to: determine the maximum sample count over all original positive example sets, and, for each label whose original positive example set contains fewer samples than the maximum sample count, resample the samples in the original positive example set of the label to obtain positive example samples, and add the positive example samples to the original positive example set of the label to obtain the class-aligned positive example set; and
determine the maximum sample count over all original negative example sets, and, for each label whose original negative example set contains fewer samples than the maximum sample count, resample the samples in the original negative example set of the label to obtain negative example samples, and add the negative example samples to the original negative example set of the label to obtain the class-aligned negative example set.
In one embodiment, the class alignment module is specifically configured to:
select a first specified number of samples from the original positive example set of the label, and take the mean of the selected samples as a positive example sample obtained by resampling; and
select a second specified number of samples from the original negative example set of the label, and take the mean of the selected samples as a negative example sample obtained by resampling.
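The class alignment just described — growing an undersized example set to the common target size by adding means of randomly chosen existing samples — can be sketched as follows. The choice of two source samples per synthetic sample and the random selection are illustrative assumptions (the original only requires a "specified number"):

```python
import numpy as np

def class_align(S, target_size, n_pick=2, seed=0):
    """Grow example set S to target_size by resampling: each synthetic
    sample is the mean of n_pick randomly selected existing samples."""
    rng = np.random.default_rng(seed)
    S = np.asarray(S, dtype=float)
    extra = []
    while len(S) + len(extra) < target_size:
        idx = rng.choice(len(S), size=n_pick, replace=False)
        extra.append(S[idx].mean(axis=0))
    return np.vstack([S, extra]) if extra else S

pos_small = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
aligned = class_align(pos_small, target_size=5)
print(aligned.shape)  # (5, 2)
```

Applying this to every label's positive set (with the largest positive set's size as the target) and likewise to every negative set yields the equal-sized, class-aligned sets required before clustering.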
In one embodiment, the apparatus further includes:
a cluster center number determining module, configured to determine the number of cluster centers as follows:
determining the minimum of the sample count of the class-aligned positive example set and the sample count of the class-aligned negative example set; and
computing the product of a preset control variable and the determined minimum, and taking the floor of the product as the number of cluster centers, wherein the preset control variable is a constant greater than 0 and less than 1.
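The rule above reduces to a single expression; a sketch with an assumed control variable of 0.1 and a lower guard of one center (both the default value and the guard are illustrative assumptions, the original only requires the constant to lie strictly between 0 and 1):

```python
import math

def n_cluster_centers(n_pos, n_neg, mu=0.1):
    """Number of cluster centers: floor(mu * min(|pos|, |neg|)), 0 < mu < 1.
    The max(1, ...) guard is an added assumption so clustering stays defined."""
    assert 0.0 < mu < 1.0, "control variable must lie in (0, 1)"
    return max(1, math.floor(mu * min(n_pos, n_neg)))

print(n_cluster_centers(120, 480))  # → 12
```

Tying the center count to the smaller of the two aligned sets keeps the clustering granularity comparable across the positive and negative sides of each label.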
The inventive concept of the above apparatus is the same as that of the method embodiments; its principles and beneficial effects can be found in the method embodiments and are not repeated here.
Embodiment four
Embodiment four of the present application further provides a computing device, which may specifically be a desktop computer, a portable computer, a smartphone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), or the like. As shown in Fig. 3, the computing device may include a central processing unit (Central Processing Unit, CPU) 301, a memory 302, an input device 303, an output device 304, etc. The input device may include a keyboard, a mouse, a touch screen, etc., and the output device may include a display device such as a liquid crystal display (Liquid Crystal Display, LCD) or a cathode ray tube (Cathode Ray Tube, CRT).
The memory may include a read-only memory (ROM) and a random access memory (RAM), and provides the processor with the program instructions and data stored in the memory. In the embodiment of the present application, the memory may be used to store the program instructions of the multi-label classification method.
The processor, by calling the program instructions stored in the memory, is configured to perform according to the obtained program instructions: for each label in a label set, determining an original positive example set and an original negative example set of the label; wherein, for each sample, if the sample carries the label, the sample belongs to the original positive example set of the label; otherwise, the sample belongs to the original negative example set of the label;
performing class alignment on the original positive example set and the original negative example set of each label, respectively, to obtain a class-aligned positive example set and a class-aligned negative example set of each label; wherein the class-aligned positive example sets of all labels contain equal numbers of samples, and the class-aligned negative example sets of all labels contain equal numbers of samples;
according to a predetermined number of cluster centers, determining, based on a clustering method, the cluster centers of each class-aligned positive example set and the cluster centers of each class-aligned negative example set;
for each label, computing the distance of each sample in the original positive example set and the original negative example set of the label to each cluster center of the label, taking the distances, arranged in order, as label-specific features of the label corresponding to the respective sample, and forming a label-specific feature set of the label with the label-specific features of the samples of the label as elements;
for each label, inserting the label-specific features of other labels correlated with the label into the label-specific feature set of the label; and
performing classification training based on the label-specific feature set of each label.
Embodiment five
Embodiment five of the present application provides a computer storage medium storing the computer program instructions used by the above computing device, including a program for performing the above multi-label classification method.
The computer storage medium may be any available medium or data storage device accessible to a computer, including, but not limited to, magnetic memories (e.g., floppy disks, hard disks, magnetic tapes, magneto-optical disks (MO)), optical memories (e.g., CD, DVD, BD, HVD), and semiconductor memories (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid-state drives (SSD)).
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features therein; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A multi-label classification method, characterized in that the method comprises:
for each label in a label set, determining an original positive example set and an original negative example set of the label; wherein, for each sample, if the sample carries the label, the sample belongs to the original positive example set of the label; otherwise, the sample belongs to the original negative example set of the label;
performing class alignment on the original positive example set and the original negative example set of each label, respectively, to obtain a class-aligned positive example set and a class-aligned negative example set of each label; wherein the class-aligned positive example sets of all labels contain equal numbers of samples, and the class-aligned negative example sets of all labels contain equal numbers of samples;
according to a predetermined number of cluster centers, determining, based on a clustering method, the cluster centers of each class-aligned positive example set and the cluster centers of each class-aligned negative example set;
for each label, computing the distance of each sample in the original positive example set and the original negative example set of the label to each cluster center of the label, taking the distances, arranged in order, as label-specific features of the label corresponding to the respective sample, and forming a label-specific feature set of the label with the label-specific features of the samples of the label as elements;
for each label, inserting the label-specific features of other labels correlated with the label into the label-specific feature set of the label; and
performing classification training based on the label-specific feature set of each label.
2. The method according to claim 1, characterized in that inserting, for each label, the label-specific features of other labels correlated with the label into the label-specific feature set of the label specifically comprises:
for a given sample, forming a neighbor sample set corresponding to the given sample from the given sample and a plurality of neighbor samples of the given sample;
in each of the plurality of neighbor sample sets, determining, for each label, the frequency with which the label and another label are simultaneously positive for the same sample as the co-occurrence frequency of the label, and determining the maximum co-occurrence frequency of the label within the neighbor sample set; and
if the maximum co-occurrence frequency is greater than a specified value, inserting the label-specific features, for the given sample corresponding to the neighbor sample set, of the other label having the maximum co-occurrence frequency with the label into the label-specific feature set of the label.
3. The method according to claim 2, characterized in that the co-occurrence frequency is determined according to the following equation:
p(i, j, k) = (l_j · l_k) / ||l_k||_1
where i denotes the given sample; l_j denotes the label whose co-occurrence frequency is to be determined; l_k denotes another label, l_j and l_k being the indicator vectors of the two labels over the neighbor sample set; and p(i, j, k) denotes the co-occurrence frequency of l_j and l_k in the neighbor sample set corresponding to the given sample i.
4. The method according to claim 2, characterized in that the method further comprises:
for a given sample, determining the neighbor sample set corresponding to the given sample as follows:
computing the sample difference between each other sample and the given sample; and
selecting, in ascending order of sample difference, a preset number of samples as the neighbor samples of the given sample.
5. The method according to claim 1, characterized in that performing class alignment on the original positive example set and the original negative example set of each label respectively specifically comprises:
determining the maximum sample count over all original positive example sets, and, for each label whose original positive example set contains fewer samples than the maximum sample count, resampling the samples in the original positive example set of the label to obtain positive example samples, and adding the positive example samples to the original positive example set of the label to obtain the class-aligned positive example set; and
determining the maximum sample count over all original negative example sets, and, for each label whose original negative example set contains fewer samples than the maximum sample count, resampling the samples in the original negative example set of the label to obtain negative example samples, and adding the negative example samples to the original negative example set of the label to obtain the class-aligned negative example set.
6. The method according to claim 5, characterized in that resampling the samples in the original positive example set of the label to obtain positive example samples specifically comprises:
selecting a first specified number of samples from the original positive example set of the label, and taking the mean of the selected samples as a positive example sample obtained by resampling; and
resampling the samples in the original negative example set of the label to obtain negative example samples specifically comprises:
selecting a second specified number of samples from the original negative example set of the label, and taking the mean of the selected samples as a negative example sample obtained by resampling.
7. The method according to any one of claims 1-6, characterized in that the method further comprises:
determining the number of cluster centers as follows:
determining the minimum of the sample count of the class-aligned positive example set and the sample count of the class-aligned negative example set; and
computing the product of a preset control variable and the determined minimum, and taking the floor of the product as the number of cluster centers, wherein the preset control variable is a constant greater than 0 and less than 1.
8. A multi-label classification apparatus, characterized in that the apparatus comprises:
a positive/negative example set determining module, configured to determine, for each label in a label set, an original positive example set and an original negative example set of the label; wherein, for each sample, if the sample carries the label, the sample belongs to the original positive example set of the label; otherwise, the sample belongs to the original negative example set of the label;
a class alignment module, configured to perform class alignment on the original positive example set and the original negative example set of each label, respectively, to obtain a class-aligned positive example set and a class-aligned negative example set of each label; wherein the class-aligned positive example sets of all labels contain equal numbers of samples, and the class-aligned negative example sets of all labels contain equal numbers of samples;
a cluster center determining module, configured to determine, according to a predetermined number of cluster centers and based on a clustering method, the cluster centers of each class-aligned positive example set and the cluster centers of each class-aligned negative example set;
a label-specific feature determining module, configured to compute, for each label, the distance of each sample in the original positive example set and the original negative example set of the label to each cluster center of the label, take the distances, arranged in order, as label-specific features of the label corresponding to the respective sample, and form a label-specific feature set of the label with the label-specific features of the samples of the label as elements;
a data optimization module, configured to insert, for each label, the label-specific features of other labels correlated with the label into the label-specific feature set of the label; and
a classification training module, configured to perform classification training based on the label-specific feature set of each label.
9. A computing device, characterized by comprising a memory and a processor, wherein the memory is configured to store program instructions, and the processor is configured to call the program instructions stored in the memory and to perform, according to the obtained program instructions, the multi-label classification method according to any one of claims 1-7.
10. A computer storage medium, characterized in that the computer storage medium stores computer-executable instructions for causing a computer to perform the multi-label classification method according to any one of claims 1-7.
CN201710493622.6A 2017-06-26 2017-06-26 Multiple labeling sorting technique, device, medium and computing device Pending CN107316063A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710493622.6A CN107316063A (en) 2017-06-26 2017-06-26 Multiple labeling sorting technique, device, medium and computing device


Publications (1)

Publication Number Publication Date
CN107316063A true CN107316063A (en) 2017-11-03

Family

ID=60181259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710493622.6A Pending CN107316063A (en) 2017-06-26 2017-06-26 Multiple labeling sorting technique, device, medium and computing device

Country Status (1)

Country Link
CN (1) CN107316063A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108537270A (en) * 2018-04-04 2018-09-14 厦门理工学院 Image labeling method, terminal device and storage medium based on multi-tag study
CN109711433A (en) * 2018-11-30 2019-05-03 东南大学 A kind of fine grit classification method based on meta learning
CN110941612A (en) * 2019-11-19 2020-03-31 上海交通大学 Autonomous data lake construction system and method based on associated data
CN110941612B (en) * 2019-11-19 2020-08-11 上海交通大学 Autonomous data lake construction system and method based on associated data
CN111460187A (en) * 2020-04-01 2020-07-28 重庆金山医疗技术研究院有限公司 Picture labeling method, device and equipment and readable storage medium
CN111460187B (en) * 2020-04-01 2023-06-30 重庆金山医疗技术研究院有限公司 Picture labeling method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
Wang et al. Deep learning approach to peripheral leukocyte recognition
CN107316063A (en) Multiple labeling sorting technique, device, medium and computing device
Wahab et al. Multifaceted fused-CNN based scoring of breast cancer whole-slide histopathology images
CN110457577B (en) Data processing method, device, equipment and computer storage medium
CN109102498B (en) Method for segmenting cluster type cell nucleus in cervical smear image
CN111144215A (en) Image processing method, image processing device, electronic equipment and storage medium
CN110263821A (en) Transaction feature generates the generation method and device of the training of model, transaction feature
Luo et al. OXnet: deep omni-supervised thoracic disease detection from chest X-rays
Wen et al. Wheat spike detection and counting in the field based on SpikeRetinaNet
CN108647595A (en) Vehicle recognition methods again based on more attribute depth characteristics
CN106250909A (en) A kind of based on the image classification method improving visual word bag model
Wang et al. A generalizable and robust deep learning algorithm for mitosis detection in multicenter breast histopathological images
CN107545038A (en) A kind of file classification method and equipment
Rozenberg et al. Localization with limited annotation for chest x-rays
Li et al. Domain adaptive nuclei instance segmentation and classification via category-aware feature alignment and pseudo-labelling
CN103971136A (en) Large-scale data-oriented parallel structured support vector machine classification method
CN108664986A (en) Based on lpThe multi-task learning image classification method and system of norm regularization
Liu et al. Object proposal on RGB-D images via elastic edge boxes
CN109447080A (en) A kind of character identifying method and device
Kruitbosch et al. A convolutional neural network for segmentation of yeast cells without manual training annotations
CN113344079B (en) Image tag semi-automatic labeling method, system, terminal and medium
Tsao et al. Autovp: An automated visual prompting framework and benchmark
Shen et al. Automatic cell segmentation using mini-u-net on fluorescence in situ hybridization images
CN117315090A (en) Cross-modal style learning-based image generation method and device
CN107886513A (en) A kind of device for determining training sample

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171103