CN108009571A - A novel transductive semi-supervised data classification method and system - Google Patents

A novel transductive semi-supervised data classification method and system

Info

Publication number
CN108009571A
Authority
CN
China
Prior art keywords
label
matrix
data
training
soft
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711141009.4A
Other languages
Chinese (zh)
Inventor
贾磊
张召
张莉
王邦军
李凡长
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201711141009.4A
Publication of CN108009571A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a novel transductive semi-supervised data classification method and system. Unsupervised subspace feature learning, discriminative clustering and adaptive semi-supervised classification are seamlessly integrated into a unified framework, and semi-supervised learning is performed on the low-dimensional manifold features of the original data and on the discriminative subspace clustering result, so the framework can be used to represent and classify high-dimensional data. Within this joint model, graph construction is also seamlessly combined with the label propagation process, which yields an adaptive weight coefficient matrix based on low-dimensional manifold features and the soft class labels of the unlabeled data.

Description

A novel transductive semi-supervised data classification method and system
Technical field
The present invention relates to a novel transductive semi-supervised data classification method and system, and belongs to the technical field of data mining and computer vision.
Background
With the continuous development of computer technology and intelligent systems, most of the real data produced in daily life and communication is hard to distinguish because it lacks identifying information (such as class information). In addition, labeling data is costly and time-consuming, so obtaining labels for all data with a fully supervised approach is very expensive. As a result, semi-supervised learning methods, which combine a small amount of labeled data with a large amount of unlabeled data, have attracted increasingly wide attention in recent years. How to use a small amount of label information effectively to improve classification accuracy is therefore a problem that deserves further study.
In recent years, many graph-based semi-supervised methods built on the cluster assumption and the manifold assumption have been proposed to solve data representation and classification problems. Among graph-based semi-supervised classification methods, label propagation has recently attracted academic attention for its effectiveness and fast computation. Label propagation is the process of spreading the prior information of labeled data to unlabeled data through the similarity relations between data points. Typical label propagation methods include Gaussian fields and harmonic functions (GFHF), learning with local and global consistency (LLGC), and linear neighborhood propagation (LNP). Notably, almost all existing transductive label propagation methods may suffer from the following shortcomings. First, label prediction is performed after an independent weight construction process, which cannot guarantee that the constructed graph weights are optimal for the subsequent label propagation and estimation. Second, in existing studies the neighborhood of each data point is usually determined by K-nearest neighbors or an ε-neighborhood. However, a fixed neighbor number K or radius ε is usually the same for every sample, i.e. it is not adaptive. Moreover, given the complex distributions of different real data sets, choosing an appropriate neighbor number K or ball radius ε is difficult in practice. Third, existing label propagation models define the weights on the original high-dimensional data. However, most real-world high-dimensional samples contain unfavorable features, irrelevant features, noise, or even severe corruption, which may directly lead to inaccurate similarity measures and prediction results.
Therefore, providing a more robust classification method with feature dimensionality reduction, so as to reduce cost, is a problem that those skilled in the art urgently need to solve.
Summary of the invention
The object of the present invention is to provide a novel transductive semi-supervised data classification method and system that overcome the high cost of obtaining data labels in the prior art.
To achieve the above object, the technical solution adopted by the present invention is a novel transductive semi-supervised data classification method, comprising:
(1) preprocessing the raw data set: the raw data set is randomly divided into a labeled training set and an unlabeled training set, an initial label matrix is defined from the labeled and unlabeled training sets, and the parameters are initialized, where the unlabeled training set consists of the samples to be tested, whose classes are unknown;
(2) based on the initial label matrix, seamlessly integrating unsupervised subspace feature learning, discriminative clustering and adaptive semi-supervised classification into a unified framework; the framework jointly performs manifold feature learning with discriminative K-means clustering, adaptive weight construction, and label propagation and estimation, yielding nonlinear low-dimensional manifold features, after which the feature and soft-label reconstruction errors are minimized jointly;
(3) solving the minimization of the framework with an iterative optimization method to obtain the soft class label matrix, and determining the class of each sample to be tested from the maximum value in the soft class label matrix, thereby obtaining the most accurate classification result.
Preferably, the preprocessing of the raw data set in step (1) proceeds as follows: the original sample set is divided into a labeled training set and an unlabeled training set, i.e. $X = [X_L, X_U] \in \mathbb{R}^{n \times (l+u)}$ (where $n$ is the dimension of the data, $l$ is the number of labeled training samples and $u$ is the number of unlabeled training samples), comprising the training sample set $X_L = [x_1, x_2, \ldots, x_l] \in \mathbb{R}^{n \times l}$ with $c$ ($c > 2$) class labels and the training sample set $X_U = [x_{l+1}, x_{l+2}, \ldots, x_{l+u}] \in \mathbb{R}^{n \times u}$ without any labels, where $l + u = N$; the initial label matrix $Y$ is then defined.
Preferably, the framework in step (2) is:
where $Z = [z_1, z_2, \ldots, z_N]$ is the low-dimensional manifold representation; $H$ is the cluster indicator matrix; $G = [g_1, g_2, \ldots, g_c]$ holds the cluster centers; $f_i$ is the soft label vector of $x_i$; $\mu_i$ is the adjustment parameter of $x_i$, with $\mu_i = +\infty$ when the label of $x_i$ in the training set is known and $\mu_i = 0$ otherwise; $Y$ is the initial label matrix; $W$ is the sparse weight matrix, with $w(i,j)$ the similarity between $x_i$ and $x_j$; and $\alpha$ and $\beta$ are trade-off parameters. The objective combines a manifold feature reconstruction term, a reconstruction error and an adaptive classification error. Transforming the above framework yields the following matrix form:
where $ZZ^{T} = I$ is the constraint.
The present invention also provides a novel transductive semi-supervised data classification system, comprising:
a training preprocessing module for preprocessing the raw data set, randomly dividing the original data into a training set and a test set, defining the initial label matrix, and initializing the parameters;
a training module for seamlessly integrating, based on the initial label matrix, unsupervised subspace feature learning, discriminative clustering and adaptive semi-supervised classification into a unified framework; the framework jointly performs manifold feature learning with discriminative K-means clustering, adaptive weight construction, and label propagation and estimation, yielding nonlinear low-dimensional manifold features, after which the feature and soft-label reconstruction errors are minimized jointly;
a test module for solving the minimization of the framework with an iterative optimization method to obtain the soft class label matrix, and determining the class of each sample to be tested from the maximum value in the soft class label matrix, thereby obtaining the most accurate classification result.
In the present invention, the improved discriminative K-means clustering is combined with the manifold feature learning process to obtain nonlinear low-dimensional manifold features, and the low-dimensional manifold feature learning process effectively removes interfering factors contained in the original data, such as redundant information, noise and heterogeneous data. The feature and soft-label reconstruction errors are then minimized jointly in the manifold feature space, so the soft class label matrix of the unlabeled data can be obtained accurately and the classes determined, while the resulting adaptive weight coefficients are guaranteed to be optimal for data representation and classification. The manifold features are then updated further on the basis of the adaptive weight coefficients.
In the present invention, the above minimization is solved in step (3) by an iterative optimization scheme, which finally yields an optimal soft class label matrix and an optimal adaptive reconstruction weight coefficient matrix; the class of each unlabeled sample is determined from the maximum value in the label matrix, giving the most accurate classification result.
In the present invention, the proposed unified joint framework performs manifold feature learning, adaptive weight learning and label propagation, and thereby predicts the unlabeled samples in the test set and produces the soft label matrix. Specifically:
The minimization of the framework is solved. Because the framework involves five variables at once, the present invention adopts an iterative optimization strategy, which finally yields the soft label vector $f_j$ of each unlabeled training sample. The position of the largest element of that vector is the class label of the unlabeled training sample, so the hard label of each unlabeled training sample is $\arg\max_{i \le c} (f_j)_i$, where $(f_j)_i$ denotes the $i$-th element of the predicted soft label vector $f_j$.
By adopting the above technical solution, the present invention has the following advantages over the prior art:
1. The invention discloses a novel transductive semi-supervised data classification method and system in which unsupervised subspace feature learning, discriminative clustering and adaptive semi-supervised classification are seamlessly integrated into a unified framework. Semi-supervised learning is performed on the low-dimensional manifold features of the original data and on the discriminative subspace clustering result, so the framework can be used to represent and classify high-dimensional data. Within this joint model, graph construction is also seamlessly combined with the label propagation process, which yields an adaptive weight coefficient matrix based on low-dimensional manifold features and the soft class labels of the unlabeled data.
2. The method first combines the improved discriminative K-means clustering with the manifold feature learning process to obtain nonlinear low-dimensional manifold features, and the low-dimensional manifold feature learning process effectively removes interfering factors contained in the original data, such as redundant information, noise and heterogeneous data. Jointly minimizing the feature and soft-label reconstruction errors in the feature space therefore gives an accurate soft class label matrix for the unlabeled data and determines the classes, while the resulting adaptive weight coefficients are guaranteed to be optimal for data representation and classification. In addition, constructing the adaptive weights on the manifold features effectively avoids the difficulty of choosing parameters such as the number of neighbors.
Brief description of the drawings
Fig. 1 is a flow chart of the novel transductive semi-supervised data classification method disclosed in an embodiment of the present invention;
Fig. 2 is a structural diagram of the novel transductive semi-supervised data classification system disclosed in an embodiment of the present invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments:
Embodiment one:
The present invention was tested on three data sets from the UCI machine learning repository: Ionosphere, Balance Scale and SCCTS. Ionosphere contains 351 samples with 34 attributes; Balance Scale contains 132 samples in 3 classes; SCCTS contains 600 samples in 6 classes. In each group of experiments, 1 to 9 training samples are selected in turn from each database, and the classification accuracy is observed. These databases were collected from many different sources, so the test results are broadly representative.
Referring to Fig. 1, which is a flow chart of the novel transductive semi-supervised data classification method disclosed in an embodiment of the present invention, the specific implementation steps are as follows:
A novel transductive semi-supervised data classification method, comprising:
(1) preprocessing the raw data set: the raw data set is randomly divided into a labeled training set and an unlabeled training set, an initial label matrix is defined from the labeled and unlabeled training sets, and the parameters are initialized, where the unlabeled training set consists of the samples to be tested, whose classes are unknown;
The raw data set is divided into a labeled training set and an unlabeled training set, expressed as $X = [X_L, X_U] \in \mathbb{R}^{n \times (l+u)}$ (where $n$ is the dimension of the data, $l$ is the number of labeled training samples and $u$ is the number of unlabeled training samples), comprising the training sample set $X_L = [x_1, x_2, \ldots, x_l] \in \mathbb{R}^{n \times l}$ with $c$ ($c > 2$) class labels and the training sample set $X_U = [x_{l+1}, x_{l+2}, \ldots, x_{l+u}] \in \mathbb{R}^{n \times u}$ without any labels, where $l + u = N$; the initial label matrix $Y$ is then defined.
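The random split and the initial label matrix of step (1) can be sketched as follows. This is only an illustrative sketch: the function name `split_and_init_labels`, the fixed number of labeled samples per class, and the one-hot encoding of Y (a one-hot column for each labeled sample and a zero column for each unlabeled sample) are assumptions, since the text does not spell out the definition of Y.

```python
import numpy as np

def split_and_init_labels(labels, n_labeled_per_class, seed=0):
    """Randomly pick labeled samples per class and build an assumed one-hot Y (c x N)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    rng = np.random.default_rng(seed)

    # Draw the same number of labeled samples from every class, without replacement.
    labeled_idx = []
    for cls in classes:
        members = np.flatnonzero(labels == cls)
        labeled_idx.extend(rng.choice(members, size=n_labeled_per_class, replace=False))
    labeled_idx = np.sort(np.array(labeled_idx))
    unlabeled_idx = np.setdiff1d(np.arange(labels.size), labeled_idx)

    # Assumed encoding: column j of Y is one-hot if x_j is labeled, all zeros otherwise.
    Y = np.zeros((classes.size, labels.size))
    class_pos = np.searchsorted(classes, labels[labeled_idx])
    Y[class_pos, labeled_idx] = 1.0
    return labeled_idx, unlabeled_idx, Y
```

The unlabeled indices returned here play the role of the samples to be tested; their true labels are only used afterwards to measure accuracy.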
(2) Based on the initial label matrix, unsupervised subspace feature learning, discriminative clustering and adaptive semi-supervised classification are seamlessly integrated into a unified framework. The framework jointly performs manifold feature learning with discriminative K-means clustering, adaptive weight construction, and label propagation and estimation, yielding nonlinear low-dimensional manifold features, after which the feature and soft-label reconstruction errors are minimized jointly;
After step (1) has initialized the relevant parameters and produced the training set and test set, the following objective function is established:
where $Z = [z_1, z_2, \ldots, z_N]$ is the low-dimensional manifold representation; $H$ is the cluster indicator matrix; $G = [g_1, g_2, \ldots, g_c]$ holds the cluster centers; $f_i$ is the soft label vector of $x_i$; $\mu_i$ is the adjustment parameter of $x_i$, with $\mu_i = +\infty$ when the label of $x_i$ in the training set is known and $\mu_i = 0$ otherwise; $Y$ is the initial label matrix; $W$ is the sparse weight matrix, with $w(i,j)$ the similarity between $x_i$ and $x_j$; and $\alpha$ and $\beta$ are trade-off parameters. The objective combines a manifold feature reconstruction term, a reconstruction error and an adaptive classification error.
Based on the matrix expressions given above, the problem can be rewritten as:
where $ZZ^{T} = I$ is the constraint. Because the framework involves five variables at once, the present invention solves the minimization with an iterative optimization strategy in which the variables are updated alternately.
First, F, W and H are fixed and G and Z are updated, which amounts to optimizing the following expression:
where the expression consists of a neighborhood reconstruction term and a discriminative clustering error, and $\alpha$ is the trade-off parameter. Taking the partial derivative of the above expression with respect to G gives:
Setting the right-hand side of the above expression to 0 yields the iterative update formula for G:
To solve for the variable Z, the above formula is substituted back into the objective function, which gives:
Eigen-decomposition is applied to $\alpha(I-W)(I-W)^{T} + (I - H(H^{T}H)^{-1}H^{T})$, and the eigenvectors associated with the $d$ smallest eigenvalues are taken as Z, the low-dimensional representation of the original data, where $d$ is the dimension of the low-dimensional embedding subspace. Once Z and G have been obtained, the cluster indicator H can be solved; H is defined as follows:
$H_L$ is the cluster indicator matrix of the labeled data, and $H_U$ is the cluster indicator of the corresponding unlabeled data. The above procedure can be regarded as semi-supervised K-means.
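The eigen-decomposition step for Z can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: samples are stored as columns, H is an N x c indicator matrix with no empty cluster, and the helper `cluster_centers` assumes that the clustering term has the least-squares form $\|Z - GH^{T}\|_F^2$, which is consistent with the projector $H(H^{T}H)^{-1}H^{T}$ in the matrix above but is not stated explicitly in the text. The function names are hypothetical.

```python
import numpy as np

def update_Z(W, H, alpha, d):
    """Eigenvectors of the d smallest eigenvalues of
    alpha (I - W)(I - W)^T + (I - H (H^T H)^{-1} H^T), returned as a d x N matrix."""
    N = W.shape[0]
    I = np.eye(N)
    P = H @ np.linalg.solve(H.T @ H, H.T)      # projector onto the span of H's columns
    M = alpha * (I - W) @ (I - W).T + (I - P)
    M = (M + M.T) / 2                          # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(M)       # eigenvalues come back in ascending order
    return eigvecs[:, :d].T                    # orthonormal rows, so Z Z^T = I

def cluster_centers(Z, H):
    """Assumed least-squares centers G = Z H (H^T H)^{-1} (a d x c matrix)."""
    return np.linalg.solve(H.T @ H, H.T @ Z.T).T
```

Because `numpy.linalg.eigh` returns orthonormal eigenvectors, the constraint $ZZ^{T} = I$ holds by construction, and with a one-hot H each column of G is the mean of the embedded samples assigned to that cluster.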
Once the low-dimensional manifold representation Z and the cluster indicator H have been solved and fixed, the adaptive weight coefficient matrix W is updated, which amounts to optimizing the following expression:
Note that while the other variables are being solved beforehand, W is initialized with the reconstruction weights of locally linear embedding (LLE). Taking the partial derivative of the above expression with respect to W gives:
Setting the right-hand side of the above expression to 0 finally yields the iterative update formula for W:
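The LLE initialization of W mentioned above can be sketched as follows. This illustrates the standard locally linear embedding reconstruction weights, not the update formula of the invention. The function name `lle_weights`, the neighbor count `k`, the regularization constant `reg`, and the convention that column j of W holds the weights reconstructing $x_j$ (so that $X \approx XW$) are assumptions made for the example.

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """X is n x N (samples as columns). Returns an N x N matrix W whose column j
    reconstructs x_j from its k nearest neighbors, with weights summing to 1."""
    n, N = X.shape
    W = np.zeros((N, N))
    for j in range(N):
        dist = np.linalg.norm(X - X[:, [j]], axis=0)
        dist[j] = np.inf                       # a sample is not its own neighbor
        nbrs = np.argsort(dist)[:k]
        D = X[:, nbrs] - X[:, [j]]             # neighbors centered on x_j
        C = D.T @ D                            # local Gram matrix (k x k)
        C += reg * max(np.trace(C), 1e-12) * np.eye(k)   # regularize for stability
        w = np.linalg.solve(C, np.ones(k))
        W[nbrs, j] = w / w.sum()               # enforce the sum-to-one constraint
    return W
```

The regularization keeps the local Gram matrix invertible when k exceeds the data dimension or when neighbors are nearly collinear.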
Once H, G, Z and W have all been solved, the predicted soft labels are computed, which amounts to minimizing the following function:
Taking the partial derivative of the above expression with respect to F and setting it to 0 finally yields the iterative update formula for F:
$F^{t+1} = YU\left(\beta(I-W^{t})(I-W^{t})^{T} + U\right)^{-1}$
Finally, because H, G, Z and W are functions of F, the five variables are updated alternately in this way so that the objective function is solved effectively, which finally gives the soft label matrix F and the prediction results.
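The closed-form update of F can be sketched numerically as below. The sketch assumes that U is the diagonal matrix of the adjustment parameters $\mu_i$ (the text does not define U explicitly) and replaces $\mu_i = +\infty$ for labeled samples with a large finite value `mu_labeled`; both choices, and the function name `update_F`, are assumptions for illustration.

```python
import numpy as np

def update_F(Y, W, labeled_mask, beta, mu_labeled=1e6):
    """F = Y U (beta (I - W)(I - W)^T + U)^{-1}, with Y of size c x N."""
    N = W.shape[0]
    I = np.eye(N)
    # Assumption: U = diag(mu_1, ..., mu_N); a large finite value stands in for +inf.
    U = np.diag(np.where(labeled_mask, mu_labeled, 0.0))
    A = beta * (I - W) @ (I - W).T + U
    # A is symmetric, so F = (Y U) A^{-1} is obtained by solving A F^T = (Y U)^T.
    return np.linalg.solve(A, (Y @ U).T).T
```

For example, with four samples in which samples 0 and 3 are labeled (classes 0 and 1) and each unlabeled sample is reconstructed entirely from its labeled neighbor, the labels spread to the neighbors:

```python
Y = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
W = np.zeros((4, 4))
W[1, 0] = W[0, 1] = 1.0   # samples 0 and 1 reconstruct each other
W[3, 2] = W[2, 3] = 1.0   # samples 2 and 3 reconstruct each other
F = update_F(Y, W, np.array([True, False, False, True]), beta=1.0)
```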
The specific algorithm is as follows:
Algorithm: novel transductive semi-supervised data classification method
Input: raw data matrix; control parameters $\alpha$, $\beta$; initial label matrix $Y$;
Initialization: $F = Y$; the sparse weight matrix $W$ is initialized with the locally linear embedding reconstruction weights;
While not converged:
1) Compute the cluster centers G: fix W, F, Z and H and update $G^{t+1}$:
2) Compute Z: perform eigenvalue decomposition of the following matrix, sort the result in ascending order and take the first $d$ entries:
$\alpha(I-W)(I-W)^{T} + (I - H(H^{T}H)^{-1}H^{T})$;
3) Fix W, F, G and Z and update the cluster indicator matrix H:
4) Fix F, G, H and Z and update the weight matrix $W^{t+1}$:
5) Fix W, G, H and Z and update $F^{t+1}$:
$F^{t+1} = YU\left(\beta(I-W^{t})(I-W^{t})^{T} + U\right)^{-1}$
Check for convergence:
If sqrt(sum(F(:).^2)) < tol || iter >= maxIter, stop;
otherwise t = t + 1.
Output: soft label matrix ($F^{*} \leftarrow F^{t+1}$).
(3) The minimization of the framework is solved with an iterative optimization method to obtain the soft class label matrix, and the class of each sample to be tested is determined from the maximum value in the soft class label matrix, giving the most accurate classification result.
Solving the minimization of the proposed framework with the iterative optimization strategy finally yields the soft label vector $f_j$ of each unlabeled training sample. The position of the largest element of that vector is the class label of the unlabeled training sample, so the hard label of each unlabeled training sample is $\arg\max_{i \le c} (f_j)_i$, where $(f_j)_i$ denotes the $i$-th element of the predicted soft label vector $f_j$.
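The final label assignment is a per-sample arg-max over the soft label vector. A minimal sketch, assuming F stores one soft label vector per column (a c x N matrix) and classes are indexed from 0; the function name `hard_labels` is hypothetical:

```python
import numpy as np

def hard_labels(F):
    """Return, for each column f_j of F, the index i of its largest entry (f_j)_i."""
    return np.argmax(F, axis=0)

# Three samples, three classes: each column is one predicted soft label vector.
F = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])
predicted = hard_labels(F)   # sample 0 -> class 0, sample 1 -> class 1, sample 2 -> class 2
```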
The method has been described in detail in the embodiment disclosed above. The method of the present invention can be implemented by systems of various forms, so the invention also discloses a system, which is described in detail in the specific embodiment below.
Referring to Fig. 2, which is a structural diagram of the system for the novel transductive semi-supervised data classification method disclosed in an embodiment of the present invention, the system specifically comprises:
Training preprocessing module 201, which preprocesses the raw data set, randomly divides the original data into a labeled training set and an unlabeled training set, defines the initial label matrix, and initializes the parameters;
The original sample set is divided into a labeled training set and an unlabeled training set, which can be expressed as $X = [X_L, X_U] \in \mathbb{R}^{n \times (l+u)}$ (where $n$ is the dimension of the data, $l$ is the number of labeled training samples and $u$ is the number of unlabeled training samples), comprising the training sample set $X_L = [x_1, x_2, \ldots, x_l] \in \mathbb{R}^{n \times l}$ with $c$ ($c > 2$) class labels and the training sample set $X_U = [x_{l+1}, x_{l+2}, \ldots, x_{l+u}] \in \mathbb{R}^{n \times u}$ without any labels, where $l + u = N$; the initial label matrix $Y$ is then defined.
Training module 202, which, based on the initial label matrix, seamlessly integrates unsupervised subspace feature learning, discriminative clustering and adaptive semi-supervised classification into a unified framework. The framework jointly performs manifold feature learning with discriminative K-means clustering, adaptive weight construction, and label propagation and estimation, yielding nonlinear low-dimensional manifold features, after which the feature and soft-label reconstruction errors are minimized jointly.
Unsupervised subspace feature learning, discriminative clustering and adaptive semi-supervised classification are seamlessly integrated into a unified framework, and semi-supervised learning is performed on the low-dimensional manifold features of the original data and on the discriminative subspace clustering result, so the framework can be used to represent and classify high-dimensional data. The improved discriminative K-means clustering is first combined with the manifold feature learning process to obtain nonlinear low-dimensional manifold features, and the low-dimensional manifold feature learning process effectively removes interfering factors contained in the original data, such as redundant information, noise and heterogeneous data. The feature and soft-label reconstruction errors are then minimized jointly in the manifold feature space, so the soft class label matrix of the unlabeled data can be obtained accurately and the classes determined, while the resulting adaptive weight coefficients are guaranteed to be optimal for data representation and classification. The manifold features are then updated further on the basis of the adaptive weight coefficients.
After preprocessing module 201 has initialized the relevant parameters and produced the training set and test set, training module 202 establishes the following objective function:
where $Z = [z_1, z_2, \ldots, z_N]$ is the low-dimensional manifold representation; $H$ is the cluster indicator matrix; $G = [g_1, g_2, \ldots, g_c]$ holds the cluster centers; $f_i$ is the soft label vector of $x_i$; $\mu_i$ is the adjustment parameter of $x_i$, with $\mu_i = +\infty$ when the label of $x_i$ in the training set is known and $\mu_i = 0$ otherwise; $Y$ is the initial label matrix; $W$ is the sparse weight matrix, with $w(i,j)$ the similarity between $x_i$ and $x_j$; and $\alpha$ and $\beta$ are trade-off parameters. The objective combines a manifold feature reconstruction term, a reconstruction error and an adaptive classification error.
Based on the matrix expressions given above, the problem can be rewritten as:
where $ZZ^{T} = I$ is the constraint. Because the framework involves five variables at once, the present invention solves the minimization with an iterative optimization strategy in which the variables are updated alternately.
First, F, W and H are fixed and G and Z are updated, which amounts to optimizing the following expression:
where the expression consists of a neighborhood reconstruction term and a subspace clustering error, and $\alpha$ is the trade-off parameter. Taking the partial derivative of the above expression with respect to G gives:
Setting the right-hand side of the above expression to 0 yields the iterative update formula for G:
To solve for the variable Z, the above formula is substituted back into the objective function, which gives:
Eigen-decomposition is applied to $\alpha(I-W)(I-W)^{T} + (I - H(H^{T}H)^{-1}H^{T})$, and the eigenvectors associated with the $d$ smallest eigenvalues are taken as Z, the low-dimensional representation of the original data, where $d$ is the dimension of the low-dimensional embedding subspace. Once Z and G have been obtained, the cluster indicator H can be solved; H is defined as follows:
$H_L$ is the cluster indicator matrix of the labeled data, and $H_U$ is the cluster indicator of the corresponding unlabeled data. The above procedure can be regarded as semi-supervised K-means.
Once the low-dimensional manifold representation Z and the cluster indicator H have been solved and fixed, the adaptive weight coefficient matrix W is updated, which amounts to optimizing the following expression:
Note that while the other variables are being solved beforehand, W is initialized with the reconstruction weights of locally linear embedding (LLE). Taking the partial derivative of the above expression with respect to W gives:
Setting the right-hand side of the above expression to 0 finally yields the iterative update formula for W:
Once H, G, Z and W have all been solved, the predicted soft labels are computed, which amounts to minimizing the following function:
Taking the partial derivative of the above expression with respect to F and setting it to 0 finally yields the iterative update formula for F:
$F^{t+1} = YU\left(\beta(I-W^{t})(I-W^{t})^{T} + U\right)^{-1}$
Finally, because H, G, Z and W are functions of F, the five variables are updated alternately in this way so that the objective function is solved effectively, which finally gives the soft label matrix F and the prediction results.
The specific algorithm is as follows:
Algorithm: adaptive transductive classification method with joint feature learning and discriminative clustering
Input: raw data matrix; control parameters $\alpha$, $\beta$; initial label matrix $Y$;
Initialization: $F = Y$; the sparse weight matrix $W$ is initialized with the locally linear embedding reconstruction weights;
While not converged:
1) Compute the cluster centers G: fix W, F, Z and H and update $G^{t+1}$:
2) Compute Z: perform eigenvalue decomposition of the following matrix, sort the result in ascending order and take the first $d$ entries:
$\alpha(I-W)(I-W)^{T} + (I - H(H^{T}H)^{-1}H^{T})$;
3) Fix W, F, G and Z and update the cluster indicator matrix H:
4) Fix F, G, H and Z and update the weight matrix $W^{t+1}$:
5) Fix W, G, H and Z and update $F^{t+1}$:
$F^{t+1} = YU\left(\beta(I-W^{t})(I-W^{t})^{T} + U\right)^{-1}$
Check for convergence:
If sqrt(sum(F(:).^2)) < tol || iter >= maxIter, stop;
otherwise t = t + 1.
Output: soft label matrix ($F^{*} \leftarrow F^{t+1}$).
Test module 203, which solves the minimization of the framework with an iterative optimization method to obtain the soft class label matrix, and determines the class of each sample to be tested from the maximum value in the soft class label matrix, giving the most accurate classification result.
The above minimization finally yields the soft label vector $f_j$ of each unlabeled training sample. The position of the largest element of that vector is the class label of the unlabeled training sample, so the hard label of each unlabeled training sample is $\arg\max_{i \le c} (f_j)_i$, where $(f_j)_i$ denotes the $i$-th element of the predicted soft label vector $f_j$.
Table 1 compares the recognition results of the method of the present invention with those of the SLP, LNP, LLGC, LapLDA, GFHF and CD-LNP methods, giving the average and highest recognition rate of each method in the experiments. In this example, each compared method uses the default optimal parameters of the algorithm in its respective publication. The present invention was tested on three data sets from the UCI machine learning repository: Ionosphere, Balance Scale and SCCTS. Ionosphere contains 351 samples with 34 attributes; Balance Scale contains 132 samples in 3 classes; SCCTS contains 600 samples in 6 classes. In each group of experiments, 1 to 9 training samples are selected in turn from each database.
Table 1. Comparison of the recognition results of the present invention with the SLP, LNP, LLGC, LapLDA, GFHF and CD-LNP methods
In summary, the invention discloses a novel transductive semi-supervised data classification method and system in which unsupervised subspace feature learning, discriminative clustering and adaptive semi-supervised classification are seamlessly integrated into a unified framework. Semi-supervised learning is performed on the low-dimensional manifold features of the original data and on the discriminative subspace clustering result, so the framework can be used to represent and classify high-dimensional data. Within this joint model, graph construction is also seamlessly combined with the label propagation process, which yields an adaptive weight coefficient matrix based on low-dimensional manifold features and the soft class labels of the unlabeled data. Specifically, the method first combines the improved discriminative K-means clustering with the manifold feature learning process to obtain nonlinear low-dimensional manifold features, and the low-dimensional manifold feature learning process effectively removes interfering factors contained in the original data, such as redundant information, noise and heterogeneous data. Jointly minimizing the feature and soft-label reconstruction errors in the feature space therefore gives an accurate soft class label matrix for the unlabeled data and determines the classes, while the resulting adaptive weight coefficients are guaranteed to be optimal for data representation and classification. In addition, constructing the adaptive weights on the manifold features effectively avoids the difficulty of choosing parameters such as the number of neighbors.

Claims (4)

  1. A new transductive semi-supervised data classification method, characterized by comprising:
    (1) preprocessing a raw data set: randomly dividing the raw data set into a labeled training set and an unlabeled training set, defining an initial label matrix from the labeled training set and the unlabeled training set, and completing the initialization of parameters, wherein the unlabeled training set consists of the samples to be tested, whose classes are unknown;
    (2) based on the initial label matrix, seamlessly integrating unsupervised subspace feature learning, discriminative clustering and adaptive semi-supervised classification into one unified framework; using the framework to combine manifold feature learning with discriminative K-means clustering, adaptive weight construction, and label propagation and estimation to obtain nonlinear low-dimensional manifold features; and then jointly minimizing the feature reconstruction error and the soft label reconstruction error;
    (3) solving the minimization of the framework with an iterative optimization method to obtain a soft class label matrix, and determining the class of each sample to be tested from the maximum value in the soft class label matrix, thereby obtaining the most accurate classification result.
  2. The new transductive semi-supervised data classification method according to claim 1, characterized in that preprocessing the raw data set in step (1) comprises the following specific process: the original sample data set is divided into a labeled training set and an unlabeled training set, i.e. X = [X_L, X_U] ∈ R^{n×(l+u)} (where n is the dimension of the data, l is the number of labeled training samples and u is the number of unlabeled training samples), comprising the training sample set X_L = [x_1, x_2, ..., x_l] ∈ R^{n×l} with c (c > 2) class labels and the training sample set X_U = [x_{l+1}, x_{l+2}, ..., x_{l+u}] ∈ R^{n×u} without any label, where l + u = N; and an initial label matrix Y is defined.
  3. The new transductive semi-supervised data classification method according to claim 1, characterized in that the framework in step (2) is an objective function in which: Z = [z1, z2, ..., zN] is the low-dimensional manifold representation; a cluster indicator matrix is used together with the cluster centers G = [g1, g2, ..., gc]; each sample has a soft label vector; μi is the adjusting parameter of xi, with μi = +∞ when the label of xi in the training set is known and μi = 0 otherwise; Y is the initial label matrix; the weight matrix has entries w(i, j), each representing the similarity between xi and xj; and α and β are trade-off parameters. The objective comprises a manifold feature reconstruction term, a reconstruction error and an adaptive classification error. Rewriting the framework yields a matrix form,
    where ZZ^T = I is the constraint.
  4. A new transductive semi-supervised data classification system, characterized by comprising:
    a training preprocessing module, configured to preprocess a raw data set by randomly dividing the raw data into a labeled training set and an unlabeled training set, defining an initial label matrix, and completing the initialization of parameters;
    a training module, configured to, based on the initial label matrix, seamlessly integrate unsupervised subspace feature learning, discriminative clustering and adaptive semi-supervised classification into one unified framework, use the framework to combine manifold feature learning with discriminative K-means clustering, adaptive weight construction, and label propagation and estimation to obtain nonlinear low-dimensional manifold features, and then jointly minimize the feature reconstruction error and the soft label reconstruction error;
    a test module, configured to solve the minimization of the framework with an iterative optimization method to obtain a soft class label matrix, and to determine the class of each sample to be tested from the maximum value in the soft class label matrix, thereby obtaining the most accurate classification result.
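The bookkeeping around the claimed steps can be sketched in a few lines: building an initial label matrix, setting the adjusting parameters μi, and reading the class off the largest soft label entry. This is a minimal illustration, not the claimed implementation. The one-hot, c x N layout of the label matrix is a common convention assumed here (the claims do not spell it out), and the function names are hypothetical.

```python
import math

def initial_label_matrix(labels, labeled_indices, n_classes):
    """Build a c x N matrix with Y[k][i] = 1 if labeled sample i has class k.

    Assumption: columns of unlabeled samples are left as all zeros.
    """
    Y = [[0.0] * len(labels) for _ in range(n_classes)]
    for i in labeled_indices:
        Y[labels[i]][i] = 1.0
    return Y

def adjusting_parameters(n_samples, labeled_indices):
    """mu_i = +inf when the label of sample i is known, otherwise 0."""
    labeled = set(labeled_indices)
    return [math.inf if i in labeled else 0.0 for i in range(n_samples)]

def decide_classes(F, sample_indices):
    """Assign each listed sample the class with the largest soft label value.

    F is a c x N soft class label matrix; F[k][i] scores class k for sample i.
    """
    n_classes = len(F)
    return {i: max(range(n_classes), key=lambda k: F[k][i])
            for i in sample_indices}
```

For example, with four samples of which the first and last are labeled, `initial_label_matrix` fills only those two columns, and `decide_classes` is then applied to the remaining (unlabeled) column indices once a soft class label matrix is available.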
CN201711141009.4A 2017-11-16 2017-11-16 A kind of semi-supervised data classification method of new direct-push and system Pending CN108009571A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711141009.4A CN108009571A (en) 2017-11-16 2017-11-16 A kind of semi-supervised data classification method of new direct-push and system

Publications (1)

Publication Number Publication Date
CN108009571A true CN108009571A (en) 2018-05-08

Family

ID=62052636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711141009.4A Pending CN108009571A (en) 2017-11-16 2017-11-16 A kind of semi-supervised data classification method of new direct-push and system

Country Status (1)

Country Link
CN (1) CN108009571A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110895705B (en) * 2018-09-13 2024-05-14 富士通株式会社 Abnormal sample detection device, training device and training method thereof
CN110895705A (en) * 2018-09-13 2020-03-20 富士通株式会社 Abnormal sample detection device, training device and training method thereof
CN109829472A (en) * 2018-12-24 2019-05-31 陕西师范大学 Semisupervised classification method based on probability neighbour
CN109829472B (en) * 2018-12-24 2024-05-14 陕西师范大学 Semi-supervised classification method based on probability nearest neighbor
CN111027582B (en) * 2019-09-20 2023-06-27 哈尔滨理工大学 Semi-supervised feature subspace learning method and device based on low-rank graph learning
CN111027582A (en) * 2019-09-20 2020-04-17 哈尔滨理工大学 Semi-supervised feature subspace learning method and device based on low-rank graph learning
CN110648355A (en) * 2019-09-29 2020-01-03 中科智感科技(湖南)有限公司 Image tracking method, system and related device
CN111680644B (en) * 2020-06-11 2023-03-28 天津大学 Video behavior clustering method based on deep space-time feature learning
CN111680644A (en) * 2020-06-11 2020-09-18 天津大学 Video behavior clustering method based on deep space-time feature learning
CN113705635A (en) * 2021-08-11 2021-11-26 西安交通大学 Semi-supervised width learning classification method and equipment based on adaptive graph
CN114343674A (en) * 2021-12-22 2022-04-15 杭州电子科技大学 Combined judgment subspace mining and semi-supervised electroencephalogram emotion recognition method
CN114343674B (en) * 2021-12-22 2024-05-03 杭州电子科技大学 Combined discrimination subspace mining and semi-supervised electroencephalogram emotion recognition method
CN114418039A (en) * 2022-03-30 2022-04-29 浙江大学 Heterogeneous classifier aggregation method for improving classification fairness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180508)