CN113033641A - Semi-supervised classification method for high-dimensional data - Google Patents

Semi-supervised classification method for high-dimensional data Download PDF

Info

Publication number
CN113033641A
CN113033641A CN202110285595.XA CN202110285595A CN113033641A CN 113033641 A CN113033641 A CN 113033641A CN 202110285595 A CN202110285595 A CN 202110285595A CN 113033641 A CN113033641 A CN 113033641A
Authority
CN
China
Prior art keywords
matrix
subspace
sample
learning
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110285595.XA
Other languages
Chinese (zh)
Other versions
CN113033641B (en
Inventor
叶枫旭
余志文
陈俊龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110285595.XA priority Critical patent/CN113033641B/en
Publication of CN113033641A publication Critical patent/CN113033641A/en
Application granted granted Critical
Publication of CN113033641B publication Critical patent/CN113033641B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2132Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • G06F18/21322Rendering the within-class scatter matrix non-singular
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24143Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2132Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • G06F18/21322Rendering the within-class scatter matrix non-singular
    • G06F18/21328Rendering the within-class scatter matrix non-singular involving subspace restrictions, e.g. nullspace techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semi-supervised classification method for high-dimensional data, and relates to the field of artificial intelligence semi-supervised learning. The method mainly overcomes the influence of data noise and redundant features on a model in high-dimensional data in the manufacturing industry, integrates subspace learning, graph construction and classifier training into a unified framework, and achieves a better classification effect. The method comprises the following steps: 1) inputting a training data set; 2) normalizing the data; 3) initializing parameters and variables; 4) subspace learning; 5) constructing a graph; 6) training a classifier; 7) repeating the steps 4) -6) circularly until the algorithm is converged; 8) classifying the test samples; 9) and obtaining the classification accuracy. The invention completes the construction of the graph from two low-dimensional spaces, namely the label space and the subspace, effectively relieves the interference of noise data and redundant features on an algorithm model, ensures the quality of the graph and improves the classification effect.

Description

Semi-supervised classification method for high-dimensional data
Technical Field
The invention relates to the technical field of artificial intelligence semi-supervised learning, in particular to a high-dimensional data semi-supervised classification method.
Background
With the advent of the intelligent era, part of the traditional manufacturing industry is gradually closing to intelligent manufacturing. Aiming at a large amount of data generated in the manufacturing industry, an intelligent decision method is applied to optimize the flows of production, sales, service and the like, and the method is one of the main problems faced by intelligent manufacturing. The manufacturing industry often accumulates large amounts of data in the process of development. However, in the general case, these large data are not all tagged. In the case of a large amount of data and a small amount of labels, if we want to use a fully supervised classification algorithm to perform modeling analysis on the data to learn some patterns of the data, a satisfactory effect cannot be achieved. Then, how can the intrinsic patterns of data be learned from a large amount of data and a small number of tags? One solution is to try to label massive training data, but this is expensive and requires a lot of manpower and material resources. Obviously, a better solution is to design an algorithm model directly starting from algorithms and models, so that the algorithm model can learn a classification model with better performance and strong generalization capability from data with only a few labels. And a semi-supervised classification algorithm is just such an algorithm model. The method utilizes a small amount of labeled samples and a large amount of unlabeled samples to learn and classify the data, thereby saving the expense of manually labeling training samples. Therefore, the semi-supervised classification algorithm has important research significance, attracts the research and exploration of the majority of scientific research personnel in recent years, and has good application prospect in industry.
The semi-supervised classification algorithm based on the graph is one of the popular research directions in the semi-supervised field in recent years, because the semi-supervised classification algorithm tends to have more excellent performance. Such algorithms are based on the assumption that the data should be in manifold space and the distribution of samples should be sufficiently smooth. By smooth, it is meant that the closer the samples, i.e. the more similar the samples, the labels should be as identical as possible. In such algorithms, a graph is usually constructed to represent similarity between samples, so as to obtain a smoothness term between samples, then a loss function, a regularization term and the smoothness term are combined together to serve as an overall objective function of a model, and classifier parameters are solved by optimizing the objective function, so that a finally trained classifier not only has small classification loss on labeled samples, but also has sufficiently smooth classification results on all samples (including labeled samples and unlabeled samples).
However, some current graph-based semi-supervised classification algorithms are not well suited for the context of high-dimensional data in manufacturing. For example, data in the manufacturing industry often has missing values and data noise, which may interfere with the construction of the graph and have a certain effect on the performance of the model. Another problem is that graph-based semi-supervised classification algorithms often do not perform well when processing high-dimensional data for manufacturing due to data noise and redundancy characteristics.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, provides a high-dimensional data semi-supervised classification method, can effectively relieve the influence of data noise and redundancy characteristics in high-dimensional data on a model, integrates the construction process and the classifier training process of a graph into a unified framework, and obviously improves the classification effect under the semi-supervised classification scene.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: a semi-supervised classification method for high-dimensional data comprises the following steps:
1) inputting a training data set which is a high-dimensional data set;
2) normalizing the data, eliminating the influence of different characteristic dimensions, and simultaneously improving the speed of subsequent optimization learning;
3) initializing a regression matrix
Figure BDA0002980321250000021
Subspace projection matrix
Figure BDA0002980321250000022
Wherein d is the number of features of the sample, c is the number of sample classes,
Figure BDA0002980321250000023
a real matrix representing d rows and c columns; initializing a low rank decomposition matrix of W
Figure BDA0002980321250000024
Wherein
Figure BDA0002980321250000025
A real number matrix representing c rows and c columns; initializing a similarity matrix
Figure BDA0002980321250000026
Parameter matrix
Figure BDA0002980321250000027
Where n is the number of samples in the sample,
Figure BDA0002980321250000028
a real number matrix representing n rows and n columns; initializing a bias vector
Figure BDA0002980321250000029
Wherein
Figure BDA00029803212500000210
A real matrix representing c rows and 1 columns;
4) and (3) subspace learning: deriving an optimal solution of a low-rank decomposition matrix B, a parameter matrix C and a subspace projection matrix A according to the proposed subspace learning objective function; because the proposed objective function relates to a plurality of optimization variables, an alternate optimization method is used for iteratively updating B, C, A, the subspace quality is gradually optimized, and the optimal subspace for expressing the essential characteristics of the sample is learned;
5) comprehensively learning a sample similarity matrix from two aspects of a sample subspace and a sample label space; defining samples as nodes of the graph, defining the similarity among the samples as edges of the graph, wherein the learning process of the sample similarity matrix is the construction process of the graph;
6) learning a semi-supervised linear regression classifier, namely learning a regression matrix W and a bias vector b, on the basis of the subspace learning in the step 4) and the similarity matrix learning in the step 5);
7) circularly performing the step 4) to the step 6), and iteratively learning each variable until convergence; when convergence occurs, joint optimal solutions are obtained through three processes of subspace learning, graph construction and classifier learning;
8) classifying the test samples, assuming that the input test sample is x and the number of sample classes is c, predicting (x) of the prediction label is as follows:
Figure BDA0002980321250000031
wherein (W)Tx+b)iRepresents a vector (W)Tx + b) th element;
9) calculating the classification accuracy: and inputting a label of the test sample, comparing the label with a prediction result, and calculating the final classification accuracy.
In step 2), the data normalization step is: obtaining the maximum value X (r) corresponding to the r row datamaxAnd minimum value X (r)minAnd converting the d-th row of data according to the following formula:
Figure BDA0002980321250000032
wherein the content of the first and second substances,
Figure BDA0002980321250000041
for the ith data of the r-th row,
Figure BDA0002980321250000042
for the data after update, n is the number of samples in the dataset, d is the number of features of the samples, i ∈ {1, 2.., n }, r ∈ {1, 2.., d }.
In step 3), the initialization method is as follows: initializing a regression matrix
Figure BDA0002980321250000043
Is an all-zero matrix; initializing a low rank decomposition matrix of W
Figure BDA0002980321250000044
Is an all-zero matrix; initializing a similarity matrix
Figure BDA0002980321250000045
And parameter matrix
Figure BDA0002980321250000046
Is an all-zero matrix; initializing a bias vector
Figure BDA0002980321250000047
Is an all-zero vector; initializing a subspace projection matrix a ═ qf (r) to be an orthogonal matrix; wherein the content of the first and second substances,
Figure BDA0002980321250000048
is a random matrix with each element in the interval 0,1]And qf (·) denotes QR decomposition.
In step 4), the subspace learning process is as follows:
the objective function defining the subspace learning is:
Figure BDA0002980321250000049
wherein tr (-) is the trace of the matrix,
Figure BDA00029803212500000410
the F-norm of the matrix is represented,
Figure BDA00029803212500000411
in the form of a matrix of samples,
Figure BDA00029803212500000412
a matrix of real numbers representing d rows and n columns,
Figure BDA00029803212500000413
in the form of a regression matrix,
Figure BDA00029803212500000414
for the projection matrix of the subspace,
Figure BDA00029803212500000415
is a low-rank decomposition matrix of W,
Figure BDA00029803212500000416
is a parameter matrix; alpha, thetaBeta is an adjustable parameter;
the target functions are subjected to partial derivation of B, C, A respectively, and updating formulas of all variables can be obtained; next, each variable is updated as required:
a. according to the formula B ═ ATW updates B;
b. according to the formula C ═ XTAATX+I)-1XTAATUpdating C by X;
c. circularly updating the subspace projection matrix A according to the following formula: a. thet+1=qf(At+ G) until convergence;
wherein, I is an identity matrix, t represents the t-th iteration, AtRepresents the value of A in the t-th iteration, At+1Denotes the value of A at iteration t +1, G denotes the gradient of the objective function, G ═ 2(X (. alpha.L + theta (I-C))T)XTA-βWBT) And qf (·) denotes QR decomposition.
In step 5), the construction process of the graph is as follows: the similarity matrix is jointly learned from two aspects of the sample label space and the sample subspace, and an objective function for defining the similarity matrix learning is as follows:
Figure BDA0002980321250000051
wherein tr (-) is the trace of the matrix,
Figure BDA0002980321250000052
the F-norm of the matrix is represented,
Figure BDA0002980321250000053
is a regression matrix, and the regression matrix is,
Figure BDA0002980321250000054
is a projection matrix of the subspace,
Figure BDA0002980321250000055
is a matrix of samples of the sample to be sampled,
Figure BDA0002980321250000056
a matrix of real numbers representing d rows and n columns,
Figure BDA0002980321250000057
is a matrix of the degree of similarity (or similarity matrix),
Figure BDA0002980321250000058
is a laplacian matrix and L ═ D-S, D is a diagonal matrix,
Figure BDA0002980321250000059
Diirepresenting the elements of matrix D at row i and column i, SijRepresenting the elements in the ith row and the jth column of the similarity matrix S; the parameter λ is the weight of the regularization term;
setting the number of neighbors of each sample as k, namely, the similarity between each sample and the k neighbor samples is not 0, and the similarity is 0; let xi,xjRespectively representing the ith and jth samples; definition eijIs xiAnd xjSum of Euclidean distance in subspace and Euclidean distance in tag space, then eijThe calculation formula of (a) is as follows:
Figure BDA00029803212500000510
then, according to the solution of the objective function, an update formula of the similarity matrix S can be obtained:
Figure BDA00029803212500000511
wherein the intermediate variable
Figure BDA00029803212500000512
In step 6), the learning process of the semi-supervised linear regression classifier is as follows:
the basic objective function that defines a semi-supervised linear regression classifier is:
Figure BDA00029803212500000513
where tr (-) is the trace of the matrix,
Figure BDA00029803212500000514
the F-norm of the matrix is represented,
Figure BDA00029803212500000515
is a regression matrix, and the regression matrix is,
Figure BDA00029803212500000516
is a matrix of samples of the sample to be sampled,
Figure BDA00029803212500000517
a matrix of real numbers representing d rows and n columns,
Figure BDA00029803212500000518
is a vector of the offset to the offset,
Figure BDA00029803212500000519
is the label matrix of the sample, the parameter gamma is the regularizing term weight,
Figure BDA00029803212500000520
is a diagonal matrix if sample xiIf it is a sample with a label, Uii1, otherwise Uii0, wherein UiiRepresents the element in the ith row and ith column of the matrix U;
combining the objective function with the objective function learned by the subspace in the step 4) and the objective function learned by the similarity matrix in the step 5) to obtain a final objective function:
Figure BDA0002980321250000061
wherein Loss ═ tr ((W)TX+b1T-Y)U(WTX+b1T-Y)T),
Figure BDA0002980321250000062
Is a matrix of parameters that is,
Figure BDA0002980321250000063
is a projection matrix of the subspace,
Figure BDA0002980321250000064
is a similarity matrix;
Figure BDA0002980321250000065
is a low-rank decomposition matrix of W,
Figure BDA0002980321250000066
is a laplacian matrix and L ═ D-S, D is a diagonal matrix,
Figure BDA0002980321250000067
Diirepresenting the elements of matrix D at row i and column i, SijRepresenting the elements in the ith row and the jth column of the similarity matrix S; the parameters alpha, theta and beta are weights for adjusting the importance degree of each item;
and respectively solving the partial derivatives of the final objective function to W and b to obtain the update formulas of W and b as follows:
W=[XUcXT+αXLXT+β(I-AAT)+γI]-1XUcYT
Figure BDA0002980321250000068
wherein the intermediate variable
Figure BDA0002980321250000069
And then, updating W and b according to the updating formula, and finishing the learning process of the semi-supervised linear regression classifier.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the method has a solid mathematical theoretical basis and has great advantages of accuracy, stability and robustness. First, constructing a graph from two low-dimensional spaces, the label space and the subspace, together, can overcome the effects of redundant features in high-dimensional data in the manufacturing industry. And moreover, the graph constructed from the two spaces has higher robustness and better adapts to the characteristic of unstable data distribution. Second, in the process of learning the subspace, the low-rank property of the regression matrix is utilized, so that the subspace can distinguish different types of samples more easily. And thirdly, integrating the three processes of subspace learning, graph construction and classifier training into a unified framework, and achieving a joint optimal solution by means of cyclic alternating optimization and mutual promotion of the three processes, thereby remarkably improving the overall learning capability of the algorithm framework.
Drawings
FIG. 1 is a logic flow diagram of the present invention.
FIG. 2 is a comparison table of the accuracy of the present invention compared with the conventional semi-supervised classification algorithm and the semi-supervised classification algorithm based on the graph, SSCNGC is the abbreviation of the method of the present invention, the number thickening is the best effect, and the data format is "accuracy + -standard deviation".
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in fig. 1, the semi-supervised classification method for high-dimensional data provided by this embodiment includes the following steps:
1) and inputting a training data set which is a high-dimensional data set.
2) Normalizing the data, eliminating the influence of different characteristic dimensions, and simultaneously improving the speed of subsequent optimization learning; wherein, the data normalization step is: obtaining the maximum value X (r) corresponding to the r row datamaxAnd minimum value X (r)minAnd converting the d-th row of data according to the following formula:
Figure BDA0002980321250000071
wherein the content of the first and second substances,
Figure BDA0002980321250000072
for the ith data of the r-th row,
Figure BDA0002980321250000073
for the data after update, n is the number of samples in the dataset, d is the number of features of the samples, i ∈ {1, 2.., n }, r ∈ {1, 2.., d }.
3) Initializing a regression matrix
Figure BDA0002980321250000074
Subspace projection matrix
Figure BDA0002980321250000075
Wherein d is the number of features of the sample, c is the number of sample classes,
Figure BDA0002980321250000076
a real matrix representing d rows and c columns; initializing a low rank decomposition matrix of W
Figure BDA0002980321250000077
Wherein
Figure BDA0002980321250000078
A real number matrix representing c rows and c columns; initializing a similarity matrix
Figure BDA0002980321250000079
Parameter matrix
Figure BDA00029803212500000710
Where n is the number of samples in the sample,
Figure BDA00029803212500000711
a real number matrix representing n rows and n columns; initializing a bias vector
Figure BDA00029803212500000712
Wherein
Figure BDA00029803212500000713
A real matrix representing c rows and 1 columns;
the initialization method comprises the following steps: initializing a regression matrix
Figure BDA00029803212500000714
Is an all-zero matrix; initializing a low rank decomposition matrix of W
Figure BDA00029803212500000715
Is an all-zero matrix; initializing a similarity matrix
Figure BDA00029803212500000716
And parameter matrix
Figure BDA00029803212500000717
Is an all-zero matrix; initializing a bias vector
Figure BDA0002980321250000081
Is an all-zero vector; initializing a subspace projection matrix a ═ qf (r) to be an orthogonal matrix; wherein the content of the first and second substances,
Figure BDA0002980321250000082
is a random matrix with each element in the interval 0,1]And qf (·) denotes QR decomposition.
4) And (3) subspace learning: deriving an optimal solution of a low-rank decomposition matrix B, a parameter matrix C and a subspace projection matrix A according to the proposed subspace learning objective function; because the proposed objective function relates to a plurality of optimization variables, an alternate optimization method is used for iteratively updating B, C, A, the subspace quality is gradually optimized, and the optimal subspace for expressing the essential characteristics of the sample is learned;
the subspace learning process is as follows:
the objective function defining the subspace learning is:
Figure BDA0002980321250000083
wherein tr (-) is a matrixThe trace is a trace of the data to be written,
Figure BDA0002980321250000084
the F-norm of the matrix is represented,
Figure BDA0002980321250000085
is a sample matrix; alpha, theta, beta are adjustable parameters;
the target functions are subjected to partial derivation of B, C, A respectively, and updating formulas of all variables can be obtained; next, each variable is updated as required:
a. according to the formula B ═ ATW updates B;
b. according to the formula C ═ XTAATX+I)-1XTAATUpdating C by X;
c. circularly updating the subspace projection matrix A according to the following formula: a. thet+1=qf(At+ G) until convergence;
wherein, I is an identity matrix, t represents the t-th iteration, AtRepresents the value of A in the t-th iteration, At+1Denotes the value of A at iteration t +1, G denotes the gradient of the objective function, G ═ 2(X (. alpha.L + theta (I-C))T)XTA-βWBT) And qf (·) denotes QR decomposition.
5) Comprehensively learning a sample similarity matrix from two aspects of a sample subspace and a sample label space; defining samples as nodes of the graph, defining the similarity among the samples as edges of the graph, wherein the learning process of the sample similarity matrix is the construction process of the graph;
the construction process of the graph is as follows: the similarity matrix is jointly learned from two aspects of the sample label space and the sample subspace, and an objective function for defining the similarity matrix learning is as follows:
Figure BDA0002980321250000091
wherein the parameter λ is a weight of the regularization term;
let the number of neighbors of each sample be k, i.e., each sample has only a similarity with k neighbor samples other than0, and the others are all 0; let xi,xjRespectively representing the ith and jth samples; definition eijIs xiAnd xjSum of Euclidean distance in subspace and Euclidean distance in tag space, then eijThe calculation formula of (a) is as follows:
Figure BDA0002980321250000092
then, according to the solution of the objective function, an update formula of the similarity matrix S can be obtained:
Figure BDA0002980321250000093
wherein the intermediate variable
Figure BDA0002980321250000094
6) Learning a semi-supervised linear regression classifier, namely learning a regression matrix W and a bias vector b, on the basis of the subspace learning in the step 4) and the similarity matrix learning in the step 5);
the learning process of the semi-supervised linear regression classifier is as follows:
the basic objective function that defines a semi-supervised linear regression classifier is:
Figure BDA0002980321250000095
wherein the content of the first and second substances,
Figure BDA0002980321250000096
is a diagonal matrix if sample xiIf it is a sample with a label, Uii1, otherwise Uii0, wherein UiiRepresents the element in the ith row and ith column of the matrix U;
combining the objective function with the objective function learned by the subspace in the step 4) and the objective function learned by the similarity matrix in the step 5) to obtain a final objective function:
Figure BDA0002980321250000097
wherein Loss ═ tr ((W)TX+b1T-Y)U(WTX+b1T-Y)T) (ii) a The parameters alpha, theta and beta are weights for adjusting the importance degree of each item;
and respectively solving the partial derivatives of the final objective function to W and b to obtain the update formulas of W and b as follows:
W=[XUcXT+αXLXT+β(I-AAT)+γI]-1XUcYT
Figure BDA0002980321250000101
wherein the intermediate variable
Figure BDA0002980321250000102
And then, updating W and b according to the updating formula, and finishing the learning process of the semi-supervised linear regression classifier.
7) Circularly performing the step 4) to the step 6), and iteratively learning each variable until convergence; when convergence occurs, the joint optimal solution is obtained through the three processes of subspace learning, graph construction and classifier learning.
8) Classifying the test samples, and assuming that the input test sample is x, predicting (x) the prediction label is:
Figure BDA0002980321250000103
wherein (W)Tx+b)iRepresents a vector (W)Tx + b) th element.
9) Calculating the classification accuracy: and inputting a label of the test sample, comparing the label with a prediction result, and calculating the final classification accuracy.
FIG. 2 is a comparison table of the accuracy of the present invention compared with the conventional semi-supervised classification algorithm and the semi-supervised classification algorithm based on the graph, SSCNGC is the abbreviation of the method of the present invention, the number thickening is the best effect, and the data format is "accuracy + -standard deviation". As can be seen from the figure, in the experiment of 16 high-dimensional data sets, the invention achieves the highest accuracy on 15 data sets and achieves the improvement of more than 5% on 9 data sets, which shows that the invention has stronger superiority compared with the traditional semi-supervised algorithm.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (6)

1. A semi-supervised classification method for high-dimensional data is characterized by comprising the following steps:
1) inputting a training data set which is a high-dimensional data set;
2) normalizing the data, eliminating the influence of different characteristic dimensions, and simultaneously improving the speed of subsequent optimization learning;
3) initializing a regression matrix
Figure FDA0002980321240000011
Subspace projection matrix
Figure FDA0002980321240000012
Wherein d is the number of features of the sample, c is the number of sample classes,
Figure FDA0002980321240000013
a real matrix representing d rows and c columns; initializing a low rank decomposition matrix of W
Figure FDA0002980321240000014
Wherein
Figure FDA0002980321240000015
A real number matrix representing c rows and c columns; initializing a similarity matrix
Figure FDA0002980321240000016
Parameter matrix
Figure FDA0002980321240000017
Where n is the number of samples in the sample,
Figure FDA0002980321240000018
a real number matrix representing n rows and n columns; initializing a bias vector
Figure FDA0002980321240000019
Wherein
Figure FDA00029803212400000110
A real matrix representing c rows and 1 columns;
4) and (3) subspace learning: deriving an optimal solution of a low-rank decomposition matrix B, a parameter matrix C and a subspace projection matrix A according to the proposed subspace learning objective function; because the proposed objective function relates to a plurality of optimization variables, an alternate optimization method is used for iteratively updating B, C, A, the subspace quality is gradually optimized, and the optimal subspace for expressing the essential characteristics of the sample is learned;
5) comprehensively learning a sample similarity matrix from two aspects of a sample subspace and a sample label space; defining samples as nodes of the graph, defining the similarity among the samples as edges of the graph, wherein the learning process of the sample similarity matrix is the construction process of the graph;
6) learning a semi-supervised linear regression classifier, namely learning a regression matrix W and a bias vector b, on the basis of the subspace learning in the step 4) and the similarity matrix learning in the step 5);
7) circularly performing the step 4) to the step 6), and iteratively learning each variable until convergence; when convergence occurs, joint optimal solutions are obtained through three processes of subspace learning, graph construction and classifier learning;
8) classifying the test samples, assuming that the input test sample is x and the number of sample classes is c, predicting (x) of the prediction label is as follows:
Figure FDA0002980321240000021
wherein (W)Tx+b)iRepresents a vector (W)Tx + b) th element;
9) calculating the classification accuracy: and inputting a label of the test sample, comparing the label with a prediction result, and calculating the final classification accuracy.
2. The semi-supervised classification method for high-dimensional data as recited in claim 1, wherein: in step 2), the data normalization step is: obtaining the maximum value X (r) corresponding to the r row datamaxAnd minimum value X (r)minAnd converting the d-th row of data according to the following formula:
Figure FDA0002980321240000022
wherein the content of the first and second substances,
Figure FDA0002980321240000023
for the ith data of the r-th row,
Figure FDA0002980321240000024
for the data after update, n is the number of samples in the dataset, d is the number of features of the samples, i ∈ {1, 2.., n }, r ∈ {1, 2.., d }.
3. The semi-supervised classification method for high-dimensional data as recited in claim 1, wherein: in step 3), the initialization method is as follows: initializing a regression matrix
Figure FDA0002980321240000025
Is an all-zero matrix; initializing a low rank decomposition matrix of W
Figure FDA0002980321240000026
Is an all-zero matrix; initializing a similarity matrix
Figure FDA0002980321240000027
And parameter matrix
Figure FDA0002980321240000028
Is an all-zero matrix; initializing a bias vector
Figure FDA0002980321240000029
Is an all-zero vector; initializing a subspace projection matrix a ═ qf (r) to be an orthogonal matrix; wherein the content of the first and second substances,
Figure FDA00029803212400000210
is a random matrix with each element in the interval 0,1]And qf (·) denotes QR decomposition.
4. The semi-supervised classification method for high-dimensional data as recited in claim 1, wherein: in step 4), the subspace learning process is as follows:
the objective function defining the subspace learning is:
Figure FDA00029803212400000211
wherein tr (-) is the trace of the matrix,
Figure FDA0002980321240000031
the F-norm of the matrix is represented,
Figure FDA0002980321240000032
in the form of a matrix of samples,
Figure FDA0002980321240000033
a matrix of real numbers representing d rows and n columns,
Figure FDA0002980321240000034
in the form of a regression matrix,
Figure FDA0002980321240000035
for the projection matrix of the subspace,
Figure FDA0002980321240000036
is a low-rank decomposition matrix of W,
Figure FDA0002980321240000037
is a parameter matrix; alpha, theta, beta are adjustable parameters;
the target functions are subjected to partial derivation of B, C, A respectively, and updating formulas of all variables can be obtained; next, each variable is updated as required:
a. according to the formula B ═ ATW updates B;
b. according to the formula C ═ XTAATX+I)-1XTAATUpdating C by X;
c. circularly updating the subspace projection matrix A according to the following formula: a. thet+1=qf(At+ G) until convergence;
wherein, I is an identity matrix, t represents the t-th iteration, AtRepresents the value of A in the t-th iteration, At+1Denotes the value of A at iteration t +1, G denotes the gradient of the objective function, G ═ 2(X (. alpha.L + theta (I-C))T)XTA-βWBT) And qf (·) denotes QR decomposition.
5. The semi-supervised classification method for high-dimensional data as recited in claim 1, wherein: in step 5), the construction process of the graph is as follows: the similarity matrix is jointly learned from two aspects of the sample label space and the sample subspace, and an objective function for defining the similarity matrix learning is as follows:
Figure FDA0002980321240000038
wherein tr (-) is the trace of the matrix,
Figure FDA0002980321240000039
the F-norm of the matrix is represented,
Figure FDA00029803212400000310
is a regression matrix, and the regression matrix is,
Figure FDA00029803212400000311
is a projection matrix of the subspace,
Figure FDA00029803212400000312
is a matrix of samples of the sample to be sampled,
Figure FDA00029803212400000313
a matrix of real numbers representing d rows and n columns,
Figure FDA00029803212400000314
is a matrix of the degree of similarity (or similarity matrix),
Figure FDA00029803212400000315
is a laplacian matrix and L ═ D-S, D is a diagonal matrix,
Figure FDA00029803212400000316
Diirepresenting the elements of matrix D at row i and column i, SijRepresenting the elements in the ith row and the jth column of the similarity matrix S; the parameter λ is the weight of the regularization term;
setting the number of neighbors of each sample as k, namely, the similarity between each sample and the k neighbor samples is not 0, and the similarity is 0; let xi,xjRespectively representing the ith and jth samples; definition eijIs xiAnd xjSum of Euclidean distance in subspace and Euclidean distance in tag space, then eijThe calculation formula of (a) is as follows:
Figure FDA0002980321240000041
then, according to the solution of the objective function, an update formula of the similarity matrix S can be obtained:
Figure FDA0002980321240000042
wherein the intermediate variable
Figure FDA0002980321240000043
6. The semi-supervised classification method for high-dimensional data as recited in claim 1, wherein: in step 6), the learning process of the semi-supervised linear regression classifier is as follows:
the basic objective function that defines a semi-supervised linear regression classifier is:
Figure FDA0002980321240000044
where tr (-) is the trace of the matrix,
Figure FDA0002980321240000045
the F-norm of the matrix is represented,
Figure FDA0002980321240000046
is a regression matrix, and the regression matrix is,
Figure FDA0002980321240000047
is a matrix of samples of the sample to be sampled,
Figure FDA0002980321240000048
a matrix of real numbers representing d rows and n columns,
Figure FDA0002980321240000049
is a vector of the offset to the offset,
Figure FDA00029803212400000410
is the label matrix of the sample, the parameter gamma is the regularizing term weight,
Figure FDA00029803212400000411
is a diagonal matrix if sample xiIf it is a sample with a label, Uii1, otherwise Uii0, wherein UiiRepresents the element in the ith row and ith column of the matrix U;
combining the objective function with the objective function learned by the subspace in the step 4) and the objective function learned by the similarity matrix in the step 5) to obtain a final objective function:
Figure FDA00029803212400000412
wherein Loss ═ tr ((W)TX+b1T-Y)U(WTX+b1T-Y)T),
Figure FDA00029803212400000413
Is a matrix of parameters that is,
Figure FDA00029803212400000414
is a projection matrix of the subspace,
Figure FDA00029803212400000415
is the moment of similarityArraying;
Figure FDA00029803212400000416
is a low-rank decomposition matrix of W,
Figure FDA00029803212400000417
is a laplacian matrix and L ═ D-S, D is a diagonal matrix,
Figure FDA00029803212400000418
Diirepresenting the elements of matrix D at row i and column i, SijRepresenting the elements in the ith row and the jth column of the similarity matrix S; the parameters alpha, theta and beta are weights for adjusting the importance degree of each item;
and respectively solving the partial derivatives of the final objective function to W and b to obtain the update formulas of W and b as follows:
W=[XUcXT+αXLXT+β(I-AAT)+γI]-1XUcYT
Figure FDA0002980321240000051
wherein the intermediate variable
Figure FDA0002980321240000052
And then, updating W and b according to the updating formula, and finishing the learning process of the semi-supervised linear regression classifier.
CN202110285595.XA 2021-03-17 2021-03-17 Semi-supervised classification method for high-dimensional data Active CN113033641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110285595.XA CN113033641B (en) 2021-03-17 2021-03-17 Semi-supervised classification method for high-dimensional data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110285595.XA CN113033641B (en) 2021-03-17 2021-03-17 Semi-supervised classification method for high-dimensional data

Publications (2)

Publication Number Publication Date
CN113033641A true CN113033641A (en) 2021-06-25
CN113033641B CN113033641B (en) 2022-12-16

Family

ID=76471055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110285595.XA Active CN113033641B (en) 2021-03-17 2021-03-17 Semi-supervised classification method for high-dimensional data

Country Status (1)

Country Link
CN (1) CN113033641B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841214A (en) * 2022-05-18 2022-08-02 杭州电子科技大学 Pulse data classification method and device based on semi-supervised discrimination projection

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968639A (en) * 2012-09-28 2013-03-13 武汉科技大学 Semi-supervised image clustering subspace learning algorithm based on local linear regression
WO2015167526A1 (en) * 2014-04-30 2015-11-05 Hewlett-Packard Development Company, L.P Facilitating interpretation of high-dimensional data clusters
CN106778832A (en) * 2016-11-28 2017-05-31 华南理工大学 The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization
CN109784392A (en) * 2019-01-07 2019-05-21 华南理工大学 A kind of high spectrum image semisupervised classification method based on comprehensive confidence
US20200019817A1 (en) * 2018-07-11 2020-01-16 Harbin Institute Of Technology Superpixel classification method based on semi-supervised k-svd and multiscale sparse representation
CN111027582A (en) * 2019-09-20 2020-04-17 哈尔滨理工大学 Semi-supervised feature subspace learning method and device based on low-rank graph learning
CN112232438A (en) * 2020-11-05 2021-01-15 华东理工大学 High-dimensional image representation-oriented multi-kernel subspace learning framework

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968639A (en) * 2012-09-28 2013-03-13 武汉科技大学 Semi-supervised image clustering subspace learning algorithm based on local linear regression
WO2015167526A1 (en) * 2014-04-30 2015-11-05 Hewlett-Packard Development Company, L.P Facilitating interpretation of high-dimensional data clusters
CN106778832A (en) * 2016-11-28 2017-05-31 华南理工大学 The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization
US20200019817A1 (en) * 2018-07-11 2020-01-16 Harbin Institute Of Technology Superpixel classification method based on semi-supervised k-svd and multiscale sparse representation
CN109784392A (en) * 2019-01-07 2019-05-21 华南理工大学 A kind of high spectrum image semisupervised classification method based on comprehensive confidence
CN111027582A (en) * 2019-09-20 2020-04-17 哈尔滨理工大学 Semi-supervised feature subspace learning method and device based on low-rank graph learning
CN112232438A (en) * 2020-11-05 2021-01-15 华东理工大学 High-dimensional image representation-oriented multi-kernel subspace learning framework

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张乙东: "自适应半监督集成分类算法在高维数据上的研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841214A (en) * 2022-05-18 2022-08-02 杭州电子科技大学 Pulse data classification method and device based on semi-supervised discrimination projection

Also Published As

Publication number Publication date
CN113033641B (en) 2022-12-16

Similar Documents

Publication Publication Date Title
CN111583263B (en) Point cloud segmentation method based on joint dynamic graph convolution
Tian et al. Multiple classifier combination for recognition of wheat leaf diseases
Fan et al. Multi-view subspace learning via bidirectional sparsity
Liu et al. Group collaborative representation for image set classification
CN111931814B (en) Unsupervised countering domain adaptation method based on intra-class structure tightening constraint
CN110263855B (en) Method for classifying images by utilizing common-basis capsule projection
CN113869404B (en) Self-adaptive graph roll accumulation method for paper network data
CN111259938B (en) Manifold learning and gradient lifting model-based image multi-label classification method
Tao et al. RDEC: integrating regularization into deep embedded clustering for imbalanced datasets
Wang et al. Hyperspectral image classification based on domain adversarial broad adaptation network
Dwivedi et al. A leaf disease detection mechanism based on L1-norm minimization extreme learning machine
CN116071560A (en) Fruit identification method based on convolutional neural network
CN116883723A (en) Combined zero sample image classification method based on parallel semantic embedding
CN116645579A (en) Feature fusion method based on heterogeneous graph attention mechanism
CN113033641B (en) Semi-supervised classification method for high-dimensional data
Chen et al. Deep convolutional network for citrus leaf diseases recognition
Zhou et al. Semantic adaptation network for unsupervised domain adaptation
CN110175631A (en) A kind of multiple view clustering method based on common Learning Subspaces structure and cluster oriental matrix
You et al. Robust structure low-rank representation in latent space
CN111488923A (en) Enhanced anchor point image semi-supervised classification method
Chen et al. Semi-supervised convolutional neural networks with label propagation for image classification
CN114168822A (en) Method for establishing time series data clustering model and time series data clustering
Nikolaou et al. Margin maximization as lossless maximal compression
Du et al. Robust spectral clustering via matrix aggregation
CN112766354A (en) Knowledge graph-based small sample picture identification method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant