CN109829472B - Semi-supervised classification method based on probability nearest neighbor

Semi-supervised classification method based on probability nearest neighbor

Info

Publication number
CN109829472B
CN109829472B (application CN201811598286.2A)
Authority
CN
China
Prior art keywords
matrix
data
probability
data set
neighbor
Prior art date
Legal status
Active
Application number
CN201811598286.2A
Other languages
Chinese (zh)
Other versions
CN109829472A (en)
Inventor
马君亮
汪西莉
刘琦
肖冰
唐铭英
Current Assignee
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date
Filing date
Publication date
Application filed by Shaanxi Normal University
Priority to CN201811598286.2A
Publication of CN109829472A
Application granted
Publication of CN109829472B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a semi-supervised classification method based on probability nearest neighbors, comprising the steps of preparing a data set, preprocessing it, constructing a probability nearest-neighbor matrix S, initializing a probability transition matrix P to the probability nearest-neighbor matrix S, performing label propagation on the original data set, updating, checking convergence, and obtaining the classification result. The method addresses the problem that, in existing graph-based semi-supervised classification, the constructed similarity graph does not reflect the actual situation well and classification is inaccurate; the disclosed classification method models the actual situation more closely, so the classification result is more accurate.

Description

Semi-supervised classification method based on probability nearest neighbor
Technical Field
The present disclosure relates to data classification methods, and in particular, to a semi-supervised classification method based on probability neighbors.
Background
Existing data classification methods include supervised, semi-supervised, and unsupervised classification. Supervised classification requires a large amount of labeled data to train a model, which limits its application scenarios. Unsupervised classification needs no class information and is widely applicable, but the lack of class information leads to poor classification performance. Semi-supervised classification needs only a small amount of labeled data, so the acquisition cost is low, and by learning the data distribution from a large amount of unlabeled data it can achieve good classification performance; it therefore has a wide range of application scenarios.
Graph-based semi-supervised classification is an important branch of semi-supervised classification; because it makes full use of the relationships between data points, it often achieves good results and has attracted considerable attention. In these methods the relationships between data points are represented in a similarity graph, and since a graph structure corresponds one-to-one to a matrix, the similarity graph can be represented by a similarity matrix. This matrix can further be converted into a probability transition matrix for propagating labels between data points, so that label propagation driven by the category information of the labeled data yields a classification result.
However, current graph-based semi-supervised classification methods usually construct the similarity graph with the K-nearest-neighbor (KNN) or ε-neighborhood method. During graph construction only the attribute features of the data are used and the class information of the labeled data is ignored, so the resulting similarity graph does not reflect the actual situation well and the classification results are correspondingly inaccurate.
Disclosure of Invention
To address these problems, the present disclosure provides a semi-supervised classification method based on probability nearest neighbors. It solves the problem that, in existing graph-based semi-supervised classification, the constructed similarity graph does not reflect the actual situation well and classification is inaccurate; the disclosed classification method models the actual situation more closely, so the classification results are more accurate.
Specifically, the disclosure provides a semi-supervised classification method based on probability nearest neighbor, which comprises the following steps:
S100, preparing an original data set, wherein the original data set comprises original data and their corresponding labels, the characteristics of the original data are described by data attributes, and the original data comprise labeled data and unlabeled data;
S200, preprocessing the data set prepared in step S100 to obtain the label indication vector matrix V_L of the labeled data;
S300, constructing the probability neighbor matrix S of the original data set;
S400, constructing the category information matrix F and the probability transition matrix P of the original data set, and initializing P to the probability neighbor matrix S obtained in step S300;
S500, performing label propagation on the original data set based on the probability transition matrix P initialized in step S400, to obtain a new category information matrix F';
S600, updating the labeled-data block F_L of the new category information matrix F' after label propagation with the label indication vector matrix V_L of the labeled data, so as to prevent contamination of the label information;
S700, checking whether the new category information matrix F' obtained in step S500 has converged; if F' has converged and no longer changes, proceeding to step S800, otherwise returning to step S500;
S800, obtaining the classification result of the original data set from the converged category information matrix F'.
Compared with the prior art, the method has the following beneficial technical effects:
(1) When constructing the graph, the disclosed classification method adds the class information of the data to the computation of the similarity between two data points, so the constructed similarity graph reflects the actual situation of the data well; compared with existing semi-supervised classification methods, this improves the accuracy of data classification;
(2) The classification method defines a probability neighbor matrix and treats graph construction as a probability problem, which makes the classification results more accurate.
Drawings
FIG. 1 illustrates a flow chart of a probability neighbor based semi-supervised classification method of the present disclosure;
FIG. 2 is a schematic representation of the values of a probability transition matrix constructed based on the prior KNN method;
FIG. 3 illustrates a schematic representation of the values of a probability transition matrix constructed by the method of the present disclosure;
FIG. 4 compares the classification accuracy of the existing KNN method and the classification method of the present disclosure on several different data sets.
Detailed Description
The following describes a specific flow of the probability neighbor based semi-supervised classification method of the present disclosure with reference to fig. 1.
In one embodiment, a semi-supervised classification method based on probability neighbors is provided, comprising:
S100, preparing an original data set, wherein the original data set comprises original data and their corresponding labels, the characteristics of the original data are described by data attributes, and the original data comprise labeled data and unlabeled data;
S200, preprocessing the data set prepared in step S100 to obtain the label indication vector matrix V_L of the labeled data;
S300, constructing the probability neighbor matrix S of the original data set;
S400, constructing the category information matrix F and the probability transition matrix P of the original data set, and initializing P to the probability neighbor matrix S obtained in step S300;
S500, performing label propagation on the original data set based on the probability transition matrix P initialized in step S400, to obtain a new category information matrix F';
S600, updating the labeled-data block F_L of the new category information matrix F' after label propagation with the label indication vector matrix V_L of the labeled data, so as to prevent contamination of the label information;
S700, checking whether the new category information matrix F' obtained in step S500 has converged; if F' has converged and no longer changes, proceeding to step S800, otherwise returning to step S500;
S800, obtaining the classification result of the original data set from the converged category information matrix F'.
In this embodiment, the label information pollution referred to in step S600 means that, after the processing of step S500, the labels of the labeled data may have been changed and may therefore be wrong; this is what is called label information pollution.
In this embodiment the implementation of the proposed classification method is described in detail, including preparing the data set, preprocessing it, constructing the probability neighbor matrix S, initializing the probability transition matrix P to S, performing label propagation on the original data set, updating, checking convergence, and obtaining the classification result. The disclosed method models the actual situation of the data more closely, so the classification result is more accurate.
In a preferred embodiment, the step S200 specifically includes:
S201, defining the original data set as X ∈ R^{n×d}, where R^{n×d} denotes a matrix of n rows and d columns, n is the number of data points, and d is the number of attributes of each data point; defining the labeled data matrix (with n1 rows) and the unlabeled data matrix (with n2 rows), where n1 is the number of labeled data points, n2 is the number of unlabeled data points, and n = n1 + n2;
S202, constructing the label indication vector matrix of the labeled data from the labels of the labeled data:
constructing a label indication vector matrix V ∈ R^{n×c} from the labels corresponding to the labeled data of the original data set, where n is the number of data points and c is the number of data classes; the ith row of V is the label indication vector of the ith data point, and if the class of the ith data point is j, the jth element of that row is 1 and all other elements of that row are 0. A portion of the data points is drawn from the original data set at a given ratio, and the corresponding rows of V, taken in the order of the drawn data points, form the matrix V_L, which serves as the label indication vector matrix of the labeled data.
In this embodiment, how the data set is preprocessed in step S200 to obtain the label indication vector matrix V_L of the labeled data is described in detail. For the selection ratio in step S202, 5%-10% of the original data may be chosen as labeled data to construct V_L, depending on the actual situation, with the remainder treated as unlabeled data. In the experiments reported below, a ratio of 10% was used.
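The preprocessing of step S200 can be illustrated with the following Python sketch; the helper names, the random draw of labeled points, and the default 10% ratio are illustrative assumptions based on the description above, not part of the patented method.

```python
# Illustrative sketch of step S200 (assumed helper names, not from the patent):
# build the one-hot label indication matrix V and draw a fraction of the points
# (5%-10%; 10% in the experiments described below) as labeled data to form V_L.
import numpy as np

def build_label_indicator(y, n_classes):
    """V in R^{n x c}: row i has a 1 in column j if data point i belongs to class j."""
    y = np.asarray(y)
    V = np.zeros((len(y), n_classes))
    V[np.arange(len(y)), y] = 1.0
    return V

def select_labeled(y, n_classes, ratio=0.1, seed=0):
    """Draw `ratio` of the points as labeled data; return their indices and the
    corresponding block V_L of the label indication matrix V."""
    rng = np.random.default_rng(seed)
    n = len(y)
    labeled_idx = rng.choice(n, size=max(1, int(ratio * n)), replace=False)
    V = build_label_indicator(y, n_classes)
    return labeled_idx, V[labeled_idx]   # V_L: label indication vectors of the labeled data
```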
In a preferred embodiment, the step S300 specifically includes:
S301, defining an augmented matrix A ∈ R^{n×(d+c)} for the data, where d is the data dimension (the number of data attributes), c is the number of data classes, n is the number of data points, and A = [X, V];
S302, defining D ∈ R^{n×n} as the similarity matrix between data points, and computing the similarity d_{i,j} between data points i and j using the Euclidean distance as the measure, where d_{i,j} is the value in the ith row and jth column of the matrix D; each row of D is sorted in ascending order, X_i denotes the ith row of the original data set X, and X_j denotes the jth row of the original data set X;
S303, defining the probability neighbor matrix of the original data set as S ∈ R^{n×n}, an n×n matrix whose value S_{i,j} in the ith row and jth column represents the possibility that data points i and j become probability neighbors; defining K as the number of neighbors, selecting the first K entries of each row of D, as in the K-nearest-neighbor (KNN) method, to obtain the K nearest neighbors, and constructing the probability neighbors on this basis, as follows: p(i,k) is defined as the probability that data point i and its kth neighbor become probability neighbors, and is computed from the sorted distances in the ith row of D;
the entry of S corresponding to the kth neighbor of the ith data point, S(i,k), is then assigned this value, and every unassigned element of S is set to 0, giving the assigned probability neighbor matrix S; here D_{i,K+1} denotes the value in the ith row and (K+1)th column of the matrix D, D_{i,j} the value in the ith row and jth column, and D_{i,1} the value in the ith row and first column; n is the number of data points and X denotes the original data set.
In this embodiment, the construction of the probability neighbor matrix S of the original data set in step S300 is described in detail: the class information of the data is added when computing the similarity between two data points during graph construction, a probability neighbor matrix is defined, and graph construction is treated as a probability problem. The classification method is therefore closer to the actual situation and the classification result is more accurate.
In this embodiment, good results can be obtained by exploiting the local characteristics of the data; the number of neighbors K in step S303 can be chosen between 5 and 20.
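The construction of the probability neighbor matrix S can be illustrated with the sketch below. The patent's exact probability formula is not reproduced in this text, so the weighting used here, an adaptive-neighbor-style weight built from the sorted row distances and the quantities D_{i,K+1}, D_{i,j} and D_{i,1} named in step S303, is an assumption; computing the distances on the augmented matrix A = [X, V] is likewise an interpretation of the statement that class information is added when building the graph.

```python
# Sketch of step S300 under stated assumptions: the exact probability formula is
# not given in this text, so an adaptive-neighbor style weight over the sorted
# row distances D[i, 1..K+1] is assumed, and the distances are computed on the
# augmented matrix A = [X, V] so that class information enters the graph.
import numpy as np

def probability_neighbor_matrix(X, V, K=10):
    n = X.shape[0]
    A = np.hstack([X, V])                                           # S301: A = [X, V]
    dist = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)   # S302: Euclidean distances
    order = np.argsort(dist, axis=1)                                # sort each row ascending
    D = np.take_along_axis(dist, order, axis=1)                     # D[i, 0] = 0 (distance to itself)
    S = np.zeros((n, n))
    for i in range(n):
        d_ref = D[i, K + 1]                                         # distance to the (K+1)-th neighbor
        denom = K * d_ref - D[i, 1:K + 1].sum()
        for k in range(1, K + 1):                                   # the K nearest neighbors of point i
            p_ik = (d_ref - D[i, k]) / denom if denom > 1e-12 else 1.0 / K
            S[i, order[i, k]] = p_ik                                # unassigned entries remain 0
    return S
```

With this weighting each row of S sums to 1, so S can be used directly as a row-stochastic transition matrix in the later steps.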
In a preferred embodiment, the step S400 specifically includes:
Constructing the category information matrix F of the original data set: F(0) = [F_L, 0], where F(0) denotes the initialized F matrix, F_L is the labeled-data block holding the label indication vectors of the labeled data, and the rows corresponding to unlabeled data are 0;
Constructing the probability transition matrix P of the original data set: initializing P to the assigned probability neighbor matrix S obtained in step S300, i.e., P(0) = S, where P ∈ R^{n×n} is the probability transition matrix, P(0) denotes the initialized P matrix, and P_{i,j}, the value in the ith row and jth column of P, indicates that the ith data point propagates its label to the jth data point with probability P_{i,j}.
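A minimal sketch of step S400, assuming the labeled points occupy the first n1 rows of the data set:

```python
# Sketch of step S400: F(0) = [F_L, 0] with F_L = V_L, and P(0) = S.
# Assumes the labeled points occupy the first rows of the data set.
import numpy as np

def init_category_and_transition(V_L, S):
    n = S.shape[0]
    n1, c = V_L.shape
    F = np.zeros((n, c))
    F[:n1] = V_L          # labeled rows carry their one-hot label indication vectors
    P = S.copy()          # the transition matrix starts as the probability neighbor matrix
    return F, P
```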
In a preferred embodiment, the step S500 specifically includes:
Label propagation is performed by the formula F'(t+1) = P * F'(t), where the superscript t denotes the iteration number, F'(t) denotes the F' matrix obtained at the tth iteration, F'(t+1) denotes the F' matrix obtained at the (t+1)th iteration, and the formula describes the label propagation from iteration t to iteration t+1; when t = 0, F'(t) is the original category information matrix F.
In this embodiment the label propagation process is described: starting from the original category information matrix F, several iterations are carried out with the probability transition matrix, so that labels are propagated from the labeled data to the unlabeled data and the unlabeled data are classified.
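A single propagation iteration of step S500 is then one matrix product; the following sketch assumes P and F are NumPy arrays built as above:

```python
# Sketch of step S500: one label-propagation iteration, F'(t+1) = P @ F'(t).
def propagate_labels(P, F):
    return P @ F   # each point mixes its neighbors' class information with probability P[i, j]
```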
In a preferred embodiment, the step S600 specifically includes:
Setting F_L = V_L, i.e., reassigning the labeled-data block F_L to the matrix V_L after each iteration of label propagation, where F_L denotes the labeled-data block of the category information matrix F and V_L denotes the label indication vector matrix of the labeled data.
In this embodiment, this reassignment is required to prevent the labels of the labeled data from being contaminated by the processing of step S500.
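A minimal sketch of this reassignment, again assuming the labeled points occupy the first rows:

```python
# Sketch of step S600: clamp the labeled block after each iteration, F_L = V_L,
# so that propagation cannot contaminate the known labels (labeled rows assumed first).
def clamp_labeled(F, V_L):
    F = F.copy()
    F[:V_L.shape[0]] = V_L
    return F
```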
In a preferred embodiment, in the step S800:
Each row of the converged category information matrix F' indicates the category of the corresponding data point: if F'_{i,j} is 1, the category of the ith data point is j. The category of every data point can therefore be read from the F' matrix, giving the classification result of the original data set.
Classifying data in this way models the actual situation more closely, so the classification result is more accurate.
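Putting steps S400-S800 together, and reusing the helper sketches above, gives the following illustrative driver loop; the tolerance, the iteration cap, and the argmax read-out of the converged F' are assumptions (for exact one-hot rows the argmax gives the same class as the F'_{i,j} = 1 rule described above):

```python
# Illustrative end-to-end loop for steps S400-S800 (tolerance, iteration cap and
# the argmax read-out are assumptions, not values specified by the patent).
import numpy as np

def semi_supervised_classify(S, V_L, tol=1e-6, max_iter=1000):
    F, P = init_category_and_transition(V_L, S)                # S400
    for _ in range(max_iter):
        F_new = clamp_labeled(propagate_labels(P, F), V_L)     # S500 + S600
        if np.abs(F_new - F).max() < tol:                      # S700: F' no longer changes
            F = F_new
            break
        F = F_new
    return F.argmax(axis=1)                                    # S800: class of each data point
```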
Experiment:
To verify the advantages of the disclosed probability-neighbor-based semi-supervised classification method over the existing KNN-based classification method, comparative experiments were carried out.
The first experiment uses the two-moons data set, a well-known synthetic data set in which the number of data points can be chosen as needed; 400 points were used in this experiment. 10% of the data set was selected as labeled data to construct V_L, and the rest was used as unlabeled data.
Fig. 2 shows the values of the probability transition matrix constructed with the existing KNN method, and Fig. 3 shows the values of the probability transition matrix constructed with the disclosed method. Both figures show the probability transition matrices generated by the two methods on the same data set, and the gray level of each point represents the probability that the corresponding nodes are neighbors of one another. The results show that both matrices contain only a small proportion of non-zero elements, so the corresponding graph structures are relatively sparse. The difference is that in Fig. 2 all non-zero elements have the same gray level, meaning that in the KNN method neighboring points share the same label propagation probability, which does not match the actual situation; in Fig. 3 the non-zero elements have different gray levels, so the label propagation probabilities of neighboring points differ, which matches the actual situation better and expresses the interrelationships between the data.
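For comparison, the uniform gray levels in Fig. 2 correspond to a KNN transition matrix in which each of the K neighbors receives the same weight 1/K; the following minimal sketch of that baseline is an assumption about the comparison method, not code from the patent:

```python
# Minimal sketch of the KNN baseline transition matrix behind Fig. 2: every one of
# the K neighbors gets the same propagation probability 1/K, which is why all
# non-zero entries share a single gray level (assumed baseline, not the patent's code).
import numpy as np

def knn_transition_matrix(X, K=10):
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    P = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(dist[i])[1:K + 1]   # skip the point itself
        P[i, neighbors] = 1.0 / K                  # uniform weights: identical gray levels
    return P
```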
In the second experiment, several existing data sets (listed on the abscissa of Fig. 4) were classified with the KNN plus label propagation method (KNN+LP) and with the probability-neighbor plus label propagation method of this disclosure (PNN+LP). Fig. 4 compares the classification accuracy of the two methods on the different data sets; the results obtained with the disclosed classification method (PNN+LP) are superior to the prior art, with higher classification accuracy.
In summary, the method solves the problem that, in existing graph-based semi-supervised classification, the constructed similarity graph does not reflect the actual situation well and classification is inaccurate; compared with the prior art, the disclosed classification method models the actual situation more closely, so the classification result is more accurate.
Although the embodiments of the present invention have been described above with reference to the accompanying drawings, the present invention is not limited to the above-described specific embodiments and application fields, and the above-described specific embodiments are merely illustrative, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous forms of the invention without departing from the scope of the invention as claimed.

Claims (8)

1. A computer-implemented semi-supervised classification method based on probability neighbors, the method addressing the problem that classification is inaccurate because of how the similarity graph is constructed in existing graph-based semi-supervised classification, the method comprising the following steps:
S100, preparing an original data set, wherein the original data set comprises original data and their corresponding labels, the characteristics of the original data are described by data attributes, and the original data comprise labeled data and unlabeled data;
S200, preprocessing the data set prepared in step S100 to obtain the label indication vector matrix V_L of the labeled data;
S300, constructing the probability neighbor matrix S of the original data set;
S400, constructing the category information matrix F and the probability transition matrix P of the original data set, and initializing P to the probability neighbor matrix S obtained in step S300;
S500, performing label propagation on the original data set based on the probability transition matrix P initialized in step S400, to obtain a new category information matrix F';
S600, updating the labeled-data block F_L of the new category information matrix F' after label propagation with the label indication vector matrix V_L of the labeled data, so as to prevent contamination of the label information;
S700, checking whether the new category information matrix F' obtained in step S500 has converged; if F' has converged and no longer changes, proceeding to step S800, otherwise returning to step S500;
S800, obtaining the classification result of the original data set from the converged category information matrix F';
the data set in step S100 is one of the USPS, COIL1 and COIL2 data sets;
when constructing the graph, the method adds the category information of the data to the computation of the similarity between two data points;
the method defines a probability neighbor matrix and treats graph construction as a probability problem, so that the machine-learning classification result is more accurate.
2. The method according to claim 1, wherein the step S200 specifically includes:
S201, defining the original data set as X ∈ R^{n×d}, where R^{n×d} denotes a matrix of n rows and d columns, n is the number of data points, and d is the number of attributes of each data point; defining the labeled data matrix (with n1 rows) and the unlabeled data matrix (with n2 rows), where n1 is the number of labeled data points, n2 is the number of unlabeled data points, and n = n1 + n2;
S202, constructing the label indication vector matrix of the labeled data from the labels of the labeled data:
constructing a label indication vector matrix V ∈ R^{n×c} from the labels corresponding to the labeled data of the original data set, where n is the number of data points and c is the number of data classes; the ith row of V is the label indication vector of the ith data point, and if the class of the ith data point is j, the jth element of that row is 1 and all other elements of that row are 0; according to the order of the labeled data in the original data, the corresponding rows of V form the matrix V_L, which serves as the label indication vector matrix of the labeled data.
3. The method according to claim 2, wherein the step S300 specifically includes:
S301, defining an augmented matrix A ∈ R^{n×(d+c)} for the data, where d is the data dimension (the number of data attributes), c is the number of data classes, n is the number of data points, and A = [X, V];
S302, defining D ∈ R^{n×n} as the similarity matrix between data points, and computing the similarity d_{i,j} between data points i and j using the Euclidean distance as the measure, where d_{i,j} is the value in the ith row and jth column of the matrix D; each row of D is sorted in ascending order, X_i denotes the ith row of the original data set X, and X_j denotes the jth row of the original data set X;
S303, defining the probability neighbor matrix of the original data set as S ∈ R^{n×n}, an n×n matrix whose value S_{i,j} in the ith row and jth column represents the possibility that data points i and j become probability neighbors; defining K as the number of neighbors, selecting the first K entries of each row of D, as in the K-nearest-neighbor (KNN) method, to obtain the K nearest neighbors, and constructing the probability neighbors on this basis, as follows: p(i,k) is defined as the probability that data point i and its kth neighbor become probability neighbors, and is computed from the sorted distances in the ith row of D;
the entry of S corresponding to the kth neighbor of the ith data point, S(i,k), is then assigned this value, and every unassigned element of S is set to 0, giving the assigned probability neighbor matrix S; here D_{i,K+1} denotes the value in the ith row and (K+1)th column of the matrix D, D_{i,j} the value in the ith row and jth column, and D_{i,1} the value in the ith row and first column; n is the number of data points and X denotes the original data set.
4. A method according to claim 3, wherein said step S400 specifically comprises:
Constructing the category information matrix F of the original data set: F(0) = [F_L, 0], where F(0) denotes the initialized F matrix, F_L is the labeled-data block holding the label indication vectors of the labeled data, and the rows corresponding to unlabeled data are 0;
Constructing the probability transition matrix P of the original data set: initializing P to the assigned probability neighbor matrix S obtained in step S300, i.e., P(0) = S, where P ∈ R^{n×n} is the probability transition matrix, P(0) denotes the initialized P matrix, and P_{i,j}, the value in the ith row and jth column of P, indicates that the ith data point propagates its label to the jth data point with probability P_{i,j}.
5. The method according to claim 4, wherein the step S500 specifically includes:
Label propagation is performed by the formula F'(t+1) = P * F'(t), where the superscript t denotes the iteration number, F'(t) denotes the F' matrix obtained at the tth iteration, F'(t+1) denotes the F' matrix obtained at the (t+1)th iteration, and the formula describes the label propagation from iteration t to iteration t+1; when t = 0, F'(t) is the original category information matrix F.
6. The method according to claim 5, wherein the step S600 specifically includes:
Setting F_L = V_L, i.e., reassigning the labeled-data block F_L to the matrix V_L after each iteration of label propagation, where F_L denotes the labeled-data block of the category information matrix F and V_L denotes the label indication vector matrix of the labeled data.
7. The method according to claim 6, wherein in the step S800:
Each row of the converged category information matrix F' indicates the category of the corresponding data point: if F'_{i,j} is 1, the category of the ith data point is j, so the category of every data point can be obtained from the F' matrix, giving the classification result of the original data set.
8. The method according to claim 3, wherein in step S303 the number of neighbors K is selected to be 5-20.
CN201811598286.2A 2018-12-24 2018-12-24 Semi-supervised classification method based on probability nearest neighbor Active CN109829472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811598286.2A CN109829472B (en) 2018-12-24 2018-12-24 Semi-supervised classification method based on probability nearest neighbor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811598286.2A CN109829472B (en) 2018-12-24 2018-12-24 Semi-supervised classification method based on probability nearest neighbor

Publications (2)

Publication Number Publication Date
CN109829472A CN109829472A (en) 2019-05-31
CN109829472B true CN109829472B (en) 2024-05-14

Family

ID=66861165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811598286.2A Active CN109829472B (en) 2018-12-24 2018-12-24 Semi-supervised classification method based on probability nearest neighbor

Country Status (1)

Country Link
CN (1) CN109829472B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046914B (en) * 2019-11-20 2023-10-27 陕西师范大学 Semi-supervised classification method based on dynamic composition
CN111488923B (en) * 2020-04-03 2023-02-07 陕西师范大学 Enhanced anchor point image semi-supervised classification method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463202A (en) * 2014-11-28 2015-03-25 苏州大学 Multi-class image semi-supervised classifying method and system
CN104794489A (en) * 2015-04-23 2015-07-22 苏州大学 Deep label prediction based inducing type image classification method and system
CN105608471A (en) * 2015-12-28 2016-05-25 苏州大学 Robust transductive label estimation and data classification method and system
CN105608690A (en) * 2015-12-05 2016-05-25 陕西师范大学 Graph theory and semi supervised learning combination-based image segmentation method
CN105931253A (en) * 2016-05-16 2016-09-07 陕西师范大学 Image segmentation method combined with semi-supervised learning
CN106596900A (en) * 2016-12-13 2017-04-26 贵州电网有限责任公司电力科学研究院 Transformer fault diagnosis method based on improved semi-supervised classification of graph
CN108009571A (en) * 2017-11-16 2018-05-08 苏州大学 A kind of semi-supervised data classification method of new direct-push and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7756800B2 (en) * 2006-12-14 2010-07-13 Xerox Corporation Method for transforming data elements within a classification system based in part on input from a human annotator/expert

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463202A (en) * 2014-11-28 2015-03-25 苏州大学 Multi-class image semi-supervised classifying method and system
CN104794489A (en) * 2015-04-23 2015-07-22 苏州大学 Deep label prediction based inducing type image classification method and system
CN105608690A (en) * 2015-12-05 2016-05-25 陕西师范大学 Graph theory and semi supervised learning combination-based image segmentation method
CN105608471A (en) * 2015-12-28 2016-05-25 苏州大学 Robust transductive label estimation and data classification method and system
CN105931253A (en) * 2016-05-16 2016-09-07 陕西师范大学 Image segmentation method combined with semi-supervised learning
CN106596900A (en) * 2016-12-13 2017-04-26 贵州电网有限责任公司电力科学研究院 Transformer fault diagnosis method based on improved semi-supervised classification of graph
CN108009571A (en) * 2017-11-16 2018-05-08 苏州大学 A kind of semi-supervised data classification method of new direct-push and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A weighted K-nearest-neighbor algorithm based on linear neighborhood propagation; 王小攀; 马丽; 刘福江; Computer Engineering; 2013-07-15 (No. 07); full text *
Research on a patent text classification method based on probabilistic hypergraph semi-supervised learning; 刘桂锋; 汪满容; 刘海军; Journal of Intelligence; 2016-12-31 (No. 009); full text *
An ensemble self-training method based on neighbor density and semi-supervised KNN; 黎隽男; 吕佳; Computer Engineering and Applications; 2017-11-23 (No. 20); full text *
Minimax label propagation with adaptive neighbors; 田勋; 汪西莉; Journal of Chinese Computer Systems; 2017-12-31 (No. 011); full text *

Also Published As

Publication number Publication date
CN109829472A (en) 2019-05-31

Similar Documents

Publication Publication Date Title
CN111191732B (en) Target detection method based on full-automatic learning
CN104298778B (en) A kind of Forecasting Methodology and system of the steel rolling product quality based on correlation rule tree
CN112232413B (en) High-dimensional data feature selection method based on graph neural network and spectral clustering
CN105701509B (en) A kind of image classification method based on across classification migration Active Learning
CN114332649A (en) Cross-scene remote sensing image depth countermeasure transfer learning method based on dual-channel attention mechanism
Kumar et al. Modeling using K-means clustering algorithm
CN109829472B (en) Semi-supervised classification method based on probability nearest neighbor
CN111325264A (en) Multi-label data classification method based on entropy
CN112685504A (en) Production process-oriented distributed migration chart learning method
CN116383422B (en) Non-supervision cross-modal hash retrieval method based on anchor points
CN103455612A (en) Method for detecting non-overlapping network communities and overlapping network communities based on two-stage strategy
CN114373093A (en) Fine-grained image classification method based on direct-push type semi-supervised deep learning
CN112668633B (en) Adaptive graph migration learning method based on fine granularity field
CN103559510B (en) Method for recognizing social group behaviors through related topic model
CN114202671A (en) Image prediction optimization processing method and device
CN116894180B (en) Product manufacturing quality prediction method based on different composition attention network
CN111259176B (en) Cross-modal Hash retrieval method based on matrix decomposition and integrated with supervision information
CN111353525A (en) Modeling and missing value filling method for unbalanced incomplete data set
CN116226404A (en) Knowledge graph construction method and knowledge graph system for intestinal-brain axis
CN115758259A (en) Multi-source domain generalization equipment fault diagnosis method based on feature fusion
CN115995019A (en) Image classification method based on instance-dependent complementary tag learning
Essa et al. A comparison between the hierarchical clustering methods for postgraduate students in Iraqi universities for the year 2019-2020 using the cophenetic and delta correlation coefficients
Russell et al. upclass: An R Package for Updating Model-Based Classification Rules
Saha et al. Unsupervised and supervised learning approaches together for microarray analysis
CN113076976B (en) Small sample image classification method based on local feature relation exploration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant