CN112418142A - Feature selection method based on graph signal processing - Google Patents


Info

Publication number
CN112418142A
CN112418142A
Authority
CN
China
Prior art keywords
graph
data
sample
model
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011405315.6A
Other languages
Chinese (zh)
Inventor
蒋俊正
王薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202011405315.6A
Publication of CN112418142A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/08 Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods


Abstract

The invention discloses a feature selection method based on graph signal processing, comprising the following steps: 1) constructing a graph model; 2) calculating sample smoothness; 3) obtaining a dimension-reduction result. The method preserves the correlation among samples and improves the dimension-reduction effect.

Description

Feature selection method based on graph signal processing
Technical Field
The invention relates to the technical field of graph signal processing, in particular to a feature selection method based on graph signal processing.
Background
In recent years, with the advent of the big data era, large amounts of high-dimensional data, such as hyperspectral image data, face recognition data and gene expression data, can be obtained from many fields. When such data are processed and analyzed, their dimensionality is often too high for the analysis to be performed efficiently. To address this "curse of dimensionality", reducing the dimensionality of the data is a necessary step before using a high-dimensional data set.
Existing dimension-reduction techniques mainly fall into two classes: feature extraction and feature selection. Feature extraction maps data from a high-dimensional space into a low-dimensional space through a spatial transformation, as in locally linear embedding (LLE) and Laplacian eigenmaps (LE). Such methods preserve the correlation among the data, but redundant data can degrade the dimension-reduction result and thereby lower the classification accuracy, and because a data set may contain a large amount of redundancy, feature extraction can also overfit. Feature selection, in contrast, keeps the original data space unchanged and selects the data with the most salient features according to some criterion, such as correlation coefficients, distance measures, information gain or consistency; two typical algorithms are ReliefF and mRMR. These methods retain most features of the original data, but they usually ignore the correlation among the data. How to design a model that preserves the relations among the data while leaving the original features of the data unchanged is therefore a problem worth studying.
In recent years, graph signal processing theory has been continuously developed for processing and analyzing data on irregular domains, extending the Fourier transform, frequency analysis, sampling and filtering of classical signal processing to the graph setting. With this development, graph signal processing has been applied in many fields: Weiyu Huang et al. proposed a new framework for analyzing brain imaging data using graph signal processing; Leah Goldsberry et al. analyzed brain activity signals of different regions using graph theory; Diego Valnesia et al. constructed a convolutional neural network for image denoising; and Arman Hasanzadeh et al. proposed a new traffic prediction method using graph stationarity. In these applications, one key point is to model the association between data elements with the topology of a graph, preserving the correlation between data.
Disclosure of Invention
The aim of the invention is to provide a feature selection method based on graph signal processing that addresses the defects of the prior art. The method preserves the correlation among samples and improves the dimension-reduction effect.
The technical scheme for realizing the purpose of the invention is as follows:
a feature selection method based on graph signal processing comprises the following steps:
1) constructing a graph model: take each row of the high-dimensional data set X ∈ R^(M×N) as a node on the graph, i.e. each row X_i (i = 1, 2, ..., M) of the data set represents one node, and each column represents one sampling datum of all the nodes, giving N sampling data X^j (j = 1, 2, ..., N); the obtained sampling data are taken as graph signals, i.e. each column of the high-dimensional data set X is one graph signal, and the topological structure on the graph is constructed according to the similarity of the data between different samples, so that three different graph models between sample data are constructed, namely:
model 1: for the i-th and j-th nodes, the correlation is measured by the correlation coefficient between the sampling data X_i and X_j, and according to the magnitude of the correlation a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the correlation is computed by formula (1):

r(X_i, X_j) = E[(X_i - E(X_i))(X_j - E(X_j))] / sqrt(D(X_i) D(X_j)), i, j ∈ V    (1)

where E(X_i) and D(X_i) are respectively the expectation and the variance of X_i, and V is the set of sample points;
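As an illustrative sketch only (not part of the original disclosure), model 1 can be implemented roughly as follows with NumPy; using the absolute value of the correlation and symmetrizing the result are assumptions about details the text leaves open:

```python
import numpy as np

def knn_graph_from_correlation(X, k):
    """Model 1 sketch: kNN graph over the rows (nodes) of X.

    X: (M, N) array; each row is a node, each column a sampling datum.
    Edge weights are absolute Pearson correlations (formula (1));
    each node keeps edges only to its k most correlated neighbors.
    """
    M = X.shape[0]
    R = np.abs(np.corrcoef(X))        # pairwise row correlations
    np.fill_diagonal(R, 0.0)          # no self-loops
    W = np.zeros((M, M))
    for i in range(M):
        nbrs = np.argsort(R[i])[-k:]  # k largest correlations for node i
        W[i, nbrs] = R[i, nbrs]
    return np.maximum(W, W.T)         # symmetrize the adjacency matrix
```

The symmetrization step keeps the graph undirected, which the later Laplacian-based smoothness computation implicitly assumes.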
model 2: for each column of the high-dimensional data set X ∈ R^(M×N), generate one graph G_l (l = 1, 2, ..., N), i.e. one graph corresponds to one sampling datum of the nodes, forming N graphs; the correlation between node pairs in G_l is measured by the distance between the l-th sampling data of the i-th sample and of the j-th sample, and a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the distance between the sampling data of the samples is computed by formula (2):

D_l(i, j) = |X_i(l) - X_j(l)|, i, j ∈ V    (2)
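A minimal sketch of model 2, under the assumption that the affinity between nodes is a decreasing function exp(-D_l(i, j)) of the distance in formula (2) (the text only states that a smaller distance means a higher correlation; the exponential is an illustrative choice, not from the original):

```python
import numpy as np

def distance_graphs(X, k):
    """Model 2 sketch: one kNN graph per column (sampling datum) of X.

    For column l, the affinity between nodes i and j is exp(-|X[i,l]-X[j,l]|),
    so a smaller distance yields a larger weight.
    Returns a list of N symmetric (M, M) weight matrices.
    """
    M, N = X.shape
    graphs = []
    for l in range(N):
        col = X[:, l]
        D = np.abs(col[:, None] - col[None, :])  # pairwise distances, formula (2)
        A = np.exp(-D)                           # high weight = small distance
        np.fill_diagonal(A, 0.0)
        W = np.zeros((M, M))
        for i in range(M):
            nbrs = np.argsort(A[i])[-k:]         # k most correlated neighbors
            W[i, nbrs] = A[i, nbrs]
        graphs.append(np.maximum(W, W.T))        # symmetrize
    return graphs
```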
model 3: model 3 is a variant of model 2; unlike model 2, after model 3 generates one graph G_l (l = 1, 2, ..., N) for each column of the high-dimensional data set X ∈ R^(M×N) and the N graphs are formed, the Laplacian matrices of the N graphs are added and averaged, and in the averaged Laplacian matrix the weights of each node i, other than those of its k neighbor nodes with the largest weights, are set to 0; the result is taken as the Laplacian matrix of model 3;
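The averaging-and-sparsification step of model 3 can be sketched as follows; reading the averaged off-diagonal entries as negated edge weights and re-deriving a valid Laplacian from the sparsified graph is an assumed concrete interpretation of the description, not the original code:

```python
import numpy as np

def combined_laplacian(graphs, k):
    """Model 3 sketch: average the Laplacians of the per-column graphs,
    then keep, for every node, only its k largest off-diagonal weights
    (all other weights set to 0), as described for model 3.

    `graphs` is a list of symmetric weight matrices (e.g. from model 2).
    """
    M = graphs[0].shape[0]
    L_avg = np.zeros((M, M))
    for W in graphs:
        D = np.diag(W.sum(axis=1))
        L_avg += D - W                     # graph Laplacian L = D - W
    L_avg /= len(graphs)
    # Off-diagonal of an averaged Laplacian is minus the averaged adjacency.
    W_avg = -L_avg.copy()
    np.fill_diagonal(W_avg, 0.0)
    W_sparse = np.zeros_like(W_avg)
    for i in range(M):
        nbrs = np.argsort(W_avg[i])[-k:]   # k strongest neighbors of node i
        W_sparse[i, nbrs] = W_avg[i, nbrs]
    W_sparse = np.maximum(W_sparse, W_sparse.T)
    return np.diag(W_sparse.sum(axis=1)) - W_sparse
```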
2) calculating sample smoothness: according to the graph model obtained in step 1), the smoothness of the sampling data is calculated using the graph Laplacian matrix and the concept of signal smoothness from graph signal theory, where the graph Laplacian matrix is L = D - W, D is the degree matrix and W is the adjacency matrix, and the signal smoothness in graph signal theory is given by formula (3):

f^T L f = (1/2) Σ_{i,j∈V} W_ij (f_i - f_j)^2    (3)

where f = (f_1, f_2, ..., f_N)^T is a graph signal, L is the Laplacian matrix, W_ij is the weight of the edge between the graph signal values f_i and f_j, and V is the sample node set.
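Formula (3) can be checked numerically in a few lines; the helper below simply evaluates the quadratic form f^T L f and, on a small example, agrees with the weighted edge-sum on the right-hand side of formula (3):

```python
import numpy as np

def smoothness(f, L):
    """Signal smoothness f^T L f from formula (3); smaller = smoother."""
    return float(f @ L @ f)
```

A constant signal is perfectly smooth (smoothness 0), since every edge difference f_i - f_j vanishes.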
3) obtaining the dimension-reduction result: sort the calculation results of step 2) in descending order, and select the sampling data corresponding to the largest first d values as the dimension-reduction result, where d is the dimension obtained after reducing the high-dimensional data set, i.e. the reduced data set is X ∈ R^(M×d).
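Step 3 reduces to a sort-and-select over the per-column smoothness scores; a sketch follows (the names select_features and scores are illustrative, not from the original):

```python
import numpy as np

def select_features(X, L, d):
    """Step 3 sketch: score each column of X by its smoothness f^T L f,
    sort in descending order, and keep the d columns with the largest
    scores (largest fluctuation), giving a reduced set of shape (M, d)."""
    scores = np.array([c @ L @ c for c in X.T])
    top = np.argsort(scores)[::-1][:d]   # indices of the d largest scores
    return X[:, top], top
```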
Compared with existing feature extraction and feature selection methods, this technical scheme uses the concept of signal smoothness from graph signal processing theory to reduce the dimensionality of the data: it retains the features of the original data, reduces the influence of irrelevant data or noise on the dimension-reduction result, and, by establishing a network topology between data samples, preserves the correlation between the samples.
The method can keep the correlation among the samples and improve the dimension reduction effect.
Drawings
FIG. 1 is a schematic flow chart of an exemplary method.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
Implementation example:
referring to fig. 1, a feature selection method based on graph signal processing includes the following steps:
1) constructing a graph model: take each row of the high-dimensional data set X ∈ R^(M×N) as a node on the graph, i.e. each row X_i (i = 1, 2, ..., M) of the data set represents one node, and each column represents one sampling datum of all the nodes, giving N sampling data X^j (j = 1, 2, ..., N); the obtained sampling data are taken as graph signals, i.e. each column of the high-dimensional data set X is one graph signal, and the topological structure on the graph is constructed according to the similarity of the data between different samples, so that three different graph models between sample data are constructed, namely:
model 1: for the i-th and j-th nodes, the correlation is measured by the correlation coefficient between the sampling data X_i and X_j, and according to the magnitude of the correlation a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the correlation is computed by formula (1):

r(X_i, X_j) = E[(X_i - E(X_i))(X_j - E(X_j))] / sqrt(D(X_i) D(X_j)), i, j ∈ V    (1)

where E(X_i) and D(X_i) are respectively the expectation and the variance of X_i, and V is the set of sample points;
model 2: for each column of the high-dimensional data set X ∈ R^(M×N), generate one graph G_l (l = 1, 2, ..., N), i.e. one graph corresponds to one sampling datum of the nodes, forming N graphs; the correlation between node pairs in G_l is measured by the distance between the l-th sampling data of the i-th sample and of the j-th sample, and a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the distance between the sampling data of the samples is computed by formula (2):

D_l(i, j) = |X_i(l) - X_j(l)|, i, j ∈ V    (2)
model 3: model 3 is a variant of model 2; unlike model 2, after model 3 generates one graph G_l (l = 1, 2, ..., N) for each column of the high-dimensional data set X ∈ R^(M×N) and the N graphs are formed, the Laplacian matrices of the N graphs are added and averaged, and in the averaged Laplacian matrix the weights of each node i, other than those of its k neighbor nodes with the largest weights, are set to 0; the result is taken as the Laplacian matrix of model 3;
2) calculating sample smoothness: according to the graph model obtained in step 1), the smoothness of the sampling data is calculated using the graph Laplacian matrix and the concept of signal smoothness from graph signal theory, where the graph Laplacian matrix is L = D - W, D is the degree matrix and W is the adjacency matrix, and the signal smoothness in graph signal theory is given by formula (3):

f^T L f = (1/2) Σ_{i,j∈V} W_ij (f_i - f_j)^2    (3)

where f = (f_1, f_2, ..., f_N)^T is a graph signal, L is the Laplacian matrix, W_ij is the weight of the edge between the graph signal values f_i and f_j, and V is the sample node set; for each sample, N sampling data are obtained, that is, each column of the data set corresponds to one graph signal in the graph model, so N smoothness values f^T L f can be computed; in general, the smaller f^T L f, the smoother the signal on the graph, and conversely, the larger the value, the greater the signal fluctuation;
3) obtaining the dimension-reduction result: sort the calculation results of step 2) in descending order and select the sampling data corresponding to the largest first d values as the dimension-reduction result. Taking a cancer gene data set as an example, distinguishing a patient from a healthy person depends on a few mutant genes; compared with the gene data of a healthy person, which tend to be smooth, the gene data of a patient may fluctuate strongly, and this can be captured by the smoothness measure of formula (3). Therefore the calculation results of step 2) are sorted in descending order and the sampling data corresponding to the largest first d values are selected as the dimension-reduction result, where d is the dimension obtained after reducing the high-dimensional data set, i.e. the reduced data set is X ∈ R^(M×d).
The present example takes four cancer gene expression data sets as examples, specifically:
establishing a graph model G (V, E, W) among patient samples of the high-dimensional data set, and establishing X E R for the high-dimensional data setM×NTaking a data sample as a graph node, taking sample data of the sample as a graph signal, constructing a topological structure on the graph according to the similarity of the sample data among different samples, and establishing three different graph models, wherein in the constructed graph models, G (V, E, W), V (1, 2,.. N) is a graph node set which represents a sample in a high-dimensional data set, and E (E)ijIs the set of the upper side of the graph, eijIndicating that there is an edge connection between node i and node j, W is the weighted adjacency matrix if WijNot equal to 0, the node i and the node j are connected by edges, and the weight is WijOtherwise WijThe degree matrix D is a diagonal matrix with elements D on the diagonal of the diagonal being 0iiEqual to the sum of the elements of row i of the weighted adjacency matrix W, i.e.
Figure BDA0002818383100000042
The laplacian matrix is L ═ D-W,
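The degree matrix and Laplacian defined above translate directly to code; a minimal helper:

```python
import numpy as np

def degree_and_laplacian(W):
    """Degree matrix D (D_ii = sum_j W_ij) and Laplacian L = D - W
    for a weighted adjacency matrix W, as defined in the text."""
    D = np.diag(W.sum(axis=1))
    return D, D - W
```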
model 1 employs a nearest neighbor graph (NNG) model; therefore, the method of model 1 is named the NNG-based feature selection algorithm (NNG-FS);
model 2: for each column of the high-dimensional data set X ∈ R^(M×N), generate one graph G_l (l = 1, 2, ..., N), i.e. one graph corresponds to one sampling datum, forming N graphs; the correlation between node pairs in G_l is measured by the distance D_l(i, j) between the l-th sampling data of the i-th and j-th samples: the smaller D_l(i, j), the higher the correlation between the nodes; according to the magnitude of the correlation, a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values; model 2 uses multiple distance graphs (MDGs), i.e. graphs whose weights are characterized by the distance between sampling data, so the method of model 2 is called the MDG-based feature selection method (MDG-FS);
model 3: for each column of the high-dimensional data set X ∈ R^(M×N), model 3 generates one graph G_l (l = 1, 2, ..., N); after the N graphs are formed, the obtained Laplacian matrices of the N graphs are added and averaged, and in the averaged Laplacian matrix the weights of each node i, other than those of its k neighbor nodes with the largest weights, are set to 0, the result being taken as the Laplacian matrix of model 3; similarly, since model 3 uses a combined distance graph (CDG), its method is called the CDG-based feature selection method (CDG-FS);
simulation example:
the method of this example was applied to four cancer gene datasets downloaded from a common gene database and compared to LLE and PCA, the information for the datasets being listed in table 1:
TABLE 1 Gene data set information
(table reproduced only as an image in the original document)
In the simulation experiments, the parameters of the method of this example and of LLE are taken as k = 3 and d = 20, and the PCA parameter is set to d = 20; the resulting feature vectors are then fed to the Naive Bayes, support vector machine (SVM), Random Forest, KNN and Discriminant Analysis classifiers built into MATLAB to evaluate the dimension-reduction effect, with a training-to-test-set ratio of 1:2. The classification accuracies of the method of this example, LLE and PCA on the four cancer data sets are shown in Tables 2, 3, 4 and 5.
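The evaluation protocol (1:2 train/test split, KNN classification) can be sketched as below; this is a plain NumPy stand-in for one of the MATLAB classifiers named in the text, not the original experimental code:

```python
import numpy as np

def knn_classify(train_X, train_y, test_X, k=3):
    """Minimal k-nearest-neighbor classifier (Euclidean distance), an
    illustrative stand-in for the MATLAB KNN classifier used in the text."""
    preds = []
    for x in test_X:
        d = np.linalg.norm(train_X - x, axis=1)        # distances to training set
        nbrs = train_y[np.argsort(d)[:k]]              # labels of k nearest
        vals, counts = np.unique(nbrs, return_counts=True)
        preds.append(vals[np.argmax(counts)])          # majority vote
    return np.array(preds)

def accuracy(y_true, y_pred):
    """Fraction of correctly classified test samples."""
    return float(np.mean(y_true == y_pred))
```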
TABLE 2 Classification accuracy of Prostate Tumor datasets
(table reproduced only as an image in the original document)
TABLE 3 Classification accuracy of Brain Tumor datasets
(table reproduced only as an image in the original document)
TABLE 4 Classification accuracy of Gastric data sets
(table reproduced only as an image in the original document)
TABLE 5 Classification accuracy of Lung Cancer data set
(table reproduced only as an image in the original document)
From Tables 2 to 5 it can be seen that, on the Prostate Tumor, Brain Tumor and Gastric Cancer data sets, the classification performance of the method of this example is significantly better than that of LLE and PCA, although on Prostate Tumor, PCA with a KNN classifier performs best, and LLE and PCA stand out on Lung Cancer; overall, the method of this example outperforms the comparison algorithms. In addition, the three algorithms behave differently on different classes of data set: the CDG-FS algorithm handles brain tumors better than the other two methods, while the NNG-FS algorithm performs best on gastric cancer. Moreover, the gap in classification accuracy between algorithms can be large; for example, the classification accuracy of MDG-FS is at least 7.25% higher than that of the other two algorithms. The experimental results show that, since the methods use different graph models, the graph model is the key to achieving good classification performance.

Claims (1)

1. A feature selection method based on graph signal processing is characterized by comprising the following steps:
1) constructing a graph model: take each row of the high-dimensional data set X ∈ R^(M×N) as a node on the graph, i.e. each row X_i (i = 1, 2, ..., M) of the data set represents one node, and each column represents one sampling datum of all the nodes, giving N sampling data X^j (j = 1, 2, ..., N); the obtained sampling data are taken as graph signals, i.e. each column of the high-dimensional data set X is one graph signal, and the topological structure on the graph is constructed according to the similarity of the data between different samples, so that three different graph models between sample data are constructed, namely:
model 1: for the i-th and j-th nodes, the correlation is measured by the correlation coefficient between the sampling data X_i and X_j, and according to the magnitude of the correlation a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the correlation is computed by formula (1):

r(X_i, X_j) = E[(X_i - E(X_i))(X_j - E(X_j))] / sqrt(D(X_i) D(X_j)), i, j ∈ V    (1)

where E(X_i) and D(X_i) are respectively the expectation and the variance of X_i, and V is the set of sample points;
model 2: for each column of the high-dimensional data set X ∈ R^(M×N), generate one graph G_l (l = 1, 2, ..., N), i.e. one graph corresponds to one sampling datum of the nodes, forming N graphs; the correlation between node pairs in G_l is measured by the distance between the l-th sampling data of the i-th sample and of the j-th sample, and a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the distance between the sampling data of the samples is computed by formula (2):

D_l(i, j) = |X_i(l) - X_j(l)|, i, j ∈ V    (2)
model 3: for each column of the high-dimensional data set X ∈ R^(M×N), model 3 generates one graph G_l (l = 1, 2, ..., N); after the N graphs are formed, the obtained Laplacian matrices of the N graphs are added and averaged, and in the averaged Laplacian matrix the weights of each node i, other than those of its k neighbor nodes with the largest weights, are set to 0; the result is taken as the Laplacian matrix of model 3;
2) calculating sample smoothness: according to the graph model obtained in step 1), the smoothness of the sampling data is calculated using the graph Laplacian matrix and the concept of signal smoothness from graph signal theory, where the graph Laplacian matrix is L = D - W, D is the degree matrix and W is the adjacency matrix, and the signal smoothness in graph signal theory is given by formula (3):

f^T L f = (1/2) Σ_{i,j∈V} W_ij (f_i - f_j)^2    (3)

where f = (f_1, f_2, ..., f_N)^T is a graph signal, L is the Laplacian matrix, W_ij is the weight of the edge between the graph signal values f_i and f_j, and V is the sample node set;
3) obtaining the dimension-reduction result: sort the calculation results of step 2) in descending order, and select the sampling data corresponding to the largest first d values as the dimension-reduction result, where d is the dimension obtained after reducing the high-dimensional data set, i.e. the reduced data set is X ∈ R^(M×d).
CN202011405315.6A 2020-12-04 2020-12-04 Feature selection method based on graph signal processing Pending CN112418142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011405315.6A CN112418142A (en) 2020-12-04 2020-12-04 Feature selection method based on graph signal processing


Publications (1)

Publication Number Publication Date
CN112418142A 2021-02-26

Family

ID=74830169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011405315.6A Pending CN112418142A (en) 2020-12-04 2020-12-04 Feature selection method based on graph signal processing

Country Status (1)

Country Link
CN (1) CN112418142A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779882A (en) * 2021-09-10 2021-12-10 中国石油大学(北京) Method, device, equipment and storage medium for predicting residual service life of equipment
CN113779882B (en) * 2021-09-10 2024-05-17 中国石油大学(北京) Method, device, equipment and storage medium for predicting residual service life of equipment

Similar Documents

Publication Publication Date Title
CN109409416B (en) Feature vector dimension reduction method, medical image identification method, device and storage medium
Montazer et al. An improved radial basis function neural network for object image retrieval
CN110929029A (en) Text classification method and system based on graph convolution neural network
CN107133496B (en) Gene feature extraction method based on manifold learning and closed-loop deep convolution double-network model
KR101687217B1 (en) Robust face recognition pattern classifying method using interval type-2 rbf neural networks based on cencus transform method and system for executing the same
CN110866439B (en) Hyperspectral image joint classification method based on multi-feature learning and super-pixel kernel sparse representation
CN108960341B (en) Brain network-oriented structural feature selection method
CN108921853B (en) Image segmentation method based on super-pixel and immune sparse spectral clustering
CN110910325B (en) Medical image processing method and device based on artificial butterfly optimization algorithm
CN107578063B (en) Image Spectral Clustering based on fast selecting landmark point
CN111611293A (en) Outlier data mining method based on feature weighting and MapReduce
CN108388869B (en) Handwritten data classification method and system based on multiple manifold
CN112418142A (en) Feature selection method based on graph signal processing
Abboud et al. Biometric templates selection and update using quality measures
CN112233742B (en) Medical record document classification system, equipment and storage medium based on clustering
Baswade et al. A comparative study of k-means and weighted k-means for clustering
CN113793667A (en) Disease prediction method and device based on cluster analysis and computer equipment
Biryukova et al. Development of the effective set of features construction technology for texture image classes discrimination
CN112287036A (en) Outlier detection method based on spectral clustering
CN105760872B (en) A kind of recognition methods and system based on robust image feature extraction
Little et al. An analysis of classical multidimensional scaling
CN109063766B (en) Image classification method based on discriminant prediction sparse decomposition model
CN108256569B (en) Object identification method under complex background and used computer technology
CN114037931B (en) Multi-view discriminating method of self-adaptive weight
Hoti et al. A semiparametric density estimation approach to pattern classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210226