CN112418142A - Feature selection method based on graph signal processing - Google Patents


Info

Publication number
CN112418142A
CN112418142A
Authority
CN
China
Prior art keywords
graph
data
sample
model
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011405315.6A
Other languages
Chinese (zh)
Inventor
蒋俊正
王薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202011405315.6A
Publication of CN112418142A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/08 Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods


Abstract

The invention discloses a feature selection method based on graph signal processing, comprising the following steps: 1) constructing a graph model; 2) calculating sample smoothness; 3) obtaining a dimension-reduction result. The method preserves the correlation among samples and improves the dimension-reduction effect.

Description

Feature selection method based on graph signal processing
Technical Field
The invention relates to the technical field of graph signal processing, in particular to a feature selection method based on graph signal processing.
Background
In recent years, with the advent of the big data era, large amounts of high-dimensional data, such as hyperspectral image data, face recognition data and gene expression data, can be obtained from many fields. When such data are processed and analyzed, their dimensionality is often too high for the analysis to be performed efficiently. To address this "curse of dimensionality", reducing the dimensionality of the data is a necessary step before using a high-dimensional data set.
Existing dimension-reduction techniques mainly fall into two classes: feature extraction and feature selection. Feature extraction maps data from a high-dimensional space into a low-dimensional space through a spatial transformation, as in locally linear embedding (LLE) and Laplacian eigenmaps (LE). Such methods preserve the correlation among the data, but redundant data can degrade the dimension-reduction result and thereby lower the classification accuracy, and because a data set may contain a large amount of redundancy, feature extraction can also overfit. Feature selection, in contrast, keeps the original data space unchanged and selects the data with the most salient features according to some criterion, such as correlation coefficients, distance measures, information gain or consistency; two typical algorithms are ReliefF and mRMR. These methods retain most features of the original data, but they usually ignore the correlation among the data. How to design a model that preserves the relations among the data while leaving the original features of the data unchanged is therefore a problem worth studying.
In recent years, graph signal processing theory has been continuously developed for processing and analyzing data on irregular domains, extending the Fourier transform, frequency analysis, sampling and filtering of classical signal processing to the graph setting. With this development, graph signal processing has been applied in many fields: Weiyu Huang et al. proposed a new framework for analyzing brain imaging data using graph signal processing; Leah Goldsberry et al. analyzed brain activity signals of different regions using graph theory; Diego Valnesia et al. constructed a convolutional neural network for image denoising; and Arman Hasanzadeh et al. proposed a new traffic prediction method using graph stationarity. In these applications, one key point is to model the association between data elements with the topology of a graph, preserving the correlation between data.
Disclosure of Invention
The aim of the invention is to provide a feature selection method based on graph signal processing that addresses the defects of the prior art. The method preserves the correlation among samples and improves the dimension-reduction effect.
The technical scheme for realizing the purpose of the invention is as follows:
a feature selection method based on graph signal processing comprises the following steps:
1) constructing a graph model: take each row of the high-dimensional data set X ∈ R^(M×N) as a node on the graph, i.e. each row X_i (i = 1, 2, ..., M) of the data set represents one node, and each column represents one sampling datum of all the nodes, giving N sampling data X^j (j = 1, 2, ..., N); the obtained sampling data are taken as graph signals, i.e. each column of the high-dimensional data set X is one graph signal, and the topological structure on the graph is constructed according to the similarity of the data between different samples, so that three different graph models between sample data are constructed, namely:
model 1: for the i-th and j-th nodes, the correlation is measured by the correlation coefficient between the sampling data X_i and X_j, and according to the magnitude of the correlation a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the correlation is computed by formula (1):

r(X_i, X_j) = E[(X_i - E(X_i))(X_j - E(X_j))] / sqrt(D(X_i) D(X_j)), i, j ∈ V    (1)

where E(X_i) and D(X_i) are respectively the expectation and the variance of X_i, and V is the set of sample points;
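As an illustrative sketch only (not part of the original disclosure), model 1 can be implemented roughly as follows with NumPy; using the absolute value of the correlation and symmetrizing the result are assumptions about details the text leaves open:

```python
import numpy as np

def knn_graph_from_correlation(X, k):
    """Model 1 sketch: kNN graph over the rows (nodes) of X.

    X: (M, N) array; each row is a node, each column a sampling datum.
    Edge weights are absolute Pearson correlations (formula (1));
    each node keeps edges only to its k most correlated neighbors.
    """
    M = X.shape[0]
    R = np.abs(np.corrcoef(X))        # pairwise row correlations
    np.fill_diagonal(R, 0.0)          # no self-loops
    W = np.zeros((M, M))
    for i in range(M):
        nbrs = np.argsort(R[i])[-k:]  # k largest correlations for node i
        W[i, nbrs] = R[i, nbrs]
    return np.maximum(W, W.T)         # symmetrize the adjacency matrix
```

The symmetrization step keeps the graph undirected, which the later Laplacian-based smoothness computation implicitly assumes.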
model 2: for each column of the high-dimensional data set X ∈ R^(M×N), generate one graph G_l (l = 1, 2, ..., N), i.e. one graph corresponds to one sampling datum of the nodes, forming N graphs; the correlation between node pairs in G_l is measured by the distance between the l-th sampling data of the i-th sample and of the j-th sample, and a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the distance between the sampling data of the samples is computed by formula (2):

D_l(i, j) = |X_i(l) - X_j(l)|, i, j ∈ V    (2)
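A minimal sketch of model 2, under the assumption that the affinity between nodes is a decreasing function exp(-D_l(i, j)) of the distance in formula (2) (the text only states that a smaller distance means a higher correlation; the exponential is an illustrative choice, not from the original):

```python
import numpy as np

def distance_graphs(X, k):
    """Model 2 sketch: one kNN graph per column (sampling datum) of X.

    For column l, the affinity between nodes i and j is exp(-|X[i,l]-X[j,l]|),
    so a smaller distance yields a larger weight.
    Returns a list of N symmetric (M, M) weight matrices.
    """
    M, N = X.shape
    graphs = []
    for l in range(N):
        col = X[:, l]
        D = np.abs(col[:, None] - col[None, :])  # pairwise distances, formula (2)
        A = np.exp(-D)                           # high weight = small distance
        np.fill_diagonal(A, 0.0)
        W = np.zeros((M, M))
        for i in range(M):
            nbrs = np.argsort(A[i])[-k:]         # k most correlated neighbors
            W[i, nbrs] = A[i, nbrs]
        graphs.append(np.maximum(W, W.T))        # symmetrize
    return graphs
```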
model 3: model 3 is a variant of model 2; unlike model 2, after model 3 generates one graph G_l (l = 1, 2, ..., N) for each column of the high-dimensional data set X ∈ R^(M×N) and the N graphs are formed, the Laplacian matrices of the N graphs are added and averaged, and in the averaged Laplacian matrix the weights of each node i, other than those of its k neighbor nodes with the largest weights, are set to 0; the result is taken as the Laplacian matrix of model 3;
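The averaging-and-sparsification step of model 3 can be sketched as follows; reading the averaged off-diagonal entries as negated edge weights and re-deriving a valid Laplacian from the sparsified graph is an assumed concrete interpretation of the description, not the original code:

```python
import numpy as np

def combined_laplacian(graphs, k):
    """Model 3 sketch: average the Laplacians of the per-column graphs,
    then keep, for every node, only its k largest off-diagonal weights
    (all other weights set to 0), as described for model 3.

    `graphs` is a list of symmetric weight matrices (e.g. from model 2).
    """
    M = graphs[0].shape[0]
    L_avg = np.zeros((M, M))
    for W in graphs:
        D = np.diag(W.sum(axis=1))
        L_avg += D - W                     # graph Laplacian L = D - W
    L_avg /= len(graphs)
    # Off-diagonal of an averaged Laplacian is minus the averaged adjacency.
    W_avg = -L_avg.copy()
    np.fill_diagonal(W_avg, 0.0)
    W_sparse = np.zeros_like(W_avg)
    for i in range(M):
        nbrs = np.argsort(W_avg[i])[-k:]   # k strongest neighbors of node i
        W_sparse[i, nbrs] = W_avg[i, nbrs]
    W_sparse = np.maximum(W_sparse, W_sparse.T)
    return np.diag(W_sparse.sum(axis=1)) - W_sparse
```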
2) calculating sample smoothness: according to the graph model obtained in step 1), the smoothness of the sampling data is calculated using the graph Laplacian matrix and the concept of signal smoothness from graph signal theory, where the graph Laplacian matrix is L = D - W, D is the degree matrix and W is the adjacency matrix, and the signal smoothness in graph signal theory is given by formula (3):

f^T L f = (1/2) Σ_{i,j∈V} W_ij (f_i - f_j)^2    (3)

where f = (f_1, f_2, ..., f_N)^T is a graph signal, L is the Laplacian matrix, W_ij is the weight of the edge between the graph signal values f_i and f_j, and V is the sample node set.
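Formula (3) can be checked numerically in a few lines; the helper below simply evaluates the quadratic form f^T L f and, on a small example, agrees with the weighted edge-sum on the right-hand side of formula (3):

```python
import numpy as np

def smoothness(f, L):
    """Signal smoothness f^T L f from formula (3); smaller = smoother."""
    return float(f @ L @ f)
```

A constant signal is perfectly smooth (smoothness 0), since every edge difference f_i - f_j vanishes.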
3) obtaining the dimension-reduction result: sort the calculation results of step 2) in descending order, and select the sampling data corresponding to the largest first d values as the dimension-reduction result, where d is the dimension obtained after reducing the high-dimensional data set, i.e. the reduced data set is X ∈ R^(M×d).
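Step 3 reduces to a sort-and-select over the per-column smoothness scores; a sketch follows (the names select_features and scores are illustrative, not from the original):

```python
import numpy as np

def select_features(X, L, d):
    """Step 3 sketch: score each column of X by its smoothness f^T L f,
    sort in descending order, and keep the d columns with the largest
    scores (largest fluctuation), giving a reduced set of shape (M, d)."""
    scores = np.array([c @ L @ c for c in X.T])
    top = np.argsort(scores)[::-1][:d]   # indices of the d largest scores
    return X[:, top], top
```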
Compared with existing feature extraction and feature selection methods, this technical scheme uses the concept of signal smoothness from graph signal processing theory to reduce the dimensionality of the data: it retains the features of the original data, reduces the influence of irrelevant data or noise on the dimension-reduction result, and, by establishing a network topology between data samples, preserves the correlation between the samples.
The method can keep the correlation among the samples and improve the dimension reduction effect.
Drawings
FIG. 1 is a schematic flow chart of an exemplary method.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
Implementation example:
referring to fig. 1, a feature selection method based on graph signal processing includes the following steps:
1) constructing a graph model: take each row of the high-dimensional data set X ∈ R^(M×N) as a node on the graph, i.e. each row X_i (i = 1, 2, ..., M) of the data set represents one node, and each column represents one sampling datum of all the nodes, giving N sampling data X^j (j = 1, 2, ..., N); the obtained sampling data are taken as graph signals, i.e. each column of the high-dimensional data set X is one graph signal, and the topological structure on the graph is constructed according to the similarity of the data between different samples, so that three different graph models between sample data are constructed, namely:
model 1: for the i-th and j-th nodes, the correlation is measured by the correlation coefficient between the sampling data X_i and X_j, and according to the magnitude of the correlation a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the correlation is computed by formula (1):

r(X_i, X_j) = E[(X_i - E(X_i))(X_j - E(X_j))] / sqrt(D(X_i) D(X_j)), i, j ∈ V    (1)

where E(X_i) and D(X_i) are respectively the expectation and the variance of X_i, and V is the set of sample points;
model 2: for each column of the high-dimensional data set X ∈ R^(M×N), generate one graph G_l (l = 1, 2, ..., N), i.e. one graph corresponds to one sampling datum of the nodes, forming N graphs; the correlation between node pairs in G_l is measured by the distance between the l-th sampling data of the i-th sample and of the j-th sample, and a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the distance between the sampling data of the samples is computed by formula (2):

D_l(i, j) = |X_i(l) - X_j(l)|, i, j ∈ V    (2)
model 3: model 3 is a variant of model 2; unlike model 2, after model 3 generates one graph G_l (l = 1, 2, ..., N) for each column of the high-dimensional data set X ∈ R^(M×N) and the N graphs are formed, the Laplacian matrices of the N graphs are added and averaged, and in the averaged Laplacian matrix the weights of each node i, other than those of its k neighbor nodes with the largest weights, are set to 0; the result is taken as the Laplacian matrix of model 3;
2) calculating sample smoothness: according to the graph model obtained in step 1), the smoothness of the sampling data is calculated using the graph Laplacian matrix and the concept of signal smoothness from graph signal theory, where the graph Laplacian matrix is L = D - W, D is the degree matrix and W is the adjacency matrix, and the signal smoothness in graph signal theory is given by formula (3):

f^T L f = (1/2) Σ_{i,j∈V} W_ij (f_i - f_j)^2    (3)

where f = (f_1, f_2, ..., f_N)^T is a graph signal, L is the Laplacian matrix, W_ij is the weight of the edge between the graph signal values f_i and f_j, and V is the sample node set; for each sample, N sampling data are obtained, that is, each column of the data set corresponds to one graph signal in the graph model, so N smoothness values f^T L f can be computed; in general, the smaller f^T L f, the smoother the signal on the graph, and conversely, the larger the value, the greater the signal fluctuation;
3) obtaining the dimension-reduction result: sort the calculation results of step 2) in descending order and select the sampling data corresponding to the largest first d values as the dimension-reduction result. Taking a cancer gene data set as an example, distinguishing a patient from a healthy person depends on a few mutant genes; compared with the gene data of a healthy person, which tend to be smooth, the gene data of a patient may fluctuate strongly, and this can be captured by the smoothness measure of formula (3). Therefore the calculation results of step 2) are sorted in descending order and the sampling data corresponding to the largest first d values are selected as the dimension-reduction result, where d is the dimension obtained after reducing the high-dimensional data set, i.e. the reduced data set is X ∈ R^(M×d).
The present example takes four cancer gene expression data sets as examples, specifically:
establishing a graph model G (V, E, W) among patient samples of the high-dimensional data set, and establishing X E R for the high-dimensional data setM×NTaking a data sample as a graph node, taking sample data of the sample as a graph signal, constructing a topological structure on the graph according to the similarity of the sample data among different samples, and establishing three different graph models, wherein in the constructed graph models, G (V, E, W), V (1, 2,.. N) is a graph node set which represents a sample in a high-dimensional data set, and E (E)ijIs the set of the upper side of the graph, eijIndicating that there is an edge connection between node i and node j, W is the weighted adjacency matrix if WijNot equal to 0, the node i and the node j are connected by edges, and the weight is WijOtherwise WijThe degree matrix D is a diagonal matrix with elements D on the diagonal of the diagonal being 0iiEqual to the sum of the elements of row i of the weighted adjacency matrix W, i.e.
Figure BDA0002818383100000042
The laplacian matrix is L ═ D-W,
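The degree matrix and Laplacian defined above translate directly to code; a minimal helper:

```python
import numpy as np

def degree_and_laplacian(W):
    """Degree matrix D (D_ii = sum_j W_ij) and Laplacian L = D - W
    for a weighted adjacency matrix W, as defined in the text."""
    D = np.diag(W.sum(axis=1))
    return D, D - W
```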
model 1 employs a nearest neighbor graph (NNG) model; therefore, the method of model 1 is named the NNG-based feature selection algorithm (NNG-FS);
model 2: for each column of the high-dimensional data set X ∈ R^(M×N), generate one graph G_l (l = 1, 2, ..., N), i.e. one graph corresponds to one sampling datum, forming N graphs; the correlation between node pairs in G_l is measured by the distance D_l(i, j) between the l-th sampling data of the i-th and j-th samples: the smaller D_l(i, j), the higher the correlation between the nodes; according to the magnitude of the correlation, a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values; model 2 uses multiple distance graphs (MDGs), i.e. graphs whose weights are characterized by the distance between sampling data, so the method of model 2 is called the MDG-based feature selection method (MDG-FS);
model 3: for each column of the high-dimensional data set X ∈ R^(M×N), model 3 generates one graph G_l (l = 1, 2, ..., N); after the N graphs are formed, the obtained Laplacian matrices of the N graphs are added and averaged, and in the averaged Laplacian matrix the weights of each node i, other than those of its k neighbor nodes with the largest weights, are set to 0, the result being taken as the Laplacian matrix of model 3; similarly, since model 3 uses a combined distance graph (CDG), its method is called the CDG-based feature selection method (CDG-FS);
simulation example:
the method of this example was applied to four cancer gene datasets downloaded from a common gene database and compared to LLE and PCA, the information for the datasets being listed in table 1:
TABLE 1 Gene data set information
(table reproduced only as an image in the original document)
In the simulation experiments, the parameters of the method of this example and of LLE are taken as k = 3 and d = 20, and the PCA parameter is set to d = 20; the resulting feature vectors are then fed to the Naive Bayes, support vector machine (SVM), Random Forest, KNN and Discriminant Analysis classifiers built into MATLAB to evaluate the dimension-reduction effect, with a training-to-test-set ratio of 1:2. The classification accuracies of the method of this example, LLE and PCA on the four cancer data sets are shown in Tables 2, 3, 4 and 5.
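The evaluation protocol (1:2 train/test split, KNN classification) can be sketched as below; this is a plain NumPy stand-in for one of the MATLAB classifiers named in the text, not the original experimental code:

```python
import numpy as np

def knn_classify(train_X, train_y, test_X, k=3):
    """Minimal k-nearest-neighbor classifier (Euclidean distance), an
    illustrative stand-in for the MATLAB KNN classifier used in the text."""
    preds = []
    for x in test_X:
        d = np.linalg.norm(train_X - x, axis=1)        # distances to training set
        nbrs = train_y[np.argsort(d)[:k]]              # labels of k nearest
        vals, counts = np.unique(nbrs, return_counts=True)
        preds.append(vals[np.argmax(counts)])          # majority vote
    return np.array(preds)

def accuracy(y_true, y_pred):
    """Fraction of correctly classified test samples."""
    return float(np.mean(y_true == y_pred))
```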
TABLE 2 Classification accuracy of Prostate Tumor datasets
(table reproduced only as an image in the original document)
TABLE 3 Classification accuracy of Brain Tumor datasets
(table reproduced only as an image in the original document)
TABLE 4 Classification accuracy of Gastric data sets
(table reproduced only as an image in the original document)
TABLE 5 Classification accuracy of Lung Cancer data set
(table reproduced only as an image in the original document)
From Tables 2 to 5 it can be seen that, on the Prostate Tumor, Brain Tumor and Gastric Cancer data sets, the classification performance of the method of this example is significantly better than that of LLE and PCA, although on Prostate Tumor, PCA with a KNN classifier performs best, and LLE and PCA stand out on Lung Cancer; overall, the method of this example outperforms the comparison algorithms. In addition, the three algorithms behave differently on different classes of data set: the CDG-FS algorithm handles brain tumors better than the other two methods, while the NNG-FS algorithm performs best on gastric cancer. Moreover, the gap in classification accuracy between algorithms can be large; for example, the classification accuracy of MDG-FS is at least 7.25% higher than that of the other two algorithms. The experimental results show that, since the methods use different graph models, the graph model is the key to achieving good classification performance.

Claims (1)

1. A feature selection method based on graph signal processing is characterized by comprising the following steps:
1) constructing a graph model: take each row of the high-dimensional data set X ∈ R^(M×N) as a node on the graph, i.e. each row X_i (i = 1, 2, ..., M) of the data set represents one node, and each column represents one sampling datum of all the nodes, giving N sampling data X^j (j = 1, 2, ..., N); the obtained sampling data are taken as graph signals, i.e. each column of the high-dimensional data set X is one graph signal, and the topological structure on the graph is constructed according to the similarity of the data between different samples, so that three different graph models between sample data are constructed, namely:
model 1: for the i-th and j-th nodes, the correlation is measured by the correlation coefficient between the sampling data X_i and X_j, and according to the magnitude of the correlation a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the correlation is computed by formula (1):

r(X_i, X_j) = E[(X_i - E(X_i))(X_j - E(X_j))] / sqrt(D(X_i) D(X_j)), i, j ∈ V    (1)

where E(X_i) and D(X_i) are respectively the expectation and the variance of X_i, and V is the set of sample points;
model 2: for each column of the high-dimensional data set X ∈ R^(M×N), generate one graph G_l (l = 1, 2, ..., N), i.e. one graph corresponds to one sampling datum of the nodes, forming N graphs; the correlation between node pairs in G_l is measured by the distance between the l-th sampling data of the i-th sample and of the j-th sample, and a knn graph G is constructed with a nearest neighbor algorithm, i.e. each node on the knn graph G is connected only to the k nodes with the largest correlation values, where the distance between the sampling data of the samples is computed by formula (2):

D_l(i, j) = |X_i(l) - X_j(l)|, i, j ∈ V    (2)
model 3: for each column of the high-dimensional data set X ∈ R^(M×N), model 3 generates one graph G_l (l = 1, 2, ..., N); after the N graphs are formed, the obtained Laplacian matrices of the N graphs are added and averaged, and in the averaged Laplacian matrix the weights of each node i, other than those of its k neighbor nodes with the largest weights, are set to 0; the result is taken as the Laplacian matrix of model 3;
2) calculating sample smoothness: according to the graph model obtained in step 1), the smoothness of the sampling data is calculated using the graph Laplacian matrix and the concept of signal smoothness from graph signal theory, where the graph Laplacian matrix is L = D - W, D is the degree matrix and W is the adjacency matrix, and the signal smoothness in graph signal theory is given by formula (3):

f^T L f = (1/2) Σ_{i,j∈V} W_ij (f_i - f_j)^2    (3)

where f = (f_1, f_2, ..., f_N)^T is a graph signal, L is the Laplacian matrix, W_ij is the weight of the edge between the graph signal values f_i and f_j, and V is the sample node set;
3) obtaining the dimension-reduction result: sort the calculation results of step 2) in descending order, and select the sampling data corresponding to the largest first d values as the dimension-reduction result, where d is the dimension obtained after reducing the high-dimensional data set, i.e. the reduced data set is X ∈ R^(M×d).
CN202011405315.6A 2020-12-04 2020-12-04 Feature selection method based on graph signal processing Pending CN112418142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011405315.6A CN112418142A (en) 2020-12-04 2020-12-04 Feature selection method based on graph signal processing


Publications (1)

Publication Number Publication Date
CN112418142A 2021-02-26

Family

ID=74830169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011405315.6A Pending CN112418142A (en) 2020-12-04 2020-12-04 Feature selection method based on graph signal processing

Country Status (1)

Country Link
CN (1) CN112418142A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779882A (en) * 2021-09-10 2021-12-10 中国石油大学(北京) Method, device, equipment and storage medium for predicting residual service life of equipment
CN113779882B (en) * 2021-09-10 2024-05-17 中国石油大学(北京) Method, device, equipment and storage medium for predicting residual service life of equipment

Similar Documents

Publication Publication Date Title
CN109409416B (en) Feature vector dimension reduction method, medical image identification method, device and storage medium
Montazer et al. An improved radial basis function neural network for object image retrieval
CN110929029A (en) Text classification method and system based on graph convolution neural network
CN107133496B (en) Gene feature extraction method based on manifold learning and closed-loop deep convolution double-network model
KR101687217B1 (en) Robust face recognition pattern classifying method using interval type-2 rbf neural networks based on cencus transform method and system for executing the same
CN110866439B (en) Hyperspectral image joint classification method based on multi-feature learning and super-pixel kernel sparse representation
CN108960341B (en) Brain network-oriented structural feature selection method
CN108921853B (en) Image segmentation method based on super-pixel and immune sparse spectral clustering
CN110910325B (en) Medical image processing method and device based on artificial butterfly optimization algorithm
CN107578063B (en) Image Spectral Clustering based on fast selecting landmark point
CN111611293A (en) Outlier data mining method based on feature weighting and MapReduce
CN108388869B (en) Handwritten data classification method and system based on multiple manifold
CN112418142A (en) Feature selection method based on graph signal processing
Abboud et al. Biometric templates selection and update using quality measures
CN112233742B (en) Medical record document classification system, equipment and storage medium based on clustering
Baswade et al. A comparative study of k-means and weighted k-means for clustering
CN113793667A (en) Disease prediction method and device based on cluster analysis and computer equipment
Biryukova et al. Development of the effective set of features construction technology for texture image classes discrimination
CN112287036A (en) Outlier detection method based on spectral clustering
CN105760872B (en) A kind of recognition methods and system based on robust image feature extraction
Little et al. An analysis of classical multidimensional scaling
CN109063766B (en) Image classification method based on discriminant prediction sparse decomposition model
CN108256569B (en) Object identification method under complex background and used computer technology
CN114037931B (en) Multi-view discriminating method of self-adaptive weight
Hoti et al. A semiparametric density estimation approach to pattern classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210226