CN110827921B - Single cell clustering method and device, electronic equipment and storage medium - Google Patents

Publication number
CN110827921B
Authority
CN
China
Prior art keywords
single cell
matrix
similarity
node
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911097854.5A
Other languages
Chinese (zh)
Other versions
CN110827921A (en)
Inventor
朱晓姝
彭小清
李洪东
王建新
郭立渌
李剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Yulin Normal University
Original Assignee
Central South University
Yulin Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University and Yulin Normal University
Priority to CN201911097854.5A
Publication of CN110827921A
Application granted
Publication of CN110827921B

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a single cell clustering method, a single cell clustering device, electronic equipment and a storage medium. First, local similarities between node pairs are calculated from distance information and used to construct a global feature space. Next, global similarities between node pairs are calculated on that feature space with a multi-kernel learning method. The nodes on all second-order paths between each considered node pair are then included, which brings in more related node information and yields a more effective global similarity calculation method. Finally, the nodes are sorted by degree to fix the order in which they are initially added to communities, which improves the Louvain community detection method, and clustering is performed with the improved method. The method is simple and effective; tests on commonly used single-cell transcriptome sequencing datasets show that it achieves better prediction performance in clustering such data than the other methods compared.

Description

Single cell clustering method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of bioinformatics, in particular to a single cell clustering method and a single cell clustering device, which are used for identifying cell types and providing a basis for analyzing a cell differentiation process.
Background
Traditional bulk cell sequencing is difficult to apply in research that must consider the characteristics of individual cells. Single-cell transcriptome sequencing data (scRNA-seq data) are better suited to studying differences between cells and identifying cell types when studying cell differentiation. With the rapid development of scRNA-seq technology, scRNA-seq data can reflect the gene expression of each cell more accurately and reduce the influence of differences between cells on the analysis of gene expression control, cell behavior and cell types. However, scRNA-seq data are characterized by high dimensionality, small sample sizes and a lack of prior knowledge.
At present, methods for identifying cell types from scRNA-seq data (single cell clustering) fall into two main groups: supervised learning and unsupervised learning. The performance of supervised cell classification depends strongly on the regularization strategy and on prior knowledge, and for clustering currently unknown cells it is limited by the need for richer prior knowledge. Unsupervised cell clustering needs no prior knowledge and can estimate the number of categories automatically, but further research is needed on category number estimation, clustering accuracy and robustness. There is therefore a need for a cell type identification (single cell clustering) method and device with improved accuracy and robustness.
Disclosure of Invention
The technical problem solved by the invention is as follows: in view of the shortcomings of the prior art, the invention provides a single cell clustering method and a single cell clustering device that can automatically estimate the number of clusters without prior knowledge, make full use of global similarity information and improve clustering performance, while being simple, effective and easy to implement.
The technical scheme provided by the invention is as follows:
A single cell clustering method, comprising the steps of:
Step 1: based on the gene expression matrix (i.e., the gene expression matrix of single-cell transcriptome sequencing data, in which columns are single cells and rows are gene expression levels; it can be downloaded from public databases), calculate the similarity between single cell pairs and construct the global feature space matrix S_l. S_l is an N×N matrix, where N is the number of single cells, and the element S_l(i, j) is the similarity (local similarity) between single cell i and single cell j. Each row of S_l gives the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;
Step 2: based on the matrix S_l, calculate the similarity between single cell pairs with a weighted Gaussian kernel function and construct the sparse global similarity matrix S_g;
Step 3: based on the matrix S_g, determine the nearest neighbors of each single cell; calculate the similarity between single cell pairs from their common nearest neighbors and the sum of the distances between those common nearest neighbors and the pair, and construct the path-based global similarity matrix S_p;
Step 4: first construct the graph G = (V, E, W) from the matrix S_p. In G, V = {v_1, v_2, …, v_N} is the node set, each node being a single cell; E is the edge set; W is the set of edge weights, where w_ij ∈ W is the weight of the edge between node v_i and node v_j, and w_ij = S_p(i, j). The similarity matrix S_p is thus both the similarity matrix of the nodes in G and the weight matrix of its edges. Then cluster the nodes in G to complete the single cell clustering;
In the above steps, X(i, j) denotes the element in row i and column j of a matrix X, with i, j = 1, 2, …, N; for example, S_l(i, j) is the element in row i and column j of S_l, and S_p(i, j) is the element in row i and column j of S_p.
A single cell clustering device, comprising a data acquisition module, a similarity matrix construction module and a clustering module:
the data acquisition module is used for acquiring gene expression data;
the similarity matrix construction module is used for constructing the path-based global similarity matrix S_p, as follows:
Step 1: based on the gene expression matrix, calculate the similarity between single cell pairs and construct the global feature space matrix S_l. S_l is an N×N matrix, where N is the number of single cells, and the element S_l(i, j) is the similarity (local similarity) between single cell i and single cell j. Each row of S_l gives the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;
Step 2: based on the matrix S_l, calculate the similarity between single cell pairs with a weighted Gaussian kernel function and construct the sparse global similarity matrix S_g;
Step 3: based on the matrix S_g, determine the nearest neighbors of each single cell; calculate the similarity between single cell pairs from their common nearest neighbors and the sum of the distances between those common nearest neighbors and the pair, and construct the path-based global similarity matrix S_p;
the clustering module is used for performing the single cell clustering, as follows:
Step 4: first construct the graph G = (V, E, W) from the matrix S_p. In G, V = {v_1, v_2, …, v_N} is the node set, each node being a single cell; E is the edge set; W is the set of edge weights, where w_ij ∈ W is the weight of the edge between node v_i and node v_j, and w_ij = S_p(i, j). The similarity matrix S_p is thus both the similarity matrix of the nodes in G and the weight matrix of its edges. Then cluster the nodes in G with the Louvain community detection method to complete the single cell clustering;
In the above steps, X(i, j) denotes the element in row i and column j of a matrix X, with i, j = 1, 2, …, N; for example, S_l(i, j) is the element in row i and column j of S_l, and S_p(i, j) is the element in row i and column j of S_p.
An electronic device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to implement the method of single-cell clustering as described above.
A storage medium having stored thereon a computer program which, when executed by a processor, implements the single-cell clustering method described above.
Advantageous effects:
The invention rests on the assumption that global information supports more accurate similarity calculation than local information, and it fuses more related node information with the global information of the graph. First, local distances between nodes are calculated to construct a global feature space. Then, a multi-kernel similarity calculation method selects the K nearest neighbors (KNN), removes weakly correlated nodes, calculates the global similarity between nodes on the global feature space and denoises it, giving a sparse global similarity matrix. Next, a path-based similarity calculation extends the comparison to all nodes on second-order paths, giving a more effective global similarity calculation method. Finally, clustering is performed with an improved Louvain community detection method, in which the order in which nodes initially join communities is determined by sorting the nodes by degree. The invention makes full use of global information, improves clustering performance and can effectively find cell subpopulations. It is simple and effective, and tests on public datasets show better prediction performance in clustering single-cell transcriptome sequencing data than the other methods compared.
Drawings
FIG. 1 is a flowchart of an embodiment of the present invention (Multi-kernel and path-based global similarity_Louvain, MPGS_Louvain).
FIG. 2 shows clustering result heatmaps on the Deng dataset, comparing the clustering performance of the embodiment of the present invention at different path lengths. FIG. 2(a) is the clustering result heatmap for a path length of 2, and FIG. 2(b) is the clustering result heatmap for a path length of 3.
FIG. 3 shows clustering result heatmaps on the Pollen dataset, comparing the clustering performance of the embodiment of the present invention at different path lengths. FIG. 3(a) is the clustering result heatmap for a path length of 2, and FIG. 3(b) is the clustering result heatmap for a path length of 3.
FIG. 4 shows clustering result heatmaps on the Goolam dataset, comparing the clustering performance of the embodiment of the present invention at different path lengths. FIG. 4(a) is the clustering result heatmap for a path length of 2, and FIG. 4(b) is the clustering result heatmap for a path length of 3.
FIG. 5 shows similarity result heatmaps on the Deng dataset, comparing different similarity calculation methods with the embodiment of the present invention. FIG. 5(a) is Euclidean distance, FIG. 5(b) is Spearman correlation coefficient, FIG. 5(c) is SIMLR, and FIG. 5(d) is MPGS.
FIG. 6 shows similarity result heatmaps on the Pollen dataset, comparing different similarity calculation methods with the embodiment of the present invention. FIG. 6(a) is Euclidean distance, FIG. 6(b) is Spearman correlation coefficient, FIG. 6(c) is SIMLR, and FIG. 6(d) is MPGS.
FIG. 7 shows similarity result heatmaps on the Goolam dataset, comparing different similarity calculation methods with the embodiment of the present invention. FIG. 7(a) is Euclidean distance, FIG. 7(b) is Spearman correlation coefficient, FIG. 7(c) is SIMLR, and FIG. 7(d) is MPGS.
FIG. 8 compares different clustering methods for scRNA-seq data with the embodiment of the present invention on the NMI and ARI evaluation indices. FIG. 8(a) is a box plot of the NMI values of the clustering results of the different clustering methods, and FIG. 8(b) is a box plot of the ARI values of the clustering results of the different clustering methods.
FIG. 9 compares the clustering performance of the embodiment of the present invention under artificial perturbation on the NMI and ARI evaluation indices. FIG. 9(a) shows the effect of artificial perturbation on the NMI of the clustering results, and FIG. 9(b) shows its effect on the ARI of the clustering results.
Detailed Description
The present invention will be further described in detail with reference to the drawings and specific examples.
Example 1:
This embodiment provides a single cell clustering method, which comprises the following steps:
Step 1: based on the gene expression matrix (i.e., the gene expression matrix of single-cell transcriptome sequencing data, in which columns are single cells and rows are gene expression levels; it can be downloaded from public databases), calculate the similarity between single cell pairs and construct the global feature space matrix S_l. S_l is an N×N matrix, where N is the number of single cells, and the element S_l(i, j) is the similarity (local similarity) between single cell i and single cell j. Each row of S_l gives the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;
Step 2: based on the matrix S_l, calculate the similarity between single cell pairs with a weighted Gaussian kernel function and construct the sparse global similarity matrix S_g;
Step 3: based on the matrix S_g, determine the nearest neighbors of each single cell; calculate the similarity between single cell pairs from their common nearest neighbors and the sum of the distances between those common nearest neighbors and the pair, and construct the path-based global similarity matrix S_p;
Step 4: first construct the graph G = (V, E, W) from the matrix S_p. In G, V = {v_1, v_2, …, v_N} is the node set, each node being a single cell; E is the edge set; W is the set of edge weights, where w_ij ∈ W is the weight of the edge between node v_i and node v_j, and w_ij = S_p(i, j). The similarity matrix S_p is thus both the similarity matrix of the nodes in G and the weight matrix of its edges. Then cluster the nodes in G to complete the single cell clustering;
In the above steps, X(i, j) denotes the element in row i and column j of a matrix X, with i, j = 1, 2, …, N; for example, S_l(i, j) is the element in row i and column j of S_l, and S_p(i, j) is the element in row i and column j of S_p.
The graph G in step 4 can also be obtained as follows: first, taking the gene expression matrix of the scRNA-seq data as input, with cells as nodes and gene expression data as node attributes, connect every two nodes by an edge to construct a complete graph; then calculate the global similarity matrix S_p according to steps 1 to 3; then update the weights of the corresponding edges in the complete graph according to the element values in S_p; finally, remove the edges with weight 0 to obtain the graph G of step 4.
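The graph construction of step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: S_p is taken to be a symmetric list of lists, and the function name `build_graph` and the edge-dictionary representation are hypothetical choices made for this example.

```python
def build_graph(s_p):
    """Return the weighted edge set of G as {(i, j): w_ij} for i < j.

    Assumes s_p is a symmetric N x N list of lists; an edge exists only
    where S_p(i, j) is nonzero, which matches removing zero-weight edges
    from the complete graph.
    """
    n = len(s_p)
    return {
        (i, j): s_p[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if s_p[i][j] != 0
    }


# Three cells: 0-1 and 1-2 are similar, 0-2 share no similarity.
example = [
    [0.0, 1.5, 0.0],
    [1.5, 0.0, 0.9],
    [0.0, 0.9, 0.0],
]
edges = build_graph(example)
```

Storing each undirected edge once keeps the edge count equal to the number of nonzero entries above the diagonal of S_p.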
Example 2:
Based on embodiment 1, in step 1 of the single cell clustering method of this embodiment, the Spearman correlation coefficient between the column vectors of two single cells in the gene expression matrix is calculated as their similarity. The Spearman correlation coefficient Rs(i, j) of the column vectors of single cell i and single cell j is calculated as follows:
Step 1: convert the elements S(m, i) and S(m, j) of the column vectors S(:, i) and S(:, j), which correspond to single cell i and single cell j in the gene expression matrix S, into their ranks (positions in descending order) within the respective column vectors, denoted R[S(m, i)] and R[S(m, j)], where m = 1, 2, …, M and M is the number of genes; the gene expression matrix S has M rows and N columns, and N is the number of single cells;
Step 2: calculate the rank differences between the corresponding elements S(m, i) and S(m, j) of the two column vectors and sum their squares:
$$ d(i, j) = \sum_{m=1}^{M} \left( R[S(m, i)] - R[S(m, j)] \right)^2 $$
Step 3: finally, calculate the Spearman correlation coefficient Rs between S(:, i) and S(:, j):
$$ Rs(i, j) = 1 - \frac{6\, d(i, j)}{M (M^2 - 1)} $$
Finally, let S_l(i, j) = Rs(i, j).
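The three steps above can be sketched in plain Python as follows. The sketch assumes the textbook rank-difference form of the Spearman coefficient, which is exact only when a column has no tied values; the function names are hypothetical, and ties are given their average rank as a common convention that the text does not specify.

```python
def ranks_descending(values):
    """Rank values in descending order (largest value gets rank 1).

    Tied values share the average of the positions they occupy.
    """
    order = sorted(range(len(values)), key=lambda m: -values[m])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for t in range(pos, end + 1):
            ranks[order[t]] = avg
        pos = end + 1
    return ranks


def spearman(col_i, col_j):
    """Rs = 1 - 6 * d / (M * (M^2 - 1)), d = sum of squared rank differences."""
    m = len(col_i)
    rank_i, rank_j = ranks_descending(col_i), ranks_descending(col_j)
    d = sum((a - b) ** 2 for a, b in zip(rank_i, rank_j))
    return 1 - 6 * d / (m * (m * m - 1))


def spearman_similarity_matrix(expr):
    """expr holds M gene rows with N per-cell values each; returns the N x N matrix S_l."""
    n = len(expr[0])
    cols = [[row[c] for row in expr] for c in range(n)]
    return [[spearman(cols[i], cols[j]) for j in range(n)] for i in range(n)]


# 4 genes x 3 cells: cell 1 ranks its genes like cell 0, cell 2 in reverse.
expr = [[1, 2, 4], [2, 4, 3], [3, 6, 2], [4, 8, 1]]
s_l = spearman_similarity_matrix(expr)
```

In the example, cells 0 and 1 have identical gene rankings and cell 2 has the reversed ranking, so the corresponding coefficients sit at the two ends of the range.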
Example 3:
Based on embodiment 2, step 2 of the single cell clustering method of this embodiment is implemented through the following steps:
Step 2.1: based on the global feature space matrix S_l, determine the K nearest neighbors (KNN) of each single cell from the element values in each row of the matrix. For single cell i, the single cells corresponding to the K largest elements of the i-th row S_l(i, :) of S_l, excluding S_l(i, i), are its K nearest neighbors, and the set they form is denoted KNN(i); here S_l(i, i) is the i-th element of S_l(i, :), i.e., the element in row i and column i of S_l. This operation filters out weakly correlated nodes on the basis of the global feature space matrix S_l;
Step 2.2: calculate the weighted Gaussian kernel similarity D(i, j) between each single cell pair with the weighted Gaussian kernel function and take it as their similarity S_g(i, j), which gives the similarity matrix S_g. Calculating the similarity between single cell pairs with a weighted Gaussian kernel function filters out data noise; the matrix S_g reflects the global similarity between single cells and is a sparse global similarity matrix;
In this step, the mean distances μ_i and μ_j between single cell i and single cell j, respectively, and their K nearest neighbors are calculated, and the width parameter of the Gaussian kernel function is then determined on the basis of the mean of μ_i and μ_j.
With the weighted Gaussian kernel function, the weighted Gaussian kernel similarity D(i, j) between single cell i and single cell j is calculated as:
Figure GDA0003493164720000061
Figure GDA0003493164720000062
Figure GDA0003493164720000063
where ω_l is the weight of the l-th Gaussian kernel function K_l(s_i, s_j) and can be chosen empirically; ε_ij is the width parameter of the Gaussian kernel function and is determined by the parameter pair (σ, K), so that different parameter pair values give different width parameters and hence different Gaussian kernel functions; S(:, i) and S(:, j) denote the i-th and j-th columns of the gene expression matrix, respectively.
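Steps 2.1 and 2.2 can be sketched as follows. The formulas themselves appear only as images in this text, so the sketch makes explicit assumptions borrowed from the published SIMLR kernel: each kernel is K_l(s_i, s_j) = exp(-||s_i - s_j||^2 / (2 ε_ij^2)) / (ε_ij sqrt(2π)), the width is ε_ij = σ (μ_i + μ_j) / 2, distances are Euclidean distances between expression columns, and D(i, j) is the ω-weighted sum of the kernels. All function names are hypothetical, and the sparsification that makes S_g sparse is not shown, so the function returns a dense matrix.

```python
import math


def knn_from_similarity(s_l, k):
    """KNN(i): indices of the K largest entries in row i of S_l, excluding i itself."""
    n = len(s_l)
    return [
        sorted((j for j in range(n) if j != i), key=lambda j: -s_l[i][j])[:k]
        for i in range(n)
    ]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def weighted_gaussian_similarity(cells, s_l, param_pairs, weights):
    """cells[i] is the expression vector S(:, i); param_pairs holds (sigma, K) values
    with K >= 1; weights holds one omega_l per pair. Returns the dense matrix D."""
    n = len(cells)
    dist = [[euclidean(cells[i], cells[j]) for j in range(n)] for i in range(n)]
    d = [[0.0] * n for _ in range(n)]
    for (sigma, k), omega in zip(param_pairs, weights):
        knn = knn_from_similarity(s_l, k)
        # mu[i]: mean distance from cell i to its K nearest neighbours
        mu = [sum(dist[i][j] for j in knn[i]) / len(knn[i]) for i in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                eps = sigma * (mu[i] + mu[j]) / 2  # assumed form of the width parameter
                kernel = math.exp(-dist[i][j] ** 2 / (2 * eps ** 2)) / (eps * math.sqrt(2 * math.pi))
                d[i][j] += omega * kernel
    return d


# Two tight pairs of cells far apart; negative distance stands in for S_l here.
cells = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
toy_s_l = [[-abs(a[0] - b[0]) for b in cells] for a in cells]
d = weighted_gaussian_similarity(cells, toy_s_l, [(1.0, 1), (2.0, 2)], [0.5, 0.5])
```

The sketch does not guard against a zero width parameter, which would arise if a cell and all of its nearest neighbours had identical expression vectors.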
Example 4:
Based on embodiment 3, the single cell clustering method of this embodiment determines the value of ω_l in step 2.2 through the following steps:
First, ω_l, l = 1, 2, …, G, is initialized to
Figure GDA0003493164720000071
where G is the number of Gaussian kernel functions; the optimal solution is then obtained by iteratively solving the following objective function:
Figure GDA0003493164720000072
subject to the constraints:
L^T L = I_C
Figure GDA0003493164720000073
Figure GDA0003493164720000074
where I_N is the N×N identity matrix, I_C is the C×C identity matrix, C is the number of classes, L is an N×C rank constraint matrix, and β, γ and ρ are non-negative empirical parameters;
Figure GDA0003493164720000075
denotes the Frobenius norm (F-norm) of a matrix, and D is the matrix whose element in row i and column j is D(i, j).
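For orientation, the published SIMLR formulation, which embodiment 5 names as the implementation of step 2.2, has an objective of the following shape. It is offered as a hedged reading aid and not as a transcription of the formula images above: the symbol S for the learned similarity matrix is an assumption taken from SIMLR (the text above calls the corresponding matrix D), and the uniform initialization and the two constraints on ω_l and on the rows of S are likewise assumed from SIMLR, since they appear here only as images.

```latex
% Assumed uniform initialization of the kernel weights
\omega_l^{(0)} = \frac{1}{G}, \qquad l = 1, 2, \dots, G

% Assumed SIMLR-style objective; S plays the role of the matrix D in the text
\min_{S,\,L,\,\omega} \;
  -\sum_{i,j,l} \omega_l \, K_l(s_i, s_j) \, S(i, j)
  + \beta \, \lVert S \rVert_F^2
  + \gamma \, \operatorname{tr}\!\left( L^T (I_N - S) L \right)
  + \rho \sum_{l} \omega_l \log \omega_l

% Constraints: the first is stated in the text, the others are assumed
\text{s.t.} \quad L^T L = I_C, \qquad
  \sum_{l} \omega_l = 1, \;\; \omega_l \ge 0, \qquad
  \sum_{j} S(i, j) = 1, \;\; S(i, j) \ge 0
```

Read this way, β weights the Frobenius-norm regularizer, γ weights the rank constraint carried by L, and ρ weights the entropy term that keeps the kernel weights from collapsing onto a single kernel.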
Example 5:
Based on embodiment 3, in step 2.2 of the single cell clustering method of this embodiment, the weighted Gaussian kernel similarity D(i, j) between each single cell pair is calculated with SIMLR (single-cell interpretation via multi-kernel learning), a single cell interpretation tool based on multi-kernel learning that can calculate the similarity between cells.
Example 6:
Based on embodiment 5, the single cell clustering method of this embodiment takes the parameter σ from {1.0, 1.25, 1.5, 1.75, 2} and the parameter K from {10, 12, 14, …, 30}, which gives 55 different parameter pair values, hence 55 different width parameters and 55 different Gaussian kernel functions.
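The count of 55 follows from 5 values of σ times 11 values of K, as the short enumeration below confirms. The variable names are illustrative only, and the step of 2 between successive K values is read from the listed values 10, 12, 14.

```python
from itertools import product

sigmas = [1.0, 1.25, 1.5, 1.75, 2.0]
ks = list(range(10, 31, 2))  # 10, 12, ..., 30: eleven values

# One (sigma, K) pair per Gaussian kernel function
param_pairs = list(product(sigmas, ks))
print(len(param_pairs))  # 55
```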
Example 7:
Based on embodiment 6, step 3 of the single cell clustering method of this embodiment is implemented through the following steps:
Step 3.1: expand the path length and determine the second-order paths;
Based on the sparse global similarity matrix S_g, determine the K nearest neighbors of single cell i and of single cell j from the element values in each row of the matrix, and determine the nearest neighbors common to single cell i and single cell j. Each path of length 2 that connects single cell i and single cell j through a common nearest neighbor as intermediate node is a second-order path;
Step 3.2: construct the path-based global similarity matrix S_p from the second-order paths;
Calculate the sum of the distances between single cell i and single cell j and the intermediate nodes on all second-order paths connecting them, i.e., the sum of the corresponding elements in the global similarity matrix S_g, with the following formula:
$$ S_p(i, j) = \sum_{k \in KNN(i) \cap KNN(j)} \left[ S_g(i, k) + S_g(j, k) \right] $$
where S_g(i, k) is the element in row i and column k of the matrix S_g, and S_g(j, k) is the element in row j and column k of S_g.
For directly connected node pairs, expanding the path length enlarges the set of nodes that take part in the similarity calculation: instead of considering only the two nodes themselves, the calculation also covers the nodes on longer paths connecting them, so the calculated similarity better reflects global information. FIGS. 2-4 compare the clustering results of the method on three commonly used datasets for path lengths of 2 and 3. The results show similar performance for the two path lengths, so this embodiment uses an expanded path length of 2, which lets the calculated similarity reflect global information while keeping the amount of computation down.
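Steps 3.1 and 3.2 can be sketched as follows. The sketch assumes, following the description above, that the path-based similarity of a pair is the sum of S_g(i, k) + S_g(j, k) over the common nearest neighbors k of the pair; the function name and the tie-breaking among equal similarities (lowest index first) are hypothetical choices for this example.

```python
def path_based_similarity(s_g, k):
    """Build S_p from S_g using second-order paths through common nearest neighbours."""
    n = len(s_g)
    # KNN(i): the K most similar cells to i according to row i of S_g, excluding i
    knn = [
        set(sorted((j for j in range(n) if j != i), key=lambda j: -s_g[i][j])[:k])
        for i in range(n)
    ]
    s_p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            common = knn[i] & knn[j]  # intermediate nodes of the second-order paths
            total = sum(s_g[i][m] + s_g[j][m] for m in common)
            s_p[i][j] = s_p[j][i] = total
    return s_p


# Cells 0, 1 and 2 are mutually similar; cell 3 is only weakly related to them.
s_g = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.1],
    [0.8, 0.7, 0.0, 0.2],
    [0.1, 0.1, 0.2, 0.0],
]
s_p = path_based_similarity(s_g, 2)
```

Pairs without any common nearest neighbor keep a similarity of 0, so they produce no edge when the graph G is built from S_p.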
Example 8:
Based on embodiment 7, step 4 of the single cell clustering method of this embodiment is implemented through the following steps:
First, the edges of the graph G are determined: if S_p(i, j) ≠ 0, there is an edge between node v_i and node v_j with weight w_ij = S_p(i, j); otherwise there is no edge between node v_i and node v_j;
Then the nodes in G are clustered with the Louvain community detection method, as follows:
Step 4.1: calculate the degree of each node in G (the degree of a node is the number of edges connected to it), sort the nodes by degree, and denote the sorted nodes v'_1, v'_2, …, v'_N in order;
Step 4.2: iteratively calculate the clustering result according to the modularity maximization principle, as follows:
1) initialize each node in G as its own community; calculate the modularity function value at this point and store it in Q_1;
2) let Q_2 = Q_1;
3) let i = 1;
4) tentatively merge the community of v'_i with the community of each of its neighbor nodes in turn (the neighbor nodes are the nodes connected to v'_i by an edge), and calculate the increment ΔQ of the modularity function value relative to Q_2 for each merge; if ΔQ > 0 for at least one merge, select the merge with the largest ΔQ and merge the community of v'_i with the community of the corresponding neighbor node; otherwise do not merge;
5) check whether i equals N; if so, calculate the current modularity function value, store it in Q_1 and go to step 6); otherwise let i = i + 1 and return to step 4);
6) check whether Q_1 > Q_2; if so, return to step 2) and iterate; otherwise the modularity function value can no longer increase and has reached its maximum, so the iteration ends and the current community partition, i.e., the subpopulations obtained by clustering the single cells, is output;
where the modularity function is:
$$ Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{K_i K_j}{2m} \right] \delta(c_i, c_j) $$
where A_ij is the weight of the edge connecting node v'_i and node v'_j, with A_ij = 0 if there is no edge between them; K_i and K_j are the sums of the weights of the edges connected to node v'_i and node v'_j, respectively; c_i and c_j are the communities to which node v'_i and node v'_j belong; δ(c_i, c_j) indicates whether c_i and c_j are the same community, with δ(c_i, c_j) = 1 if they are and δ(c_i, c_j) = 0 otherwise; and
$$ m = \frac{1}{2} \sum_{i,j} A_{ij} $$
is the sum of the weights of all edges in the graph G.
According to the degree centrality rule, nodes with higher degree are more important. This step therefore calculates the degree of each node, sorts the nodes by degree, and then searches for natural partitions with an ordered Louvain community detection method under the modularity maximization principle; that is, in the iterative calculation, nodes are selected in the sorted order instead of at random, which is the improvement over the Louvain community detection method. This realizes the clustering and can increase the clustering speed.
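A simplified sketch of the degree-ordered procedure is given below. It is an illustration under stated assumptions rather than the patented implementation: the adjacency matrix is symmetric with a zero diagonal and at least one edge, nodes are visited from highest to lowest degree (the text does not state the sort direction), a "merge" is realized as moving the visited node into a neighbor's community as in the local-moving phase of standard Louvain, the gain is measured against the current modularity, and modularity is recomputed from scratch for clarity instead of being updated incrementally. The function names are hypothetical.

```python
def modularity(adj, comm):
    """Q = (1/2m) * sum_ij [A_ij - K_i*K_j/(2m)] * delta(c_i, c_j)."""
    n = len(adj)
    strength = [sum(row) for row in adj]  # K_i: total weight of the edges at node i
    two_m = sum(strength)                 # equals 2m for a symmetric, zero-diagonal matrix
    q = 0.0
    for i in range(n):
        for j in range(n):
            if comm[i] == comm[j]:
                q += adj[i][j] - strength[i] * strength[j] / two_m
    return q / two_m


def ordered_louvain(adj, tol=1e-12):
    """Visit nodes in degree order and greedily move them while modularity grows."""
    n = len(adj)
    degree = [sum(1 for w in row if w != 0) for row in adj]
    order = sorted(range(n), key=lambda v: -degree[v])  # assumed: highest degree first
    comm = list(range(n))  # every node starts as its own community
    q_prev = modularity(adj, comm)
    while True:
        for v in order:
            best_q, best_c = modularity(adj, comm), comm[v]
            for u in range(n):
                if u != v and adj[v][u] != 0 and comm[u] != comm[v]:
                    trial = comm[:]
                    trial[v] = comm[u]  # tentatively move v into the neighbour's community
                    q = modularity(adj, trial)
                    if q > best_q:
                        best_q, best_c = q, comm[u]
            comm[v] = best_c
        q_now = modularity(adj, comm)
        if q_now - q_prev <= tol:  # a full pass brought no gain: stop
            return comm
        q_prev = q_now


# Two triangles (nodes 0-2 and 3-5) joined by one weak edge between nodes 2 and 3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
communities = ordered_louvain(adj)
```

The sketch stops after the local-moving passes and omits the community aggregation phase of the full Louvain method, which the steps above do not describe either.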
The specific implementation process of the single cell clustering method in this embodiment is shown in fig. 1.
Example 9:
This embodiment provides a single cell clustering device, which comprises a data acquisition module, a similarity matrix construction module and a clustering module:
the data acquisition module is used for acquiring gene expression data;
the similarity matrix construction module is used for constructing the path-based global similarity matrix S_p, as follows:
Step 1: based on the gene expression matrix, calculate the similarity between single cell pairs and construct the global feature space matrix S_l. S_l is an N×N matrix, where N is the number of single cells, and the element S_l(i, j) is the similarity (local similarity) between single cell i and single cell j. Each row of S_l gives the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;
Step 2: based on the matrix S_l, calculate the similarity between single cell pairs with a weighted Gaussian kernel function and construct the sparse global similarity matrix S_g;
Step 3: based on the matrix S_g, determine the nearest neighbors of each single cell; calculate the similarity between single cell pairs from their common nearest neighbors and the sum of the distances between those common nearest neighbors and the pair, and construct the path-based global similarity matrix S_p;
the clustering module is used for performing the single cell clustering, as follows:
Step 4: first construct the graph G = (V, E, W) from the matrix S_p. In G, V = {v_1, v_2, …, v_N} is the node set, each node being a single cell; E is the edge set; W is the set of edge weights, where w_ij ∈ W is the weight of the edge between node v_i and node v_j, and w_ij = S_p(i, j). The similarity matrix S_p is thus both the similarity matrix of the nodes in G and the weight matrix of its edges. Then cluster the nodes in G with the Louvain community detection method to complete the single cell clustering;
In the above steps, X(i, j) denotes the element in row i and column j of a matrix X, with i, j = 1, 2, …, N; for example, S_l(i, j) is the element in row i and column j of S_l, and S_p(i, j) is the element in row i and column j of S_p.
The implementation process of the steps can adopt the method given in any one of the above embodiments.
Example 10:
the present embodiment provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to implement the single cell clustering method described in any one of the above embodiments.
Example 11:
the present embodiment provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the single-cell clustering method described in any one of the above embodiments.
The embodiments above represent only several implementations of the present application. Although they are described specifically and in detail, they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art may make variations and modifications without departing from the concept of the present application, and all of these fall within the scope of protection of the present application.
Experimental verification
1. Evaluation index
The clusters predicted by the embodiment of the invention (MPGS_Louvain) cannot be expected to match the labeled clusters exactly. NMI (normalized mutual information) and ARI (adjusted Rand index) are common clustering performance evaluation indexes and can be used to evaluate the consistency between a predicted clustering result and the known labels. NMI and ARI are therefore suitable for evaluating the clustering performance of MPGS.
In addition to NMI and ARI, heat maps can be used to visually reflect cell clustering results and the related highly expressed genes, and are useful for evaluating the performance of similarity calculations. This experiment therefore also uses heat maps to represent and evaluate the similarity calculation performance of the MPGS method.
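As an illustration of the two indexes, the following is a minimal sketch of ARI and NMI computed from the contingency counts of two label vectors. The function names are hypothetical, and the NMI normalisation (arithmetic mean of the two entropies) is an assumption, since the text does not state which variant was used in the experiments.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts; assumes at least two labeled cells."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))   # contingency table
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:        # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_info(labels_true, labels_pred):
    """NMI = 2 * I(U; V) / (H(U) + H(V)); the normalisation is an assumed choice."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    mutual_info = sum(
        (c / n) * log(c * n / (rows[a] * cols[b])) for (a, b), c in cells.items()
    )
    h_true = -sum((c / n) * log(c / n) for c in rows.values())
    h_pred = -sum((c / n) * log(c / n) for c in cols.values())
    if h_true + h_pred == 0:         # both partitions consist of a single cluster
        return 1.0
    return 2 * mutual_info / (h_true + h_pred)
```

Both indexes depend only on how the cells are grouped, not on the label names, so a predicted clustering that matches the known labels up to renaming scores 1 on each.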
2. Comparison with other similarity calculation methods
To evaluate the effectiveness of the similarity calculation proposed in the present embodiment, the similarity calculation method of the present embodiment (MPGS) was compared with three other similarity calculation methods (Euclidean distance, Spearman correlation, and SIMLR). Among them, Euclidean distance and Spearman correlation are classical similarity calculation methods, and SIMLR is a more recent, high-performance similarity calculation method. Heat maps of the similarity calculation results are shown in FIGS. 5-7, where (a) is Euclidean distance, (b) is the Spearman correlation coefficient, (c) is SIMLR, and (d) is MPGS. In the heat maps, white indicates a high similarity value and black indicates a low similarity value. As can be seen from FIGS. 5-7, directly using the two classical similarity calculations, Euclidean distance and the Spearman correlation coefficient, produces heat maps in which the block structure is not evident. With the SIMLR and MPGS methods the block structure is more obvious, and the heat maps of MPGS show the clearest blocks, with no abnormal points. In this comparison, MPGS is therefore superior to the other similarity calculation methods.
3. Comparison with other clustering methods
To evaluate the effectiveness of the clustering method proposed by the present invention, the embodiment of the invention (MPGS_Louvain) was compared with five other clustering methods (NMF, SNN-cliq, SIMLR, SSE, SSNN-Louvain). The clustering results are shown in FIG. 8, in which FIG. 8(a) is a box plot of the NMI values of the clustering results of the different methods, and FIG. 8(b) is a box plot of the corresponding ARI values. As can be seen from FIG. 8, on both NMI and ARI the MPGS method has the highest maximum, median, and upper and lower quartile values, and has no outliers. In this comparison, the invention is therefore superior to the other clustering methods.
4. Effect of Artificial perturbation on the Performance of the invention
To analyze and evaluate the influence of artificial perturbation on the clustering performance of the embodiment of the invention, and thereby verify its robustness, the experiment was run 100 times, each time on 95% of the data selected using a random seed. The mean and standard deviation of the NMI and ARI values of the clustering results were then calculated, as shown in FIG. 9: FIG. 9(a) compares NMI under artificial perturbation, and FIG. 9(b) compares ARI under artificial perturbation. As can be seen from FIG. 9, the mean over the 100 runs is very close to the result of MPGS, and the standard deviations are less than 5%. The method is therefore insensitive to artificial perturbation and has good robustness.
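The perturbation protocol can be sketched as a small resampling loop. This is only an outline under assumptions: `cluster_fn` and `score_fn` are hypothetical callables standing in for the clustering method and for an index such as NMI or ARI, and seeding each run with `base_seed + run` is one possible reading of "selected using random seeds".

```python
import random
from statistics import mean, stdev


def subsample_scores(cells, labels, cluster_fn, score_fn,
                     runs=100, fraction=0.95, base_seed=0):
    """Cluster `runs` random subsets and return (mean, standard deviation) of the score.

    cells      : list of per-cell data items (any representation)
    labels     : known label of each cell, same order as `cells`
    cluster_fn : hypothetical callable, list of cells -> list of predicted labels
    score_fn   : hypothetical callable, (true labels, predicted labels) -> float
    """
    n = len(cells)
    size = round(fraction * n)
    scores = []
    for run in range(runs):
        rng = random.Random(base_seed + run)      # one seed per run, reproducible
        keep = sorted(rng.sample(range(n), size)) # indices of the retained cells
        sub_cells = [cells[i] for i in keep]
        sub_labels = [labels[i] for i in keep]
        predicted = cluster_fn(sub_cells)
        scores.append(score_fn(sub_labels, predicted))
    return mean(scores), stdev(scores)
```

A small standard deviation across runs, together with a mean close to the full-data score, is what the text above interprets as robustness.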

Claims (10)

1. A single cell clustering method is characterized by comprising the following steps:
Step 1, calculating the similarity between single cell pairs based on a gene expression matrix, and constructing a global feature space matrix S_l; S_l is an N×N matrix, where N is the number of single cells and the element S_l(i, j) represents the similarity of single cell i and single cell j; each row of S_l represents the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;
Step 2, based on the matrix S_l, calculating the similarity between single cell pairs using a weighted Gaussian kernel function, and constructing a sparse global similarity matrix S_g;
Step 3, based on the matrix S_g, determining the nearest neighbors of each single cell; calculating the similarity between single cell pairs based on the common nearest neighbors of each pair and the sum of the distances between those common nearest neighbors and the two cells of the pair, and constructing a path-based global similarity matrix S_p;
Step 4, first constructing a graph G = (V, E, W) based on the matrix S_p, where V = {v_1, v_2, …, v_N} is the node set of graph G and each node is a single cell; E is the edge set; W is the set of edge weights, and w_ij ∈ W represents the weight of the edge between node v_i and node v_j, with w_ij = S_p(i, j); then clustering the nodes in graph G to complete the single cell clustering;
in the above steps, X(i, j) represents the element in the i-th row and j-th column of a matrix X, where i, j = 1, 2, …, N; for example, S_l(i, j) represents the element in the i-th row and j-th column of matrix S_l, and S_p(i, j) represents the element in the i-th row and j-th column of matrix S_p.
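The graph of Step 4 can be held as a weighted adjacency dictionary. The sketch below is one possible representation, not the patented implementation: the function name is hypothetical, S_p is assumed to be a symmetric list of lists, self-loops are ignored, and an edge is kept only where S_p(i, j) is nonzero.

```python
def build_weighted_graph(s_p):
    """Return {i: {j: w_ij}} with w_ij = S_p(i, j) for every nonzero off-diagonal entry.

    Assumes s_p is a square, symmetric matrix given as a list of lists.
    """
    n = len(s_p)
    graph = {i: {} for i in range(n)}      # V: one node per single cell
    for i in range(n):
        for j in range(i + 1, n):          # upper triangle only; mirrored below
            weight = s_p[i][j]
            if weight != 0:                # E: edge only where similarity is nonzero
                graph[i][j] = weight       # W: the edge weight is the similarity
                graph[j][i] = weight
    return graph
```

Because S_p is sparse, a dictionary of neighbors keeps only the entries that actually become edges, which is the form a community detection routine typically iterates over.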
2. The single cell clustering method according to claim 1, wherein in Step 1, based on the gene expression matrix, the Spearman correlation coefficient between the column vectors corresponding to two single cells is calculated as their similarity; that is, the Spearman correlation coefficient Rs(i, j) of the column vectors of single cell i and single cell j in the gene expression matrix is calculated and taken as S_l(i, j), i, j = 1, 2, …, N.
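A minimal sketch of this step, assuming the gene expression matrix is a list of rows with one row per gene and one column per cell. The function names are hypothetical, ties are given average ranks (a common convention that the claim does not specify), and every cell is assumed to have non-constant expression so that the correlation is defined.

```python
from math import sqrt


def average_ranks(values):
    """Ranks 1..n of `values`; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1       # mean rank of the tied block
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for a constant vector."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_similarity_matrix(expr):
    """expr[g][c] is the expression of gene g in cell c; returns the N x N matrix S_l."""
    n_cells = len(expr[0])
    # Spearman correlation is the Pearson correlation of the rank-transformed columns.
    ranked = [average_ranks([row[c] for row in expr]) for c in range(n_cells)]
    return [[pearson(ranked[i], ranked[j]) for j in range(n_cells)]
            for i in range(n_cells)]
```

Ranking each column once and reusing the ranks avoids repeating the sort for every cell pair.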
3. The single cell clustering method according to claim 1, wherein Step 2 comprises the following steps:
Step 2.1, based on the global feature space matrix S_l, determining the K nearest neighbors of each single cell according to the element values in each row of the matrix; for single cell i, the single cells corresponding to the K largest elements of the i-th row S_l(i, :) of matrix S_l, excluding S_l(i, i), are its K nearest neighbors, and the set formed by these K nearest neighbors is denoted KNN(i);
Step 2.2, calculating the weighted Gaussian kernel similarity between each single cell pair using the weighted Gaussian kernel function, and taking it as the value of the corresponding element of S_g;
the weighted Gaussian kernel similarity D(i, j) between single cell i and single cell j is calculated using the weighted Gaussian kernel function as follows:
Figure FDA0002268898720000021
Figure FDA0002268898720000022
Figure FDA0002268898720000023
where ω_l denotes the weight of the l-th Gaussian kernel function K_l(s_i, s_j); ε_ij is the width parameter of the Gaussian kernel function and is determined by the parameter pair (σ, K), so that different values of the parameter pair give different width parameters and hence different Gaussian kernel functions; S(:, i) and S(:, j) denote the i-th and j-th columns of the gene expression matrix, respectively;
let S_g(i, j) = D(i, j).
4. The single cell clustering method according to claim 3, wherein in Step 2.2, ω_l is first initialized to
Figure FDA0002268898720000024
where G is the number of Gaussian kernel functions; the optimal solution is then obtained by iteratively solving the following objective function:
Figure FDA0002268898720000025
subject to the constraints:
L^T L = I_C
Figure FDA0002268898720000026
Figure FDA0002268898720000027
where I_N is the N×N identity matrix, I_C is the C×C identity matrix, C is the number of classes, L is an N×C rank-constraint matrix, and β, γ and ρ are non-negative empirical parameters;
Figure FDA0002268898720000028
denotes the Frobenius norm of a matrix, F-norm for short.
5. The single cell clustering method according to claim 3, wherein the value range of the parameter σ is set to {1.0, 1.25, 1.5, 1.75, 2} and the value range of the parameter K is set to {10, 12, 14, …, 30}, giving 55 different parameter pairs, and therefore 55 different width parameters and 55 different Gaussian kernel functions.
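The count of 55 follows from 5 values of σ times 11 values of K. A short sketch of enumerating the pairs (the variable names are illustrative only):

```python
from itertools import product

sigmas = [1.0, 1.25, 1.5, 1.75, 2.0]        # 5 values of sigma
ks = list(range(10, 31, 2))                 # 10, 12, ..., 30: 11 values of K
param_pairs = list(product(sigmas, ks))     # one (sigma, K) pair per Gaussian kernel
```

Each pair in `param_pairs` would then determine one width parameter and hence one kernel.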
6. The single cell clustering method according to claim 1, wherein Step 3 comprises the following steps:
Step 3.1, determining the common nearest neighbors of each single cell pair;
first, based on the sparse global similarity matrix S_g, determining the K nearest neighbors of single cell i and of single cell j according to the element values in each row of the matrix; for single cell i, the single cells corresponding to the K largest elements of the i-th row S_g(i, :) of matrix S_g, excluding S_g(i, i), are its K nearest neighbors, and the set formed by them is denoted KNN(i); for single cell j, the single cells corresponding to the K largest elements of the j-th row S_g(j, :) of matrix S_g, excluding S_g(j, j), are its K nearest neighbors, and the set formed by them is denoted KNN(j); then determining the set of nearest neighbors shared by single cell i and single cell j, KNN(i) ∩ KNN(j);
Step 3.2, constructing the path-based global similarity matrix S_p;
calculating the element S_p(i, j) in the i-th row and j-th column of S_p according to the following formula:
Figure FDA0002268898720000031
where S_g(i, k) represents the element in the i-th row and k-th column of matrix S_g, and S_g(j, k) represents the element in the j-th row and k-th column of matrix S_g.
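The neighbor selection of Step 3.1 can be sketched as follows. The function names are hypothetical, the similarity matrix is assumed to be a list of lists, and ties between equal similarities are broken by lower index, which the claim does not specify. The formula for S_p(i, j) itself is given only as an image in this text and is therefore not implemented here.

```python
def k_nearest_neighbors(sim, i, k):
    """Set of the k cells with the largest sim[i][j], excluding cell i itself."""
    candidates = [j for j in range(len(sim)) if j != i]
    # Stable sort: among equal similarities the lower index comes first (assumed rule).
    candidates.sort(key=lambda j: sim[i][j], reverse=True)
    return set(candidates[:k])


def shared_neighbors(sim, i, j, k):
    """Common nearest neighbors of cells i and j: KNN(i) intersected with KNN(j)."""
    return k_nearest_neighbors(sim, i, k) & k_nearest_neighbors(sim, j, k)
```

The same `k_nearest_neighbors` helper would also serve for Step 2.1, where the ranking is taken over rows of S_l instead of S_g.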
7. The single cell clustering method according to claim 1, wherein in Step 4, the edges in graph G are determined as follows: if S_p(i, j) ≠ 0, there is an edge between node v_i and node v_j, and the weight of this edge is w_ij = S_p(i, j); otherwise, there is no edge between node v_i and node v_j;
in Step 4, clustering the nodes in graph G using the Louvain community detection method comprises the following steps:
Step 4.1, calculating the degree of each node in graph G, sorting the nodes by degree, and denoting the sorted nodes in order as v'_1, v'_2, …, v'_N;
Step 4.2, iteratively calculating the clustering result according to the modularity maximization principle, specifically:
1) initializing each node in graph G as its own community; calculating the modularity function value at this point and storing it in Q_1;
2) letting Q_2 = Q_1;
3) letting i = 1;
4) tentatively merging, in turn, the community containing node v'_i with the community containing each of its neighbor nodes, and calculating for each such merge the change ΔQ of the modularity function value relative to Q_2; if ΔQ > 0 for at least one merge, selecting the merge with the largest ΔQ and merging the community containing v'_i with the community containing the corresponding neighbor node; otherwise, not merging;
5) judging whether i is equal to N; if so, calculating the modularity function value at this point, storing it in Q_1, and going to step 6); otherwise, letting i = i + 1 and returning to step 4);
6) judging whether Q_1 > Q_2; if so, returning to step 2) to iterate; otherwise, the modularity function value can no longer increase and has reached its maximum, the iteration ends, and the community division at this point is output as the single cell clustering result;
wherein the modularity function is:
Figure FDA0002268898720000041
in the formula, A_ij denotes the weight of the edge connecting node v'_i and node v'_j, and A_ij = 0 if there is no edge between node v'_i and node v'_j; K_i and K_j denote the sums of the weights of the edges connected to node v'_i and node v'_j, respectively; c_i and c_j denote the communities in which node v'_i and node v'_j are located, respectively;
Figure FDA0002268898720000042
indicates whether c_i and c_j are the same community: if c_i and c_j are the same community,
Figure FDA0002268898720000043
otherwise
Figure FDA0002268898720000044
Figure FDA0002268898720000045
is the sum of the weights of all edges in graph G.
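The merging pass of Step 4.2 can be sketched as below. Several points are assumptions rather than statements of the claim: the modularity formula appears only as an image here, so the sketch uses the standard weighted modularity Q = (1/2m) Σ_ij [A_ij − K_i K_j / (2m)] δ(c_i, c_j), which matches the symbol definitions above; nodes are visited in descending order of degree (the sort direction is not stated); ΔQ is measured against the modularity of the current partition; and the graph is a symmetric dictionary {i: {j: w_ij}} with at least one edge. The function names are hypothetical, and the community-aggregation phase of the full Louvain method is not shown.

```python
def modularity(graph, community):
    """Standard weighted modularity of a partition (assumed to match the claimed Q)."""
    strength = {v: sum(nbrs.values()) for v, nbrs in graph.items()}   # K_i
    two_m = sum(strength.values())                                    # 2m
    internal, total = {}, {}
    for v, nbrs in graph.items():
        c = community[v]
        total[c] = total.get(c, 0.0) + strength[v]
        for u, w in nbrs.items():
            if community[u] == c:                 # delta(c_i, c_j) = 1
                internal[c] = internal.get(c, 0.0) + w
    return sum(internal.get(c, 0.0) / two_m - (total[c] / two_m) ** 2 for c in total)


def merged(community, src, dst):
    """Copy of the partition in which community `src` is absorbed into `dst`."""
    return {v: (dst if c == src else c) for v, c in community.items()}


def louvain_merging_pass(graph, eps=1e-12):
    """Steps 1) to 6): repeat sweeps over the sorted nodes until Q stops increasing."""
    order = sorted(graph, key=lambda v: len(graph[v]), reverse=True)  # assumed direction
    community = {v: v for v in graph}             # 1) every node is its own community
    q1 = modularity(graph, community)
    while True:
        q2 = q1                                   # 2)
        for v in order:                           # 3) to 5)
            current_q = modularity(graph, community)
            best_gain, best_target = eps, None
            for u in graph[v]:                    # 4) try each neighbor's community
                if community[u] == community[v]:
                    continue
                trial = merged(community, community[v], community[u])
                gain = modularity(graph, trial) - current_q
                if gain > best_gain:
                    best_gain, best_target = gain, community[u]
            if best_target is not None:
                community = merged(community, community[v], best_target)
        q1 = modularity(graph, community)
        if not q1 > q2 + eps:                     # 6) no further increase: stop
            return community, q1
```

Recomputing the modularity for every trial merge is quadratic and only suitable for small illustrations; practical implementations update ΔQ incrementally from the community totals.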
8. A single cell clustering device, characterized by comprising a data acquisition module, a similarity matrix construction module and a clustering module:
the data acquisition module is used for acquiring gene expression data;
a similarity matrix construction module for constructing a path-based global similarity matrix SpThe implementation process is as follows:
Step 1, calculating the similarity between single cell pairs based on the gene expression data, and constructing a global feature space matrix S_l; S_l is an N×N matrix, where N is the number of single cells and the element S_l(i, j) represents the similarity of single cell i and single cell j; each row of S_l represents the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;
Step 2, based on the matrix S_l, calculating the similarity between single cell pairs using a weighted Gaussian kernel function, and constructing a sparse global similarity matrix S_g;
Step 3, based on the matrix S_g, determining the nearest neighbors of each single cell; calculating the similarity between single cell pairs based on the common nearest neighbors of each pair and the sum of the distances between those common nearest neighbors and the two cells of the pair, and constructing a path-based global similarity matrix S_p;
the clustering module is used for realizing single cell clustering, and the realization process is as follows:
Step 4, first constructing a graph G = (V, E, W) based on the matrix S_p, where V = {v_1, v_2, …, v_N} is the node set of graph G and each node is a single cell; E is the edge set; W is the set of edge weights, and w_ij ∈ W represents the weight of the edge between node v_i and node v_j, with w_ij = S_p(i, j); then clustering the nodes in graph G using the Louvain community detection method, which completes the single cell clustering;
in the above steps, X(i, j) represents the element in the i-th row and j-th column of a matrix X, with i, j = 1, 2, …, N; for example, S_l(i, j) represents the element in the i-th row and j-th column of matrix S_l, and S_p(i, j) represents the element in the i-th row and j-th column of matrix S_p.
9. An electronic device comprising a memory and a processor, the memory having stored therein a computer program, wherein the computer program, when executed by the processor, causes the processor to implement the method of any of claims 1-7.
10. A storage medium on which a computer program is stored, which computer program, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN201911097854.5A 2019-11-12 2019-11-12 Single cell clustering method and device, electronic equipment and storage medium Active CN110827921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911097854.5A CN110827921B (en) 2019-11-12 2019-11-12 Single cell clustering method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110827921A CN110827921A (en) 2020-02-21
CN110827921B true CN110827921B (en) 2022-06-14

Family

ID=69554127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911097854.5A Active CN110827921B (en) 2019-11-12 2019-11-12 Single cell clustering method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110827921B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112098850B (en) * 2020-09-21 2024-03-08 山东工商学院 Lithium ion battery voltage fault diagnosis method and system based on SDO algorithm
CN112750502B (en) * 2021-01-18 2022-04-15 中南大学 Single cell transcriptome sequencing data clustering recommendation method based on two-dimensional distribution structure judgment
CN112820353B (en) * 2021-01-22 2023-10-03 中山大学 Method and system for analyzing cell fate conversion key transcription factors
CN113151425B (en) * 2021-04-08 2023-01-06 中国计量科学研究院 Single cell sequencing method for improving accuracy based on key indexes
CN113257365B (en) * 2021-05-26 2022-07-12 南开大学 Clustering method and system for non-standardized single-cell transcriptome sequencing data
CN113257364B (en) * 2021-05-26 2022-07-12 南开大学 Single cell transcriptome sequencing data clustering method and system based on multi-objective evolution
CN113674800B (en) * 2021-08-25 2022-02-08 中国农业科学院蔬菜花卉研究所 Cell clustering method based on single cell transcriptome sequencing data
CN115527610B (en) * 2022-11-09 2023-11-24 上海交通大学 Cluster analysis method for single-cell histology data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102952854A (en) * 2011-08-25 2013-03-06 深圳华大基因科技有限公司 Single cell sorting and screening method and device thereof
CN106341258A (en) * 2016-08-23 2017-01-18 浙江工业大学 Method of predicting unknown connecting sides of network based on second-order local community and seed node structure information
CN109979538A (en) * 2019-03-28 2019-07-05 广州基迪奥生物科技有限公司 A kind of analysis method based on the unicellular transcript profile sequencing data of 10X
CN110222745A (en) * 2019-05-24 2019-09-10 中南大学 A kind of cell type identification method based on similarity-based learning and its enhancing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3268870A4 (en) * 2015-03-11 2018-12-05 Ayasdi, Inc. Systems and methods for predicting outcomes using a prediction learning model
US10347365B2 (en) * 2017-02-08 2019-07-09 10X Genomics, Inc. Systems and methods for visualizing a pattern in a dataset

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
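A gene rank based approach for single cell similarity assessment and clustering;Yunpei Xu; Hong-Dong Li; Yi Pan; Feng Luo; Fang-Xiang Wu; Jianxi;《IEEE》;20190729;entire document *
Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning;Bo Wang, Junjie Zhu, Emma Pierson, Daniele Ramazzotti & Serafim;《Nature methods》;20170306;entire document *
Research and Application of Gene Expression Profile Data Analysis Methods;Li Zhengjun (李正军);《China Master's Theses Full-text Database, Basic Sciences》;20170315;entire document *
Research and Application of Clustering Algorithms for Large-Scale Data;Jin Ran (金冉);《China Doctoral Dissertations Full-text Database, Information Science and Technology》;20151115;entire document *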

Also Published As

Publication number Publication date
CN110827921A (en) 2020-02-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant