CN110827921B

CN110827921B - Single cell clustering method and device, electronic equipment and storage medium

Info

Publication number: CN110827921B
Application number: CN201911097854.5A
Authority: CN
Inventors: 朱晓姝; 彭小清; 李洪东; 王建新; 郭立渌; 李剑
Original assignee: Central South University; Yulin Normal University
Current assignee: Central South University; Yulin Normal University
Priority date: 2019-11-12
Filing date: 2019-11-12
Publication date: 2022-06-14
Anticipated expiration: 2039-11-12
Also published as: CN110827921A

Abstract

The invention discloses a single cell clustering method, a single cell clustering device, electronic equipment and a storage medium. First, the local similarity between node pairs is calculated from distance information, and a global feature space is constructed on that basis. Next, the global similarity between node pairs is calculated from the global feature space using a multi-kernel learning method. Then the nodes on all second-order paths of the node pair under consideration are included, which adds information from more related nodes and yields a more effective global similarity calculation method. Finally, the nodes are sorted by degree to determine the order in which they are initially added to communities, which improves the Louvain community detection method, and clustering is performed with the improved method. The method is simple and effective; tests on commonly used single-cell transcriptome sequencing data sets show that it has better prediction performance than other methods in clustering single-cell transcriptome sequencing data.

Description

Single cell clustering method and device, electronic equipment and storage medium

Technical Field

The invention relates to the field of bioinformatics, in particular to a single cell clustering method and a single cell clustering device, which are used for identifying cell types and providing a basis for analyzing a cell differentiation process.

Background

Traditional bulk cell sequencing methods are difficult to apply in research that must take the characteristics of individual cells into account. Single-cell transcriptome sequencing data (scRNA-seq data) are better suited to studying differences between cells and to identifying cell types when studying cell differentiation. With the rapid development of scRNA-seq technology, scRNA-seq data can more accurately reflect the gene expression of each cell and reduce the influence of differences among cells on the study of gene expression control, cell behavior and cell types. However, scRNA-seq data are characterized by high dimensionality, small sample sizes, and a lack of prior knowledge.

At present, methods for identifying cell types from scRNA-seq data (single cell clustering) fall into two main categories: supervised learning and unsupervised learning. The performance of supervised cell classification depends strongly on the regularization strategy and on prior knowledge, so for clustering currently unknown cells it faces the bottleneck of insufficient prior knowledge. Unsupervised cell clustering needs no prior knowledge and can estimate the number of categories automatically, but further research is needed on category number estimation, clustering accuracy and robustness. Therefore, there is a need to design a method and apparatus for identifying cell types (single cell clustering) with improved accuracy and robustness.

Disclosure of Invention

The technical problem solved by the invention is as follows: in view of the defects of the prior art, the invention provides a single cell clustering method and a single cell clustering device that can automatically estimate the number of clusters without prior knowledge, make full use of global similarity information, and improve clustering performance; the method is simple, effective, and easy to implement.

The technical scheme provided by the invention is as follows:

a single cell clustering method, comprising the steps of:

Step 1, based on a gene expression matrix (i.e. the gene expression matrix of single-cell transcriptome sequencing data, in which each column is a single cell and each row holds the expression level of a gene; it can be downloaded from public databases), calculating the similarity between single cell pairs and constructing a global feature space matrix S_l; S_l is N×N, where N is the number of single cells; the element S_l(i, j) represents the similarity (local similarity) of single cell i and single cell j; each row of S_l represents the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;

Step 2, based on the matrix S_l, calculating the similarity between single cell pairs using a weighted Gaussian kernel function to construct a sparse global similarity matrix S_g;

Step 3, based on the matrix S_g, determining the nearest neighbors of each single cell; calculating the similarity between each single cell pair from the common nearest neighbors of the pair and the sum of the distances between those common nearest neighbors and the two single cells of the pair, and constructing a path-based global similarity matrix S_p;

Step 4, first constructing a graph G = (V, E, W) based on the matrix S_p; in graph G, V = {v₁, v₂, …, v_N} is the node set, each node being a single cell; E is the edge set; W is the set of edge weights, where w_ij ∈ W represents the weight of the edge between node v_i and node v_j, and w_ij = S_p(i, j); the similarity matrix S_p is both the similarity matrix of the nodes in graph G and the weight matrix of the edges in graph G; then clustering the nodes in graph G to complete the single cell clustering;

in the above steps, X(i, j) represents the element in the i-th row and j-th column of a matrix X, with i, j = 1, 2, …, N; for example, S_l(i, j) represents the element in the i-th row and j-th column of matrix S_l, and S_p(i, j) represents the element in the i-th row and j-th column of matrix S_p.

A single cell clustering device comprises a data acquisition module, a similarity matrix construction module and a clustering module:

the data acquisition module is used for acquiring gene expression data;

the similarity matrix construction module is used for constructing a path-based global similarity matrix S_p, and its implementation process is as follows:

Step 1, based on the gene expression matrix, calculating the similarity between single cell pairs and constructing a global feature space matrix S_l; S_l is N×N, where N is the number of single cells; the element S_l(i, j) represents the similarity (local similarity) of single cell i and single cell j; each row of S_l represents the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;

Step 3, based on the matrix S_g, determining the nearest neighbors of each single cell; calculating the similarity between each single cell pair from the common nearest neighbors of the pair and the sum of the distances between those common nearest neighbors and the two single cells of the pair, and constructing a path-based global similarity matrix S_p;

the clustering module is used for realizing single cell clustering, and its implementation process is as follows:

Step 4, first constructing a graph G = (V, E, W) based on the matrix S_p; in graph G, V = {v₁, v₂, …, v_N} is the node set, each node being a single cell; E is the edge set; W is the set of edge weights, where w_ij ∈ W represents the weight of the edge between node v_i and node v_j, and w_ij = S_p(i, j); the similarity matrix S_p is both the similarity matrix of the nodes in graph G and the weight matrix of the edges in graph G; then clustering the nodes in graph G using the Louvain community detection method, which completes the single cell clustering;

in the above steps, X(i, j) represents the element in the i-th row and j-th column of a matrix X, with i, j = 1, 2, …, N; for example, S_l(i, j) represents the element in the i-th row and j-th column of matrix S_l, and S_p(i, j) represents the element in the i-th row and j-th column of matrix S_p.

An electronic device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to implement the method of single-cell clustering as described above.

A storage medium having stored thereon a computer program which, when executed by a processor, implements the single-cell clustering method described above.

Advantageous effects:

The invention is based on the assumption that global information is more helpful than local information for calculating similarity accurately, and it fuses information from more associated nodes with the global information of the graph. First, the local distances between nodes are calculated to construct a global feature space. Then, using a multi-kernel similarity calculation method, the K nearest neighbors (KNN) are selected to remove weakly correlated nodes, the global similarity between nodes is calculated from the global feature space, and the data are denoised, giving a sparse global similarity matrix. Next, a path-based similarity calculation method extends the calculation to all nodes on paths of length two, giving a more effective global similarity calculation method. Finally, clustering is performed with an improved Louvain community detection method (the order in which nodes are initially added to communities is determined by sorting the nodes by degree). The invention makes full use of global information and improves clustering performance, and the clustering method can effectively find cell subgroups. The invention is simple and effective; tests on public data sets show that it has better prediction performance than other methods in clustering single-cell transcriptome sequencing data.

Drawings

FIG. 1 is a flowchart of an embodiment of the present invention (Multi-kernel and path-based global similarity_Louvain, MPGS_Louvain).

FIG. 2 shows clustering result heat maps on the Deng data set, comparing the clustering performance of the embodiment of the present invention for different path lengths. FIG. 2(a) is the heat map for a path length of 2, and FIG. 2(b) is the heat map for a path length of 3.

FIG. 3 shows clustering result heat maps on the Pollen data set, comparing the clustering performance of the embodiment of the present invention for different path lengths. FIG. 3(a) is the heat map for a path length of 2, and FIG. 3(b) is the heat map for a path length of 3.

FIG. 4 shows clustering result heat maps on the Goolam data set, comparing the clustering performance of the embodiment of the present invention for different path lengths. FIG. 4(a) is the heat map for a path length of 2, and FIG. 4(b) is the heat map for a path length of 3.

FIG. 5 shows similarity result heat maps on the Deng data set, comparing different similarity calculation methods with the embodiment of the present invention. FIG. 5(a) is Euclidean distance, FIG. 5(b) is Spearman correlation coefficient, FIG. 5(c) is SIMLR, and FIG. 5(d) is MPGS.

FIG. 6 shows similarity result heat maps on the Pollen data set, comparing different similarity calculation methods with the embodiment of the present invention. FIG. 6(a) is Euclidean distance, FIG. 6(b) is Spearman correlation coefficient, FIG. 6(c) is SIMLR, and FIG. 6(d) is MPGS.

FIG. 7 shows similarity result heat maps on the Goolam data set, comparing different similarity calculation methods with the embodiment of the present invention. FIG. 7(a) is Euclidean distance, FIG. 7(b) is Spearman correlation coefficient, FIG. 7(c) is SIMLR, and FIG. 7(d) is MPGS.

FIG. 8 compares different scRNA-seq data clustering methods with the embodiment of the present invention on the NMI and ARI evaluation indexes. FIG. 8(a) is a box plot of the NMI values of the clustering results of the different clustering methods, and FIG. 8(b) is a box plot of the ARI values of the clustering results of the different clustering methods.

FIG. 9 compares the clustering performance of the embodiment of the present invention under artificial perturbation on the NMI and ARI evaluation indexes. FIG. 9(a) is the NMI comparison of the effect of artificial perturbation on the clustering results, and FIG. 9(b) is the ARI comparison of the effect of artificial perturbation on the clustering results.

Detailed Description

The present invention will be further described in detail with reference to the drawings and specific examples.

Example 1:

the embodiment provides a single cell clustering method, which comprises the following steps:

Step 1, based on a gene expression matrix (i.e. the gene expression matrix of single-cell transcriptome sequencing data, in which each column is a single cell and each row holds the expression level of a gene; it can be downloaded from public databases), calculating the similarity between single cell pairs and constructing a global feature space matrix S_l; S_l is N×N, where N is the number of single cells; the element S_l(i, j) represents the similarity (local similarity) of single cell i and single cell j; each row of S_l represents the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;

Step 4, first constructing a graph G = (V, E, W) based on the matrix S_p; in graph G, V = {v₁, v₂, …, v_N} is the node set, each node being a single cell; E is the edge set; W is the set of edge weights, where w_ij ∈ W represents the weight of the edge between node v_i and node v_j, and w_ij = S_p(i, j); the similarity matrix S_p is both the similarity matrix of the nodes in graph G and the weight matrix of the edges in graph G; then clustering the nodes in graph G to complete the single cell clustering;

The graph G in step 4 can also be obtained as follows: first, taking the gene expression matrix of the scRNA-seq data as input, with cells as nodes and gene expression data as node attributes, connect every pair of nodes by an edge to construct a complete graph; then calculate the global similarity matrix S_p according to steps 1 to 3; then update the weights of the corresponding edges in the complete graph with the element values of S_p; finally, remove the edges with weight 0 to obtain the graph G of step 4.
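The graph construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_graph`, the adjacency-dictionary representation, and the toy matrix values are assumptions, not part of the original disclosure, and the diagonal of S_p is ignored here (no self-loops), which is also an assumption.

```python
def build_graph(S_p):
    """Return an adjacency dict {i: {j: weight}} with an edge wherever S_p(i, j) != 0."""
    n = len(S_p)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; the diagonal is skipped (assumption)
            w = S_p[i][j]
            if w != 0:  # edges whose weight is 0 are removed from the complete graph
                graph[i][j] = w
                graph[j][i] = w
    return graph


# Toy 3-cell similarity matrix (illustrative values only).
S_p = [
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.3],
    [0.0, 0.3, 0.0],
]
graph = build_graph(S_p)
```

With this representation the degree of node i used later in the ordered Louvain step is simply `len(graph[i])`.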

Example 2:

On the basis of embodiment 1, in step 1 of the single cell clustering method of this embodiment, the Spearman correlation coefficient between the column vectors of the gene expression matrix corresponding to two single cells is calculated as their similarity; the Spearman correlation coefficient Rs(i, j) of the column vectors corresponding to single cell i and single cell j is calculated as follows:

Step 1: in the gene expression matrix S, convert the elements S(m, i) and S(m, j) of the column vectors S(:, i) and S(:, j) corresponding to single cell i and single cell j into their ranks (positions in descending order) within the respective column vectors, denoted R[S(m, i)] and R[S(m, j)], where m = 1, 2, …, M and M is the number of genes; the gene expression matrix S has M rows and N columns, and N is the number of single cells;

Step 2: calculate the differences between the ranks of the corresponding elements S(m, i) and S(m, j) of the two column vectors S(:, i) and S(:, j), and sum them according to the following formula:

Step 3: finally, calculate the Spearman correlation coefficient Rs between S(:, i) and S(:, j) according to the following formula:

Finally, let S_l(i, j) = Rs(i, j).
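The rank-based calculation of S_l can be illustrated with the following sketch. The formulas themselves are not reproduced in this text, so the code assumes the standard Spearman rank-difference form, Rs = 1 - 6·Σ d² / (M(M² - 1)) with d the difference of ranks, which is exact only when a column has no tied values; ties are given their average rank here as a further assumption. The function names and the toy matrix are hypothetical.

```python
def descending_ranks(values):
    """Rank values so the largest gets rank 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda m: -values[m])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(col_i, col_j):
    """Spearman coefficient from summed squared rank differences (standard form, assumed)."""
    M = len(col_i)
    r_i, r_j = descending_ranks(col_i), descending_ranks(col_j)
    d = sum((a - b) ** 2 for a, b in zip(r_i, r_j))
    return 1 - 6 * d / (M * (M ** 2 - 1))


def local_similarity(S):
    """S has M gene rows and N cell columns; returns the N x N matrix S_l."""
    N = len(S[0])
    cols = [[row[c] for row in S] for c in range(N)]
    return [[spearman(cols[i], cols[j]) for j in range(N)] for i in range(N)]


# Toy expression matrix: 4 genes (rows) x 3 cells (columns), illustrative values only.
S = [
    [5.0, 4.0, 1.0],
    [3.0, 2.0, 2.0],
    [1.0, 1.0, 3.0],
    [0.0, 0.5, 4.0],
]
S_l = local_similarity(S)
```

In the toy matrix, cells 0 and 1 order the genes identically while cell 2 orders them in reverse, so the sketch gives a coefficient near 1 for the first pair and near -1 for pairs involving cell 2.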

Example 3:

On the basis of embodiment 2, in the single cell clustering method of this embodiment, step 2 is specifically realized by the following steps:

Step 2.1, based on the global feature space matrix S_l, determining the K nearest neighbors (KNN) of each single cell according to the element values in each row of the matrix; for single cell i, in the i-th row S_l(i, :) of matrix S_l, the single cells corresponding to the K largest elements other than S_l(i, i) are its K nearest neighbors, and the set formed by these K nearest neighbors is denoted KNN(i); here S_l(i, i) denotes the i-th element of S_l(i, :), i.e. the element in the i-th row and i-th column of S_l; this operation filters out weakly correlated nodes on the basis of the global feature space matrix S_l;

Step 2.2, using a weighted Gaussian kernel function, calculating the weighted Gaussian kernel similarity D(i, j) between each single cell pair as their similarity S_g(i, j), thereby obtaining the similarity matrix S_g; calculating the similarity between single cell pairs with a weighted Gaussian kernel function filters data noise, and the matrix S_g reflects the global similarity between single cells and is a sparse global similarity matrix;

in this step, the mean distances μ_i and μ_j between single cell i and its K nearest neighbors and between single cell j and its K nearest neighbors are calculated respectively, and the width parameter of the Gaussian kernel function is then determined on the basis of the mean of μ_i and μ_j.

Using the weighted Gaussian kernel function, the weighted Gaussian kernel similarity D(i, j) between single cell i and single cell j is calculated as:

where ω_l denotes the weight of the l-th Gaussian kernel function K_l(s_i, s_j) and can be set empirically; ε_ij is the width parameter of the Gaussian kernel function and is determined by a parameter pair (σ, K); different values of the parameter pair give different width parameters and hence different Gaussian kernel functions; S(:, i) and S(:, j) denote the i-th and j-th columns of the gene expression matrix, respectively.

Example 4:

On the basis of embodiment 3, the single cell clustering method of this embodiment determines the value of ω_l in step 2.2 by the following steps:

first, ω_l, l = 1, 2, …, G, is initialized to

where G is the number of Gaussian kernel functions; then the optimal solution is obtained by iteratively solving the following objective function:

constraint conditions:

L^TL = I_C

where I_N is an N×N identity matrix, I_C is a C×C identity matrix, C is the number of categories, L is an N×C rank constraint matrix, and β, γ and ρ are non-negative empirical parameters;

‖·‖_F denotes the Frobenius norm (F-norm) of a matrix, and D is the matrix whose element in the i-th row and j-th column is D(i, j).

Example 5:

On the basis of embodiment 3, in step 2.2 of the single cell clustering method of this embodiment, the weighted Gaussian kernel similarity D(i, j) between each single cell pair is calculated with SIMLR (single-cell interpretation via multi-kernel learning, a single-cell analysis tool based on multi-kernel learning that can calculate the similarity between cells).

Example 6:

On the basis of embodiment 5, in the single cell clustering method of this embodiment, the parameter σ takes values in {1.0, 1.25, 1.5, 1.75, 2} and the parameter K takes values in {10, 12, 14, …, 30}, which gives 5 × 11 = 55 different parameter pairs, hence 55 different width parameters and 55 different Gaussian kernel functions.
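The count of 55 kernels follows directly from the two value ranges, as the short sketch below shows. The variable names are illustrative only.

```python
from itertools import product

sigmas = [1.0, 1.25, 1.5, 1.75, 2.0]  # 5 values of sigma
ks = list(range(10, 31, 2))           # K = 10, 12, ..., 30, i.e. 11 values
param_pairs = list(product(sigmas, ks))  # every (sigma, K) combination
```

Each pair in `param_pairs` defines one width parameter and therefore one Gaussian kernel.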

Example 7:

On the basis of embodiment 6, in the single cell clustering method of this embodiment, step 3 is specifically realized by the following steps:

Step 3.1, extending the path length and determining the second-order paths;

based on the sparse global similarity matrix S_g, the K nearest neighbors of single cell i and of single cell j are determined according to the element values in each row of the matrix, and the common nearest neighbors of single cell i and single cell j are determined; each path of length 2 that connects single cell i and single cell j through a common nearest neighbor as the intermediate node is a second-order path;

Step 3.2, constructing the path-based global similarity matrix S_p from the second-order paths;

the sum of the distances between the intermediate nodes on all second-order paths connecting single cell i and single cell j and the two single cells i and j, that is, the sum of the corresponding elements of the global similarity matrix S_g, is calculated by the following formula:

where S_g(i, k) represents the element in the i-th row and k-th column of matrix S_g, and S_g(j, k) represents the element in the j-th row and k-th column of matrix S_g.

For directly connected node pairs, extending the path length means extending the range of nodes that take part in the similarity calculation from the two nodes alone to the nodes on longer paths connecting them, so that the calculated similarity better reflects global information. FIGS. 2-4 compare the clustering results of the method on three commonly used data sets for path lengths of 2 and 3. The results show that performance is similar for path lengths of 2 and 3, so this embodiment uses a path length of 2, which lets the calculated similarity reflect global information while keeping the amount of computation manageable.
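Steps 3.1 and 3.2 can be sketched as follows. The formula for S_p is not reproduced in this text, so the code reads it from the prose as an assumption: S_p(i, j) is the sum of S_g(i, k) + S_g(j, k) over all common nearest neighbors k of cells i and j. Treating only positive entries of the sparse S_g as neighbor candidates, the function names, and the toy matrix are likewise assumptions.

```python
def knn_from_similarity(S_g, i, K):
    """Set of up to K neighbours of cell i with the largest positive S_g(i, j)."""
    others = [j for j in range(len(S_g)) if j != i and S_g[i][j] > 0]
    return set(sorted(others, key=lambda j: -S_g[i][j])[:K])


def path_similarity(S_g, K):
    """Path-based similarity from second-order paths i - k - j (assumed summation form)."""
    n = len(S_g)
    neighbours = [knn_from_similarity(S_g, i, K) for i in range(n)]
    S_p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = neighbours[i] & neighbours[j]  # intermediate nodes of second-order paths
            value = sum(S_g[i][k] + S_g[j][k] for k in shared)
            S_p[i][j] = S_p[j][i] = value
    return S_p


# Toy sparse similarity matrix for four cells (illustrative values only).
S_g = [
    [0.0, 0.9, 0.5, 0.0],
    [0.9, 0.0, 0.4, 0.0],
    [0.5, 0.4, 0.0, 0.2],
    [0.0, 0.0, 0.2, 0.0],
]
S_p = path_similarity(S_g, K=2)
```

In the toy matrix, cells 0 and 1 share the neighbor 2, so S_p(0, 1) = 0.5 + 0.4, while cells 2 and 3 have no common nearest neighbor and get S_p(2, 3) = 0.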

Example 8:

On the basis of embodiment 7, in the single cell clustering method of this embodiment, step 4 is specifically realized by the following steps:

first, the edges in graph G are determined: if S_p(i, j) ≠ 0, there is an edge between node v_i and node v_j with weight w_ij = S_p(i, j); otherwise there is no edge between node v_i and node v_j;

then, the nodes in graph G are clustered using the Louvain community detection method, with the following steps:

Step 4.1, calculating the degree of each node in graph G (the degree of a node is the number of edges connected to it), sorting the nodes by degree, and denoting the sorted nodes in order as v′₁, v′₂, …, v′_N;

Step 4.2, iteratively calculating the clustering result according to the modularity maximization principle; the specific steps are:

1) initializing each node in graph G as its own community; calculating the current modularity function value and storing it in Q₁;

2) letting Q₂ = Q₁;

3) letting i = 1;

4) tentatively merging the community of v′_i with the community of each of its neighbor nodes in turn (the nodes connected to v′_i by an edge are its neighbor nodes), and calculating, for each such merge, the change ΔQ of the modularity function value relative to Q₂; if at least one merge has ΔQ > 0, selecting the merge with the largest ΔQ and merging the community of v′_i with the community of the corresponding neighbor node; otherwise, not merging;

5) judging whether i equals N; if so, calculating the current modularity function value, storing it in Q₁, and going to step 6); otherwise, letting i = i + 1 and returning to step 4);

6) judging whether Q₁ > Q₂; if so, returning to step 2) to iterate; otherwise the modularity function value can no longer increase and has reached its maximum, so the iteration ends and the current community partition is output, which gives the subgroups obtained by clustering the single cells;

wherein the modularity function is:

Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{K_i K_j}{2m} \right] \delta(c_i, c_j)

in the formula, A_ij denotes the weight of the edge connecting node v′_i and node v′_j, and A_ij = 0 if there is no edge between node v′_i and node v′_j; K_i and K_j denote the sums of the weights of the edges connected to node v′_i and node v′_j, respectively; c_i and c_j denote the communities to which node v′_i and node v′_j belong; δ(c_i, c_j) indicates whether c_i and c_j are the same community: δ(c_i, c_j) = 1 if c_i and c_j are the same community, and δ(c_i, c_j) = 0 otherwise; m is the sum of the weights of all edges in graph G.

According to the degree centrality principle, nodes with higher degree are more important. This step therefore calculates the degree of each node, sorts the nodes by degree, and then applies an ordered Louvain community detection method under the modularity maximization principle (that is, during the iterative calculation, nodes are selected in the sorted order instead of at random, which is an improved Louvain community detection method) to find natural partitions and realize the clustering, which can increase the clustering speed.
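The ordered procedure of steps 4.1 and 4.2 can be sketched as below. This is a simplified illustration rather than a full Louvain implementation: it uses the standard weighted modularity Q = (1/2m) Σ [A_ij - K_i K_j / (2m)] δ(c_i, c_j), recomputes Q from scratch for every tentative merge, sorts by degree in descending order, measures ΔQ against the current partition, and omits the community aggregation phase of the usual Louvain method. All of these choices, the function names and the toy graph are assumptions.

```python
def modularity(graph, community):
    """Weighted modularity of a partition; graph is {node: {neighbour: weight}}."""
    two_m = sum(sum(nbrs.values()) for nbrs in graph.values())  # each edge counted twice
    if two_m == 0:
        return 0.0
    strength = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    q = 0.0
    for i in graph:
        for j in graph:
            if community[i] == community[j]:
                q += graph[i].get(j, 0.0) - strength[i] * strength[j] / two_m
    return q / two_m


def ordered_louvain(graph):
    # Step 4.1: visit nodes in order of degree, highest first (descending order assumed).
    order = sorted(graph, key=lambda v: -len(graph[v]))
    community = {v: v for v in graph}  # 1) every node starts as its own community
    q1 = modularity(graph, community)
    while True:
        q2 = q1                                    # 2)
        for v in order:                            # 3) to 5)
            base = modularity(graph, community)
            best_gain, best_comm = 0.0, None
            for u in graph[v]:
                if community[u] == community[v]:
                    continue
                # Tentatively merge v's whole community into the neighbour's community.
                trial = dict(community)
                for node, c in community.items():
                    if c == community[v]:
                        trial[node] = community[u]
                gain = modularity(graph, trial) - base
                if gain > best_gain:
                    best_gain, best_comm = gain, community[u]
            if best_comm is not None:              # keep only the best positive merge
                old = community[v]
                for node in community:
                    if community[node] == old:
                        community[node] = best_comm
        q1 = modularity(graph, community)
        if not q1 > q2:                            # 6) stop when Q no longer increases
            return community


# Toy graph: two triangles joined by a single edge, all weights 1 (illustrative only).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
graph = {v: {} for v in range(6)}
for a, b in edges:
    graph[a][b] = graph[b][a] = 1.0
communities = ordered_louvain(graph)
```

On the toy graph the sketch separates the two triangles into two communities. Recomputing the modularity for every tentative merge is costly and is done here only for readability; practical implementations update ΔQ incrementally.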

The specific implementation process of the single cell clustering method in this embodiment is shown in fig. 1.

Example 9:

this embodiment provides a single cell clustering device comprising a data acquisition module, a similarity matrix construction module and a clustering module:

the data acquisition module is used for acquiring gene expression data;

Step 1, based on a gene expression matrix, calculating the similarity between single cell pairs and constructing a global feature space matrix S_l; S_l is N×N, where N is the number of single cells; the element S_l(i, j) represents the similarity (local similarity) of single cell i and single cell j; each row of S_l represents the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;

Step 4, first constructing a graph G = (V, E, W) based on the matrix S_p; in graph G, V = {v₁, v₂, …, v_N} is the node set, each node being a single cell; E is the edge set; W is the set of edge weights, where w_ij ∈ W represents the weight of the edge between node v_i and node v_j, and w_ij = S_p(i, j); the similarity matrix S_p is both the similarity matrix of the nodes in graph G and the weight matrix of the edges in graph G; then clustering the nodes in graph G using the Louvain community detection method, which completes the single cell clustering;

The above steps can be implemented by the method given in any one of the above embodiments.

Example 10:

the present embodiment provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to implement the single cell clustering method described in any one of the above embodiments.

Example 11:

the present embodiment provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the single-cell clustering method described in any one of the above embodiments.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application.

Experimental verification

1. Evaluation index

The clusters predicted by the embodiment of the invention (MPGS_Louvain) cannot be expected to coincide exactly with the labeled clusters. NMI (normalized mutual information) and ARI (adjusted Rand index) are common clustering performance evaluation indexes that measure the consistency between predicted clustering results and known labeled clustering results. NMI and ARI are therefore suitable for evaluating the predictive clustering performance of MPGS.

In addition to NMI and ARI, heat maps can visually reflect cell clustering results and the related highly expressed genes, and are useful for evaluating the performance of similarity calculations. This experiment therefore also uses heat maps to represent and evaluate the similarity calculation performance of the MPGS method.
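For reference, both indexes can be computed from the contingency counts of the two labelings, as in the sketch below. The text does not state which NMI normalization was used, so normalizing by the arithmetic mean of the two entropies is an assumption; the ARI follows the usual pair-counting definition. Both functions are undefined when both labelings consist of a single cluster. The function names and the toy labels are illustrative.

```python
import math
from collections import Counter


def pairs(x):
    """Number of unordered pairs that can be drawn from x items."""
    return x * (x - 1) / 2


def ari(true, pred):
    """Adjusted Rand index from the contingency counts of two labelings."""
    n = len(true)
    sum_ij = sum(pairs(c) for c in Counter(zip(true, pred)).values())
    sum_a = sum(pairs(c) for c in Counter(true).values())
    sum_b = sum(pairs(c) for c in Counter(pred).values())
    expected = sum_a * sum_b / pairs(n)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)


def entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts.values())


def nmi(true, pred):
    """Mutual information normalized by the mean of the two entropies (assumed convention)."""
    n = len(true)
    joint, pa, pb = Counter(zip(true, pred)), Counter(true), Counter(pred)
    mi = sum(c / n * math.log(c * n / (pa[a] * pb[b])) for (a, b), c in joint.items())
    return 2 * mi / (entropy(pa, n) + entropy(pb, n))


# Toy labels: the prediction matches the reference up to a renaming of clusters.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]
partial_labels = [0, 0, 0, 1, 1, 1]  # a coarser, partly wrong clustering
```

Both indexes ignore the cluster names themselves, so `pred_labels` scores as a perfect match, while `partial_labels` scores strictly between 0 and 1.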

2. Comparison with other similarity calculation methods

To evaluate the effectiveness of the method proposed in this embodiment, the similarity calculation method of the embodiment (MPGS) was compared with three other similarity calculation methods (Euclidean distance, Spearman correlation coefficient, and SIMLR). Euclidean distance and Spearman correlation are classical similarity calculation methods, and SIMLR is a more recent, high-performing similarity calculation method. The heat maps of the similarity calculation results are shown in FIGS. 5-7, where (a) is Euclidean distance, (b) is Spearman correlation coefficient, (c) is SIMLR, and (d) is MPGS. In the heat maps, white indicates a high similarity value and black indicates a low similarity value. As can be seen from FIGS. 5-7, using the two classical similarity calculations, Euclidean distance and Spearman correlation coefficient, directly gives heat maps in which the block structure is not evident. With SIMLR and MPGS the heat maps show a more obvious block structure, and the block structure of the MPGS heat maps is clearer and free of outliers. Therefore, the MPGS similarity calculation is superior to the other similarity calculation methods compared.

3. Comparison with other clustering methods

To evaluate the effectiveness of the clustering method proposed by the present invention, the embodiment of the present invention (MPGS_Louvain) was compared with five other clustering methods (NMF, SNN-cliq, SIMLR, SSE, SSNN-Louvain). The clustering results are shown in FIG. 8, where FIG. 8(a) is a box plot of the NMI values of the clustering results of the different clustering methods, and FIG. 8(b) is a box plot of the ARI values of the clustering results of the different clustering methods. As can be seen from FIG. 8, on both the NMI and ARI evaluation indexes the MPGS method has the highest maximum, median and quartile values, and has no outliers. Therefore, the invention is superior to the other clustering methods compared.

4. Effect of Artificial perturbation on the Performance of the invention

To analyze and evaluate the influence of artificial perturbation on the clustering performance of the embodiment of the invention, and thereby verify its robustness, the experiment was run 100 times, each time on 95% of the data selected with a random seed. The mean and standard deviation of the NMI and ARI evaluation indexes of the clustering results were calculated, as shown in FIG. 9, where FIG. 9(a) is the NMI comparison of the effect of artificial perturbation on the clustering results and FIG. 9(b) is the ARI comparison of the effect of artificial perturbation on the clustering results. As can be seen from FIG. 9, the mean over the 100 runs is very close to the result of MPGS, with standard deviations below 5%. Therefore, the method is insensitive to artificial perturbation and has good robustness.
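The perturbation protocol (100 runs, each on a random 95% of the cells, followed by the mean and standard deviation of the scores) can be sketched as follows. The function `perturbation_scores` and its parameters are hypothetical; `toy_cluster` and `pair_agreement` are simple stand-ins used only to make the sketch runnable and are not the clustering method or the evaluation indexes of the experiment.

```python
import random
import statistics


def perturbation_scores(cells, labels, cluster_fn, score_fn, runs=100, fraction=0.95):
    """Cluster random subsamples and return the mean and standard deviation of the score."""
    scores = []
    size = round(fraction * len(cells))
    for seed in range(runs):
        rng = random.Random(seed)  # one random seed per run
        keep = sorted(rng.sample(range(len(cells)), size))
        predicted = cluster_fn([cells[i] for i in keep])
        scores.append(score_fn([labels[i] for i in keep], predicted))
    return statistics.mean(scores), statistics.stdev(scores)


# Stand-ins for illustration only: a threshold "clustering" and a pair-agreement score.
def toy_cluster(values):
    return [0 if v < 5 else 1 for v in values]


def pair_agreement(true, pred):
    """Fraction of cell pairs on which two labelings agree (same cluster vs different)."""
    n = len(true)
    agree = sum(
        (true[i] == true[j]) == (pred[i] == pred[j])
        for i in range(n) for j in range(i + 1, n)
    )
    return agree / (n * (n - 1) / 2)


cells = [float(v) for v in range(10)] * 4  # 40 one-dimensional toy "cells"
labels = [0 if v < 5 else 1 for v in cells]
mean_score, std_score = perturbation_scores(cells, labels, toy_cluster, pair_agreement)
```

In practice `cluster_fn` would be the full clustering pipeline and `score_fn` would be NMI or ARI; a small standard deviation across runs is what indicates robustness to the perturbation.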

Claims

1. A single cell clustering method, characterized by comprising the following steps:

step 1, calculating the similarity between single cell pairs based on a gene expression matrix, and constructing a global feature space matrix S_l; S_l is of size N×N, where N is the number of single cells; the element S_l(i, j) represents the similarity between single cell i and single cell j; each row of S_l represents the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;

step 2, based on the matrix S_l, calculating the similarity between single cell pairs using weighted Gaussian kernel functions, and constructing a sparse global similarity matrix S_g;

step 4, first constructing a graph G = (V, E, W) based on the matrix S_p, where V = {v_1, v_2, …, v_N} is the node set and each node is a single cell; E is the edge set; W is the set of edge weights, and w_ij ∈ W represents the weight of the edge between node v_i and node v_j, with w_ij = S_p(i, j); then clustering the nodes in the graph G to complete the single cell clustering;

in the above steps, X(i, j) represents the element in the i-th row and j-th column of a matrix X, where i, j = 1, 2, …, N; for example, S_l(i, j) represents the element in the i-th row and j-th column of the matrix S_l, and S_p(i, j) represents the element in the i-th row and j-th column of the matrix S_p.

2. The single cell clustering method according to claim 1, wherein in step 1, based on the gene expression matrix, the Spearman correlation coefficient between the column vectors corresponding to two single cells is calculated as their similarity; that is, the Spearman correlation coefficient Rs(i, j) of the column vectors of single cell i and single cell j in the gene expression matrix is calculated and used as S_l(i, j), i, j = 1, 2, …, N.
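The construction of S_l from Spearman correlation can be sketched as below. This is a minimal standard-library illustration under stated assumptions: the expression matrix is a list of rows with genes as rows and single cells as columns, ties receive averaged ranks, and a constant column is given correlation 0; the function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for t in range(pos, end + 1):
            ranks[order[t]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = (vx * vy) ** 0.5
    return cov / denom if denom > 0 else 0.0  # assumption for constant columns


def spearman_matrix(expr):
    """N x N matrix of Spearman correlations between the columns (cells) of expr."""
    n_cells = len(expr[0])
    ranked = [average_ranks([row[j] for row in expr]) for j in range(n_cells)]
    return [[pearson(ranked[i], ranked[j]) for j in range(n_cells)]
            for i in range(n_cells)]
```

Spearman correlation is the Pearson correlation of the ranks, so two cells whose genes are ordered identically by expression level obtain similarity 1 even if the absolute expression values differ.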

3. The method for single cell clustering according to claim 1, wherein step 2 comprises the following steps:

step 2.1, based on the global feature space matrix S_l, determining the K nearest neighbors of each single cell according to the element values in each row of the matrix; for a single cell i, the single cells corresponding to the K largest elements in the i-th row S_l(i, :) of the matrix S_l, excluding S_l(i, i), are its K nearest neighbors, and the set formed by these K nearest neighbors is denoted KNN(i);

step 2.2, calculating the weighted Gaussian kernel similarity between each single cell pair using the weighted Gaussian kernel function, and taking it as the value of the corresponding element of S_g;

wherein ω_l represents the weight of the l-th Gaussian kernel function K_l(s_i, s_j); ε_ij is the width parameter of the Gaussian kernel function and is determined by a parameter pair (σ, K); different parameter pair values determine different width parameters and thus yield different Gaussian kernel functions; S(:, i) and S(:, j) represent the i-th and j-th columns of the gene expression matrix, respectively;

let S_g(i, j) = D(i, j).

4. The method for single cell clustering according to claim 3, wherein in step 2.2, ω_l is first initialized to

constraint conditions are as follows:

L^TL＝I_C

the Frobenius norm of the matrix is denoted for short as the F-norm.

5. The single cell clustering method according to claim 3, wherein the value range of the parameter σ is set to {1.0, 1.25, 1.5, 1.75, 2} and the value range of the parameter K is set to {10, 12, 14, …, 30}, giving 55 different parameter pairs, 55 different width parameters, and thus 55 different Gaussian kernel functions.
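The count of 55 follows from 5 values of σ times 11 values of K. A short sketch of enumerating the parameter pairs (the variable names are illustrative only):

```python
from itertools import product

sigmas = [1.0, 1.25, 1.5, 1.75, 2.0]   # 5 candidate values of sigma
ks = list(range(10, 31, 2))            # 10, 12, ..., 30: 11 candidate values of K

# Every (sigma, K) combination; each pair would define one Gaussian kernel.
param_pairs = list(product(sigmas, ks))
```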

6. The method for single cell clustering according to claim 1, wherein step 3 comprises the following steps:

step 3.1, determining the common nearest neighbor of each single cell pair;

first, based on the sparse global similarity matrix S_g, determining the K nearest neighbors of single cell i and of single cell j according to the element values in the corresponding rows of the matrix; for single cell i, the single cells corresponding to the K largest elements in the i-th row S_g(i, :) of the matrix S_g, excluding S_g(i, i), are its K nearest neighbors, and the set formed by these K nearest neighbors is denoted KNN(i); for single cell j, the single cells corresponding to the K largest elements in the j-th row S_g(j, :) of the matrix S_g, excluding S_g(j, j), are its K nearest neighbors, and the set formed by these K nearest neighbors is denoted KNN(j); then determining the set of nearest neighbors shared by single cell i and single cell j, KNN(i) ∩ KNN(j);

step 3.2, constructing a path-based global similarity matrix S_p;

calculating the element S_p(i, j) in the i-th row and j-th column of S_p according to the following formula:

wherein S_g(i, k) represents the element in the i-th row and k-th column of the matrix S_g, and S_g(j, k) represents the element in the j-th row and k-th column of the matrix S_g.
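Step 3.1 can be sketched as follows. The sketch covers only the neighbor selection and the shared-neighbor set; it does not attempt the formula for S_p(i, j), which is not reproduced in this text. The names `knn_set` and `common_neighbors` are hypothetical, and ties between equal similarity values are broken by index order as an assumption.

```python
def knn_set(sim, i, k):
    """Indices of the k largest entries in row i of sim, excluding the diagonal entry."""
    others = [j for j in range(len(sim)) if j != i]
    # Stable sort: equal similarities keep ascending index order (assumption).
    others.sort(key=lambda j: sim[i][j], reverse=True)
    return set(others[:k])


def common_neighbors(sim, i, j, k):
    """Nearest neighbors shared by cells i and j: KNN(i) intersected with KNN(j)."""
    return knn_set(sim, i, k) & knn_set(sim, j, k)
```

For example, with a 4×4 similarity matrix in which cells 0, 1 and 2 are mutually similar and K = 2, cells 0 and 1 share cell 2 as their only common nearest neighbor.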

7. The method for clustering single cells according to claim 1, wherein in step 4, the edges in the graph G are determined as follows: if S_p(i, j) ≠ 0, there is an edge between node v_i and node v_j, and the weight of this edge is w_ij = S_p(i, j); otherwise, there is no edge between node v_i and node v_j;

in the step 4, clustering the nodes in the graph G by using a Louvain community detection method includes the following steps:

step 4.1, calculating the degree of each node in the graph G, sorting the nodes by degree, and denoting the sorted nodes in order as v'_1, v'_2, …, v'_N;

2) let Q_2 = Q_1;

3) let i = 1;

4) tentatively merging the community containing node v'_i with the community containing each of its neighbor nodes in turn, and calculating, for each merge, the increment ΔQ of the modularity function value relative to Q_2; if ΔQ > 0 for at least one merge, selecting the merge with the largest ΔQ and merging the community containing v'_i with the community containing the corresponding neighbor node; otherwise, not merging;

6) judging whether Q_1 > Q_2; if yes, returning to step 2) for another iteration; otherwise, the modularity function value can no longer be increased and has reached its maximum, so the iteration ends and the community division at this point is output as the single cell clustering result;

wherein the modularity function is:

in the formula, A_ij denotes the weight of the edge connecting node v'_i and node v'_j; if there is no edge between node v'_i and node v'_j, then A_ij = 0; K_i and K_j respectively denote the sum of the weights of the edges connected to node v'_i and to node v'_j; c_i and c_j respectively denote the communities in which node v'_i and node v'_j are located;

otherwise

Is the sum of the weights of all edges in graph G.
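The modularity formula itself is not reproduced in this text. The symbol definitions above match the standard weighted modularity used by the Louvain method, Q = (1/(2m)) Σ_{i,j} [A_ij − K_i·K_j/(2m)] δ(c_i, c_j), where δ equals 1 if c_i = c_j and 0 otherwise. Assuming that this is the intended function, it and the degree ordering of step 4.1 can be sketched as below. The function names are hypothetical; counting nonzero edges as the degree and sorting in descending order are assumptions, since the claim does not state the direction.

```python
def modularity(A, communities):
    """Standard weighted modularity of a partition; A is a symmetric weight matrix."""
    n = len(A)
    k = [sum(row) for row in A]   # K_i: total weight of edges at node i
    two_m = sum(k)                # 2m: every undirected edge is counted twice
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:   # delta(c_i, c_j) = 1
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m


def nodes_by_degree(A, descending=True):
    """Node indices sorted by number of incident edges (direction is an assumption)."""
    degree = [sum(1 for w in row if w != 0) for row in A]
    return sorted(range(len(A)), key=lambda i: degree[i], reverse=descending)
```

A merge in step 4) is accepted only when it raises the value returned by such a function, so the iteration stops once no single merge yields a positive ΔQ.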

8. A single cell clustering device, characterized by comprising a data acquisition module, a similarity matrix construction module and a clustering module:

the data acquisition module is used for acquiring gene expression data;

step 1, calculating the similarity between single cell pairs based on the gene expression data, and constructing a global feature space matrix S_l; S_l is of size N×N, where N is the number of single cells; the element S_l(i, j) represents the similarity between single cell i and single cell j; each row of S_l represents the similarity between one single cell and all other single cells, so S_l contains global similarity information and is a global feature space matrix;

step 2, based on the matrix S_l, calculating the similarity between single cell pairs using weighted Gaussian kernel functions, and constructing a sparse global similarity matrix S_g;

step 4, first constructing a graph G = (V, E, W) based on the matrix S_p, where V = {v_1, v_2, …, v_N} is the node set and each node is a single cell; E is the edge set; W is the set of edge weights, and w_ij ∈ W represents the weight of the edge between node v_i and node v_j, with w_ij = S_p(i, j); then clustering the nodes in the graph G using a Louvain community detection method to complete the single cell clustering;

9. An electronic device comprising a memory and a processor, the memory having stored therein a computer program, wherein the computer program, when executed by the processor, causes the processor to implement the method of any of claims 1-7.

10. A storage medium on which a computer program is stored, which computer program, when being executed by a processor, carries out the method according to any one of claims 1 to 7.