CN114708918A - Identification method and system for rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis - Google Patents

Info

Publication number
CN114708918A
Authority
CN
China
Prior art keywords
data set
data
node
cells
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210414444.4A
Other languages
Chinese (zh)
Inventor
高俊晓
李楠
殷鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202210414444.4A
Publication of CN114708918A

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/251: Fusion techniques of input or preprocessed data
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method and a system for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis. A public data set of human lung tissue is acquired and preprocessed; the preprocessed data sets are fused and batch effects are removed; the fused data set is clustered, the cell type of each group of cells is labeled, and macrophages are screened; and the rare macrophage subpopulations among the macrophages and their differentially expressed genes are identified, providing a reference for disease diagnosis.

Description

Identification method and system for rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a method and a system for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis.
Background
Idiopathic pulmonary fibrosis (IPF) is a chronic, progressive, irreversible interstitial lung disease characterized by scarring of the lungs. Its cause is currently unclear, no effective treatment regimen exists, and the disease is associated with high morbidity and mortality. Conventional pathological analysis of IPF relies on bulk tissue sequencing data, which provide only an average expression signal over all cells in the tissue and cannot resolve expression differences between the different cells of IPF lung tissue. The advent of single-cell technology offers higher resolution for the study of IPF: analysis of single-cell sequencing data makes it possible to further explore expression differences between cell types.
Mapping genotypes to phenotypes is one of the long-standing challenges in biology and medicine. One effective strategy for addressing it is transcriptome analysis. However, although all cells in the body share nearly the same genotype, the transcriptome of any one cell reflects only a portion of gene activity. Furthermore, because the many different cell types in the body each express a distinct transcriptome, conventional tissue sequencing can provide only an average expression signal over all cells in a tissue. There is also increasing evidence that gene expression differs even among cells of the same type. A more precise understanding of the transcriptome of individual cells is therefore crucial for elucidating its role in cell function and for understanding how gene expression promotes beneficial or deleterious states. Single-cell technology, introduced in 2009, effectively addresses this problem. By isolating single cells from tissue and sequencing and analyzing them individually, it raises the resolution of tissue and disease research and adds a new dimension to that research.
Single-cell technology is now widely used in IPF research, where the main goals are to identify new cell subtypes and to study the mechanisms of drug treatment. For example, Reyfman et al. found that rare cell populations, including respiratory stem cells and senescent cells, appear during fibrosis, and identified 2 distinct macrophage subsets in IPF samples. Xu et al. found a subpopulation of atypical alveolar epithelial cells expressing a respiratory-related gene in explant tissue from end-stage IPF patients undergoing transplantation. Adams et al. found alterations in IPF lung endothelial cells, including a population of vascular endothelial cells expressing COL15A1 in IPF lung tissue. Kwapiszewska et al. used single-cell technology to analyze lung homogenates and lung fibroblasts with and without pirfenidone administration and found that pirfenidone exerts a beneficial effect on multiple pathways in fibroblasts and other lung tissue cells. Sheu et al. analyzed IPF fibroblasts from patients taking nintedanib and found over 100 abnormally up-regulated genes, which are mainly involved in the cell cycle pathway and in the inhibition of fibroblast proliferation.
Many current single-cell studies focus primarily on identifying distinct cell subtypes in tissues, and in-depth research on any one particular cell type is rare.
Disclosure of Invention
The embodiments of the invention provide a method and a device for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis, which at least address the shortcoming that existing single-cell data analysis stops at identifying cell subtypes in a tissue sample and rarely studies a specific cell type in depth.
According to an embodiment of the present invention, a method for identifying rare macrophage subsets and disease markers in idiopathic pulmonary fibrosis is provided, comprising the following steps:
step S10: acquiring a public data set of human lung tissue, and preprocessing the data set;
step S20: performing data fusion on the preprocessed data set and removing batch effect;
step S30: clustering the fused data set, marking the cell type of each group of cells and screening macrophages;
step S40: identifying rare macrophage subtypes and differentially expressed genes thereof in the macrophages.
In some embodiments, in step S10, the step of acquiring a public data set of human lung tissue and preprocessing the data set specifically includes the following steps:
step S11: selecting 4 data sets from the GEO database, namely GSE136831, GSE135893, GSE128033 and GSE122960, in each of which human lung tissue was collected and single cells were isolated for sequencing;
step S12: performing data cleaning on each of the data sets, the data cleaning comprising removing genes that are not expressed in any cell, removing cells that express fewer than 200 genes, and removing cells in which mitochondria-related genes account for more than 25% of expression;
step S13: performing data preprocessing on the cleaned data sets, the preprocessing comprising log normalization, data scaling, removal of cell cycle signals, and selection of 8000 highly variable genes for each data set for subsequent data integration.
In some embodiments, in step S20, the step of performing data fusion on the preprocessed data set and removing the batch effect specifically includes the following steps:
step S21: using a canonical correlation analysis algorithm to find projections of the plurality of data sets that maximize the correlation between all data sets;
step S22: using a dynamic time warping algorithm to determine the optimal mapping between the data sets.
In some embodiments, in step S21, the step of using the canonical correlation analysis algorithm to find projections of the plurality of data sets that maximize the correlation between all data sets specifically includes the following steps:
step S211: performing singular value decomposition on the highly variable gene expression matrix of each data set to obtain an initial canonical correlation vector (CCV);
step S212: updating the canonical correlation vector W corresponding to each data set by performing singular value decomposition on that data set and the other data sets, until the difference ratio of W before and after updating is greater than a given threshold value.
In some embodiments, in step S22, the step of determining the optimal mapping between the data sets using the dynamic time warping algorithm specifically includes the following steps:
for single-cell data sets, computing a warping path between the data sets using the dynamic time warping algorithm so as to minimize the distance between the data sets, wherein W = (w1, w2, …, wk) is a warping matrix in which each vector corresponds to a point on the warping path that maps an element of data set X to data set Y and minimizes the distance between them; and mapping the canonical correlation vectors through the warping matrix to align the two data sets in a low-dimensional space;
for more than two data sets, taking the data set with the largest number of cells as a reference data set, aligning each of the other data sets with the reference data set, and finally normalizing the canonical correlation vectors of each data set to a common aligned space defined by the reference data set.
In some embodiments, in step S30, the step of clustering the fused data set, labeling the cell type of each group of cells and screening macrophages uses a clustering analysis algorithm based on community detection and specifically comprises the following steps:
step S31: taking the dimensionality-reduced data as input and the cells as nodes, and calculating the Euclidean distance between nodes with a K-nearest neighbor graph (KNNG) algorithm to determine the K nearest neighbors of each node;
step S32: calculating the neighborhood overlap between each node and its K nearest neighbor nodes using the following formula to construct a shared nearest neighbor matrix, wherein A and B represent the sets of neighbor nodes of the two nodes: $J(A, B) = |A \cap B| / (|A| + |B| - |A \cap B|)$;
step S33: using the Louvain algorithm, scanning each node and its neighbor nodes, calculating the modularity and measuring the modularity gain after the node joins a module, selecting the neighbor node with the largest gain for joining the module, and iterating repeatedly to finally form node clusters;
step S34: performing differential analysis using the Wilcoxon rank-sum test, determining cell types from the differentially upregulated genes of each group of cells, and screening macrophages.
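The rank-sum comparison named in step S34 can be illustrated with a small, self-contained sketch. It is a minimal illustration under stated assumptions, not the implementation used in the invention: the helper names `average_ranks` and `rank_sum_test` are hypothetical, the p-value uses the normal approximation without a tie correction, and the expression values are made up for the example.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n1])                          # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2                # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided p-value
    return z, p

# Made-up expression of one gene in a cluster versus all other cells.
in_cluster = [5, 6, 7, 8, 9]
other_cells = [0, 1, 2, 3, 4]
z_score, p_value = rank_sum_test(in_cluster, other_cells)
```

A positive `z_score` with a small `p_value` would mark the gene as upregulated in the cluster; in practice the test is repeated per gene and the p-values are corrected for multiple testing.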
In some embodiments, in step S33, the step of using the Louvain algorithm to calculate the modularity for each node, measure the modularity gain after the node joins a module, select the neighbor node with the largest gain, and iterate repeatedly to form node clusters specifically includes the following steps:
using the Louvain algorithm, scanning each node and its neighbor nodes and calculating the modularity with the following formula, wherein m is the sum of the weights of the edges in the graph, i and j denote two nodes, $A_{ij}$ denotes the weight of the edge between the two nodes (calculated from the SNN), $k_i$ and $k_j$ denote the sums of the weights of all edges of nodes i and j, $c_i$ and $c_j$ denote the groups to which nodes i and j belong, and $\delta$ is the Kronecker delta function:
$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$
to maximize the Q value, the Louvain algorithm calculates the modularity gain and iterates repeatedly, as shown in the following formula, wherein $\Sigma_{in}$ is the sum of the weights of the edges inside the community that node i is to be moved into, $\Sigma_{tot}$ is the sum of the weights of the edges attached to the nodes of that community, and $k_{i,in}$ is the sum of the weights of the edges between node i and the nodes of that community:
Figure BDA0003604930810000052
in some embodiments, in step S40, the step of identifying the rare macrophage subtypes and their differentially expressed genes in the macrophages comprises the steps of:
identifying 3 groups of rare macrophage subpopulations composed of cells from IPF samples, the 3 groups of cells being designated Cluster 0, Cluster 1 and Cluster 2 and all expressing the IPF markers SPP1, CCL2, FABP4 and CHIT1, wherein: SPP1 is the gene encoding osteopontin, significantly promotes the migration and proliferation of fibroblasts and epithelial cells, and can serve as an IPF disease marker; CCL2 is chemokine (C-C motif) ligand 2, recruits monocytes/macrophages, promotes fibrosis through various mechanisms involving inflammation, angiogenesis and myofibroblast accumulation, and can serve as an IPF disease marker; FABP4 encodes a fatty acid binding protein, a cytoplasmic fatty acid chaperone expressed in adipocytes and myeloid cells, which promotes the production of ATP that polarizes macrophages to the M1 type and participates in the development of IPF by promoting the activation of M1-type macrophages; and CHIT1 encodes chitotriosidase, has a profibrotic effect, and is significantly expressed in the lungs of IPF patients.
According to another embodiment of the present invention, there is provided an identification system for rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis, comprising:
a data set acquisition unit: acquiring a public data set of human lung tissue, and preprocessing the data set;
a preprocessing unit: performing data fusion on the preprocessed data set and removing batch effect;
screening unit: clustering the fused data set, marking the cell type of each group of cells and screening macrophages;
an identification unit: identifying rare macrophage subtypes and differentially expressed genes thereof in the macrophages.
A storage medium storing a program file capable of implementing the method for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis as set forth above.
A processor for executing a program, wherein the program when executed performs any of the methods for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis.
The method and system for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis provided by the invention acquire a public data set of human lung tissue, preprocess the data set, fuse the preprocessed data sets and remove batch effects, cluster the fused data set, label the cell type of each group of cells, screen macrophages, and identify the rare macrophage subpopulations among the macrophages and their differentially expressed genes. Whereas existing single-cell data analysis stops at identifying cell subtypes in a tissue sample, the invention goes on to analyze a specific cell type, so that rare macrophage subpopulations and, in turn, IPF disease markers can be identified, providing a reference for disease diagnosis.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flowchart of the steps of the method for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis provided in Example 1 of the present invention;
FIG. 2 is a flowchart of the steps of acquiring a public data set of human lung tissue and preprocessing the data set provided in Example 1 of the present invention;
FIG. 3 is a flowchart of the steps of performing data fusion on the preprocessed data set and removing batch effects provided in Example 1 of the present invention;
FIG. 4 is a flowchart of the steps of clustering the fused data set, labeling the cell type of each group of cells and screening macrophages provided in Example 1 of the present invention;
FIG. 5 is a heatmap of the differentially upregulated genes of Cluster 0, Cluster 1 and Cluster 2 provided in Example 1 of the present invention;
FIG. 6 shows the expression of the key markers SPP1, CCL2, FABP4 and CHIT1 in the 3 groups of macrophages provided in Example 1 of the present invention;
FIG. 7 is a structural diagram of the identification system for rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis provided in Example 2 of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Referring to fig. 1, according to an embodiment of the present invention, a method for identifying rare macrophage subsets and disease markers in idiopathic pulmonary fibrosis is provided, which includes the following steps S10-S40, and the implementation of each step is described in detail below.
Step S10: a public dataset of human lung tissue is acquired and preprocessed.
Referring to FIG. 2, which is a flowchart of the steps of acquiring a public data set of human lung tissue and preprocessing the data set in this embodiment, the procedure includes the following steps S11 to S13, each of which is described in detail below.
Step S11: 4 data sets, GSE136831, GSE135893, GSE128033 and GSE122960, were selected from the GEO database; in each of them, human lung tissue was collected and single cells were isolated for sequencing.
Table 1 below lists the 4 data sets used in this embodiment.
Table 1: The 4 data sets used in this embodiment
Figure BDA0003604930810000091
Step S12: data cleaning was performed on each of the above data sets, comprising removing genes that are not expressed in any cell, removing cells that express fewer than 200 genes, and removing cells in which mitochondria-related genes account for more than 25% of expression.
Step S13: data preprocessing was performed on the cleaned data sets, comprising log normalization, data scaling, removal of cell cycle signals, and selection of 8000 highly variable genes for each data set for subsequent data integration.
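The cleaning thresholds of step S12 and the log normalization and highly variable gene selection of step S13 can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions rather than the pipeline actually used: the function names are hypothetical, mitochondrial genes are assumed to be recognizable by the `MT-` name prefix, the scale factor of 10,000 is an assumed convention, ranking genes by plain variance is a simplification of dedicated highly-variable-gene methods, and data scaling and cell cycle removal are omitted. The toy matrix uses `min_genes=2` only so that the example stays small.

```python
import numpy as np

def qc_filter(counts, gene_names, min_genes=200, max_mito_frac=0.25,
              mito_prefix="MT-"):
    """Filter a cells x genes count matrix; mito_prefix is an assumed naming convention."""
    counts = np.asarray(counts, dtype=float)
    gene_names = np.asarray(gene_names)
    # Remove genes that are not expressed in any cell.
    expressed = counts.sum(axis=0) > 0
    counts, gene_names = counts[:, expressed], gene_names[expressed]
    # Genes detected per cell and fraction of counts from mitochondrial genes.
    genes_per_cell = (counts > 0).sum(axis=1)
    is_mito = np.char.startswith(gene_names, mito_prefix)
    totals = counts.sum(axis=1)
    mito_frac = counts[:, is_mito].sum(axis=1) / np.maximum(totals, 1.0)
    # Keep cells with at least min_genes genes and at most max_mito_frac mito share.
    keep = (genes_per_cell >= min_genes) & (mito_frac <= max_mito_frac)
    return counts[keep], gene_names, keep

def log_normalize(counts, scale=1e4):
    """Scale each cell to `scale` total counts, then apply log(1 + x)."""
    totals = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / totals * scale)

def top_variable_genes(norm, n_top):
    """Indices of the n_top genes with the largest variance (simplified selection)."""
    order = np.argsort(norm.var(axis=0))[::-1]
    return np.sort(order[:n_top])

# Toy data: 4 cells x 4 genes; GENE_C is never expressed.
toy_counts = np.array([[0, 5, 5, 0],
                       [8, 1, 1, 0],
                       [0, 3, 0, 0],
                       [1, 2, 2, 0]])
toy_genes = ["MT-CO1", "GENE_A", "GENE_B", "GENE_C"]
filtered, kept_genes, keep = qc_filter(toy_counts, toy_genes, min_genes=2)
normalized = log_normalize(filtered)
hvg_idx = top_variable_genes(normalized, n_top=2)
```

With the thresholds stated in the text, the call would use the defaults `min_genes=200` and `max_mito_frac=0.25`, and `n_top=8000` for the gene selection.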
Step S20: performing data fusion on the preprocessed data sets and removing batch effects.
It will be appreciated that after the highly variable gene expression data for each data set have been obtained in the above steps, all data sets are integrated into one data set for subsequent analysis. Because the data sets differ in non-biological factors, such as experimental platform, technology, personnel, date, reagents and sample selection, batch effects may be introduced when they are integrated.
Referring to FIG. 3, which is a flowchart of the steps of performing data fusion on the preprocessed data sets and removing batch effects in this embodiment, the procedure specifically includes the following steps S21 to S22, each of which is described in detail below.
Step S21: using a canonical correlation analysis algorithm to find projections of the plurality of data sets that maximize the correlation between all data sets.
It is understood that the core of the canonical correlation analysis (CCA) algorithm is to maximize the correlation between all data sets by finding projections of the multiple data sets. The specific formula is as follows, where n indexes the different data sets and w denotes the canonical correlation vector (CCV) of each data set; $X_i$, $X_j$ and $w_i$, $w_j$ denote two different data sets and their corresponding CCVs, respectively:
Figure BDA0003604930810000101
The canonical correlation vectors W = (w1, w2, …, wn) corresponding to the data sets are solved by an iterative algorithm.
Specifically, the step of using the canonical correlation analysis algorithm to find projections of the plurality of data sets that maximize the correlation between all data sets includes the following steps S211 to S212, each of which is described in detail below.
Step S211: performing singular value decomposition on the highly variable gene expression matrix of each data set to obtain an initial canonical correlation vector (CCV);
Step S212: updating the canonical correlation vector W corresponding to each data set by performing singular value decomposition on that data set and the other data sets, until the difference ratio of W before and after updating is greater than a given threshold value.
The specific formula is as follows, where n indexes the different data sets, w denotes the canonical correlation vector (CCV), Γ and Δ denote the projection vectors of the different data sets, Λ denotes the feature vector, and SVD denotes singular value decomposition of the data set X, the result of which is used to initialize w. When the difference ratio of w is less than the threshold 103, w is updated by performing singular value decomposition on the data set X and the other data sets k until the difference ratio is greater than the threshold:
initialize $w_n \leftarrow \Delta_n$, $[\Gamma_n, \Lambda_n, \Delta_n] \leftarrow \mathrm{SVD}(X_n)$
Figure BDA0003604930810000102
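To make the role of singular value decomposition in canonical correlation analysis concrete, the following sketch computes canonical correlations for just two matrices that share their rows. It is a simplified two-data-set illustration using the classical SVD-based formulation, not the iterative multi-set procedure of steps S211 and S212; the function name `cca_two_sets`, the tolerance, and the random test matrices are assumptions made for the example.

```python
import numpy as np

def cca_two_sets(X, Y, n_components=2, tol=1e-10):
    """Canonical correlations and projection weights for two matrices with shared rows."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Thin SVD of each centered matrix gives an orthonormal basis of its column space.
    Ux, sx, Vxt = np.linalg.svd(Xc, full_matrices=False)
    Uy, sy, Vyt = np.linalg.svd(Yc, full_matrices=False)
    # Drop numerically null directions so the bases are well defined.
    rx, ry = sx > tol * sx[0], sy > tol * sy[0]
    Ux, sx, Vxt = Ux[:, rx], sx[rx], Vxt[rx]
    Uy, sy, Vyt = Uy[:, ry], sy[ry], Vyt[ry]
    # Singular values of Ux^T Uy are the canonical correlations.
    A, corr, Bt = np.linalg.svd(Ux.T @ Uy, full_matrices=False)
    k = min(n_components, corr.size)
    # Map back to weights so that Xc @ wx and Yc @ wy are the canonical variates.
    wx = Vxt.T @ (A[:, :k] / sx[:, None])
    wy = Vyt.T @ (Bt.T[:, :k] / sy[:, None])
    return corr[:k], wx, wy

# Y is an invertible linear transform of X, so both span the same column space.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
Y_demo = X_demo @ rng.normal(size=(5, 5))
corr_demo, wx_demo, wy_demo = cca_two_sets(X_demo, Y_demo, n_components=3)
```

In the single-cell setting the shared rows would be the selected genes and the columns the cells of each data set; extending this to more than two data sets requires an iterative scheme such as the one described above.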
Step S22: the step of determining the optimal mapping between the data sets using the dynamic time warping (DTW) algorithm specifically includes the following steps:
for single-cell data sets, computing a warping path between the data sets using the dynamic time warping algorithm so as to minimize the distance between the data sets, wherein W = (w1, w2, …, wk) is a warping matrix in which each vector corresponds to a point on the warping path that maps an element of data set X to data set Y and minimizes the distance between them; and mapping the canonical correlation vectors through the warping matrix to align the two data sets in a low-dimensional space;
for more than two data sets, taking the data set with the largest number of cells as a reference data set, aligning each of the other data sets with the reference data set, and finally normalizing the canonical correlation vectors of each data set to a common aligned space defined by the reference data set, the common aligned space being defined as follows, wherein W is the warping matrix and $w_k$ denotes a warping vector:
$\mathrm{DTW}(X, Y) = \min_{W} \left[ \sum_{k} w_k \right]$.
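The warping path described above can be illustrated with the classic dynamic-programming form of dynamic time warping on two short one-dimensional sequences. This is a minimal sketch, assuming an absolute-difference local cost and the standard three-step recurrence; the function name `dtw` and the input sequences are hypothetical, and in the method described here the inputs would be canonical correlation vectors rather than toy lists.

```python
import math

def dtw(x, y):
    """Return the minimal cumulative cost and the warping path between x and y."""
    n, m = len(x), len(y)
    # cost[i][j]: minimal cumulative cost of aligning x[:i] with y[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],   # match
                                 cost[i - 1][j],       # step in x only
                                 cost[i][j - 1])       # step in y only
    # Backtrack from the end to recover the path of matched index pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

# The second sequence repeats one value, so a zero-cost warping exists.
distance, warping_path = dtw([1, 2, 3], [1, 2, 2, 3])
```

Each pair in `warping_path` maps an index of the first sequence to an index of the second; one element may be matched to several elements of the other sequence, which is what allows data sets of different sizes to be aligned.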
Step S30: clustering the fused data set, labeling the cell type of each group of cells, and screening macrophages.
Referring to FIG. 4, which is a flowchart of the steps of clustering the fused data set, labeling the cell type of each group of cells and screening macrophages in this embodiment, the procedure specifically includes the following steps S31 to S34, each of which is described in detail below.
Step S31: taking the dimensionality-reduced data as input and the cells as nodes, and calculating the Euclidean distance between nodes with the KNNG algorithm to determine the K nearest neighbors of each node.
In this embodiment, a single-cell graph is constructed with the KNNG (K-Nearest Neighbor Graph) algorithm: the dimensionality-reduced data are used as input, the cells are used as nodes, and the Euclidean distances between nodes are calculated to determine the K nearest neighbors (KNN; K defaults to 20) of each node.
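The neighbor search of step S31 can be sketched with a brute-force computation of all pairwise Euclidean distances. This is a minimal sketch suitable only for small inputs, assuming a hypothetical function name `knn_indices` and made-up one-dimensional points; large data sets normally rely on approximate nearest-neighbor search instead.

```python
import numpy as np

def knn_indices(points, k=20):
    """Indices of the k nearest neighbors (Euclidean) of every row, excluding itself."""
    points = np.asarray(points, dtype=float)
    sq = np.sum(points ** 2, axis=1)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)        # a node is not its own neighbor here
    return np.argsort(d2, axis=1)[:, :k]

# Four cells embedded in one reduced dimension; k is small only for illustration.
toy_points = np.array([[0.0], [1.0], [3.0], [10.0]])
toy_neighbors = knn_indices(toy_points, k=2)
```

Each row of `toy_neighbors` lists the neighbors of one cell in order of increasing distance; with the default stated in the text, `k` would be 20.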
Step S32: calculating the neighborhood overlap between each node and its K nearest neighbor nodes using the following formula to construct a shared nearest neighbor matrix, wherein A and B represent the sets of neighbor nodes of the two nodes: $J(A, B) = |A \cap B| / (|A| + |B| - |A \cap B|)$.
In this embodiment, after the KNNG construction is completed, the neighborhood overlap (Jaccard index) between each node and its K nearest neighbor nodes is calculated to construct a Shared Nearest Neighbor (SNN) matrix. The Jaccard index is the ratio of the neighbors common to both nodes to all of their neighbors; the greater the ratio, the more similar the two nodes.
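The Jaccard-based shared nearest neighbor weights can be sketched directly from neighbor lists. This is a minimal sketch under stated assumptions: the function name `snn_jaccard` and the neighbor lists are hypothetical, a node is not counted among its own neighbors, and weights are computed only between a node and its listed neighbors, as the text describes.

```python
def snn_jaccard(neighbors):
    """Build a symmetric SNN weight matrix from per-node neighbor lists."""
    sets = [set(row) for row in neighbors]
    n = len(sets)
    snn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            shared = len(sets[i] & sets[j])
            # Jaccard index: |A ∩ B| / (|A| + |B| - |A ∩ B|)
            jac = shared / (len(sets[i]) + len(sets[j]) - shared)
            snn[i][j] = snn[j][i] = jac
    return snn

# Neighbor lists for four nodes with K = 2 (made-up example).
toy_neighbor_lists = [[1, 2], [0, 2], [1, 0], [2, 1]]
toy_snn = snn_jaccard(toy_neighbor_lists)
```

For nodes 0 and 1 the neighbor sets {1, 2} and {0, 2} share one member out of three distinct members, giving a weight of 1/3; pairs that are not neighbors keep a weight of 0.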
Step S33: using the Louvain algorithm, scanning each node and its neighbor nodes, calculating the modularity and measuring the modularity gain after the node joins a module, selecting the neighbor node with the largest gain for joining the module, and iterating repeatedly to finally form node clusters.
It is understood that the number of clusters is determined by the Louvain algorithm, which optimizes a modularity function. The Louvain algorithm is a greedy optimization method for extracting modules from a network. By scanning each node and its neighbor nodes, it calculates the modularity and measures the modularity gain after the node joins a module, selects the neighbor node with the largest gain for joining the module, and iterates repeatedly to finally form node clusters.
Specifically, the Louvain algorithm scans each node and its neighbor nodes and calculates the modularity with the following formula, wherein m is the sum of the weights of the edges in the graph, i and j denote two nodes, $A_{ij}$ denotes the weight of the edge between the two nodes (calculated from the SNN), $k_i$ and $k_j$ denote the sums of the weights of all edges of nodes i and j, $c_i$ and $c_j$ denote the groups to which nodes i and j belong, and $\delta$ is the Kronecker delta function:
$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$
To maximize the Q value, the Louvain algorithm calculates the modularity gain and iterates repeatedly, as shown in the following formula, wherein $\Sigma_{in}$ is the sum of the weights of the edges inside the community that node i is to be moved into, $\Sigma_{tot}$ is the sum of the weights of the edges attached to the nodes of that community, and $k_{i,in}$ is the sum of the weights of the edges between node i and the nodes of that community:
Figure BDA0003604930810000122
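The modularity and the gain of a single node move can be checked numerically on a small graph. The sketch below assumes the standard modularity Q = (1/2m) Σ_ij [A_ij − k_i k_j / (2m)] δ(c_i, c_j), which matches the symbol definitions given above, and it obtains the gain simply as the difference in Q before and after the move instead of using a closed-form gain expression. The function names `modularity` and `move_gain` and the example graph are hypothetical, and a full Louvain run would repeat such moves and then aggregate communities.

```python
import numpy as np

def modularity(adjacency, communities):
    """Modularity Q of a partition of a weighted, undirected graph."""
    A = np.asarray(adjacency, dtype=float)
    c = np.asarray(communities)
    two_m = A.sum()                        # each edge counted twice in symmetric A
    k = A.sum(axis=1)                      # weighted degree of every node
    same = c[:, None] == c[None, :]        # Kronecker delta of community labels
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

def move_gain(adjacency, communities, node, target):
    """Change in Q when `node` is reassigned to community `target`."""
    moved = list(communities)
    moved[node] = target
    return modularity(adjacency, moved) - modularity(adjacency, communities)

# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
graph = np.zeros((6, 6))
for a, b in edges:
    graph[a, b] = graph[b, a] = 1.0

q_triangles = modularity(graph, [0, 0, 0, 1, 1, 1])
# Node 0 starts alone in community 2; compare joining either triangle.
start = [2, 0, 0, 1, 1, 1]
gain_own_triangle = move_gain(graph, start, node=0, target=0)
gain_other_triangle = move_gain(graph, start, node=0, target=1)
```

Here `gain_own_triangle` is positive and larger than `gain_other_triangle`, so a greedy step would place node 0 with its own triangle, which is the choice the algorithm described above makes at each iteration.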
Table 2: Total cell count and number of macrophages screened for the four data sets provided in this example
Figure BDA0003604930810000123
Figure BDA0003604930810000131
Step S40: identifying rare macrophage subtypes and differentially expressed genes thereof in the macrophages.
In this example, after the macrophages were screened, 3 groups of rare macrophage subsets consisting of IPF samples were identified by secondary clustering and differential analysis of macrophages (3 groups of cells were designated Cluster 0, Cluster 1, Cluster 2, differential gene expression Heatmap is shown in fig. 5), and all 3 groups of cells expressed IPF markers, i.e., SPP1, CCL2, FABP4, and hit1(4 marker expression plots are shown in fig. 6). SPP1 is a gene coding Osteopontin (Osteopontin), which is highly related to IPF, can be used as a disease IPF marker, and can significantly promote migration and proliferation of fibroblasts and epithelial cells. CCL2 is chemokine 2, and is capable of recruiting mononuclear macrophages and basophils and is useful as a marker for IPF diseases. FABP4 encodes a Fatty acid binding protein (Fatty acid binding protein), a cytoplasmic Fatty acid chaperone protein, that is expressed primarily in adipocytes and myeloid cells. FABP4 has a potential role in the progression of IPF disease, and its involvement in Fatty Acid Oxidation (FAO) produces large amounts of ATP that is thought to promote macrophage polarization to M1 type. Whereas M1-type macrophages, when activated, produce pro-fibrotic mediators such as TGF- β 1 (activation of fibroblasts and ECM stacking). CHIT1 is significantly expressed in the lungs of patients with IPF, has been shown to have profibrotic properties, and may be a potential new target for treatment of IPF.
According to the identification method for rare macrophage subsets and disease markers in idiopathic pulmonary fibrosis provided in this example, public data sets of human lung tissue are acquired and preprocessed; the preprocessed data sets are fused and batch effects are removed; the fused data set is clustered, the cell type of each group of cells is marked, and macrophages are screened; and rare macrophage subtypes in the macrophages and their differentially expressed genes are identified.
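The data cleaning in the first of these steps (removing genes that are not expressed in any cell, cells expressing fewer than 200 genes, and cells with a mitochondrial proportion above 25%) can be sketched in plain Python as follows. This is an illustration under stated assumptions: mitochondrial genes are recognized by the "MT-" symbol prefix, the mitochondrial proportion is taken as a fraction of total counts, and cells are filtered before genes. The function names and the toy cells are hypothetical.

```python
def passes_cell_qc(counts, min_genes=200, max_mito_fraction=0.25, mito_prefix="MT-"):
    """Check one cell; `counts` maps gene symbol -> count for that cell."""
    total = sum(counts.values())
    if total == 0:
        return False
    n_expressed = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))
    return n_expressed >= min_genes and mito / total <= max_mito_fraction


def clean(cells, **qc_kwargs):
    """Drop failing cells, then drop genes with zero counts in every kept cell."""
    kept = {name: counts for name, counts in cells.items()
            if passes_cell_qc(counts, **qc_kwargs)}
    expressed = {gene for counts in kept.values()
                 for gene, c in counts.items() if c > 0}
    return {name: {g: c for g, c in counts.items() if g in expressed}
            for name, counts in kept.items()}


def make_cell(n_genes, mito_count):
    """Build a hypothetical cell with n_genes expressed genes plus one silent gene."""
    cell = {f"GENE{i}": 1 for i in range(n_genes)}
    cell["MT-CO1"] = mito_count
    cell["SILENT"] = 0
    return cell


cells = {
    "ok": make_cell(250, 10),          # 251 genes expressed, about 4% mitochondrial
    "few_genes": make_cell(150, 10),   # fewer than 200 genes expressed
    "high_mito": make_cell(250, 100),  # about 29% mitochondrial
}
cleaned = clean(cells)
```

Only the first toy cell survives, and the gene with zero counts is dropped from it.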
Example 2
Referring to fig. 7, another embodiment of the present invention provides an identification system for rare macrophage subsets and disease markers in idiopathic pulmonary fibrosis, which includes:
the data set acquisition unit 110: acquiring a public data set of human lung tissue, and preprocessing the data set;
the preprocessing unit 120: performing data fusion on the preprocessed data set and removing batch effect;
the screening unit 130: clustering the fused data set, marking the cell type of each group of cells and screening macrophages;
the identification unit 140: identifying rare macrophage subtypes and differentially expressed genes thereof in the macrophages.
The detailed implementation of the identification system for rare macrophage subsets and disease markers in idiopathic pulmonary fibrosis provided in this example is described in detail in example 1, and is not repeated herein.
The identification system for rare macrophage subsets and disease markers in idiopathic pulmonary fibrosis provided by this embodiment of the invention acquires a public data set of human lung tissue and preprocesses it, fuses the preprocessed data set and removes batch effects, clusters the fused data set, marks the cell type of each group of cells, screens the macrophages, and identifies rare macrophage subtypes in the macrophages and their differentially expressed genes.
Example 3
A storage medium storing a program file capable of implementing the method for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis of any one of the above.
Example 4
A processor for executing a program, wherein the program when executed performs the method for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis of any one of the above.
The technical advantages of the embodiments of the present invention are at least the following: existing single-cell data analysis tasks stop at identifying cell subtypes from tissue samples, whereas the method goes further by analyzing a specific cell type to identify rare macrophage subsets and, from them, IPF disease markers, thereby providing a reference for disease diagnosis.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technical content can be implemented in other manners. The above-described system embodiments are merely illustrative, and for example, a division of a unit may be a logical division, and an actual implementation may have another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (11)

1. A method for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis, comprising the steps of:
step S10: acquiring a public data set of human lung tissue, and preprocessing the data set;
step S20: performing data fusion on the preprocessed data set and removing batch effect;
step S30: clustering the fused data set, marking the cell type of each group of cells and screening macrophages;
step S40: identifying rare macrophage subtypes and differentially expressed genes thereof in the macrophages.
2. The method of claim 1, wherein the step of obtaining a public data set of human lung tissue and preprocessing the data set at step S10 comprises the steps of:
step S11: selecting 4 data sets from the GEO database, namely GSE136831, GSE135893, GSE128033 and GSE122960, wherein each of the 4 data sets was generated by isolating single cells from human lung tissue for sequencing;
step S12: performing data cleaning on each of the data sets, the data cleaning comprising removing genes that are not expressed in any cell, removing cells in which fewer than 200 genes are expressed, and removing cells in which mitochondria-related genes account for more than 25% of expression;
step S13: carrying out data preprocessing on the cleaned data sets, wherein the preprocessing comprises log normalization, data scaling, removal of the cell cycle signal, and selection of 8000 highly variable genes for each data set for subsequent data integration.
3. The method for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis as claimed in claim 1, wherein in step S20, the step of performing data fusion on the preprocessed data set and removing batch effect comprises the following steps:
step S21: finding, by a canonical correlation analysis algorithm, projections of the plurality of data sets that maximize the correlation between all data sets;
step S22: using a dynamic time warping algorithm to determine the best mapping between the data sets.
4. The method for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis as claimed in claim 3, wherein in step S21, the step of finding projections of a plurality of said data sets by a canonical correlation analysis algorithm to maximize the correlation between all data sets comprises the following steps:
step S211: performing singular value decomposition on the highly variable gene expression matrix of each data set to obtain an initial canonical correlation vector (CCV);
step S212: then updating the canonical correlation vector W corresponding to each data set by performing singular value decomposition on each data set and the other data sets until the difference ratio of W before and after updating is larger than a given threshold value.
5. The method of claim 4, wherein in step S22, the step of determining the best mapping between the data sets using the dynamic time warping algorithm comprises the following steps:
in single-cell data sets, computing a warping path between the data sets using the dynamic time warping algorithm to minimize the distance between the data sets, wherein W = (w1, w2, …, wk) is the warping matrix, each vector of which corresponds to a point on the warping path that maps elements of data set X to data set Y and minimizes the distance between them; and mapping the canonical correlation vectors through the warping matrix to align the two data sets in a low-dimensional space;
for more than two data sets, taking the data set with the largest number of cells as a reference data set, aligning each of the other data sets with the reference data set, and finally normalizing the canonical correlation vectors of each data set to a common aligned space defined by the reference data set.
6. The method of claim 1, wherein in step S30, the step of clustering the fused data set, marking the cell type of each group of cells, and screening macrophages comprises clustering the fused data set with a clustering analysis algorithm based on community detection, and specifically comprises the following steps:
step S31: taking the data after dimensionality reduction as input data and the cells as nodes, and calculating the Euclidean distance between nodes through a KNNG algorithm to determine the K nearest neighbors of each node;
step S32: calculating the neighborhood overlap between each node and its K nearest neighbor nodes with the following formula to construct a shared nearest neighbor (SNN) matrix, wherein A and B denote the sets of neighbor nodes of the two nodes: $J(A, B) = |A \cap B| / (|A| + |B| - |A \cap B|)$;
step S33: using the Louvain algorithm, scanning each node and its neighbor nodes, calculating the modularity for each node and the modularity gain obtained when the node joins a neighbor's community, moving the node into the neighboring community with the maximum gain, and iterating repeatedly to finally form node clusters;
step S34: performing differential analysis using the Wilcoxon rank-sum test, determining cell types based on the genes differentially upregulated in each group of cells, and screening macrophages.
7. The method for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis according to claim 6, wherein in step S33, the step of using the Louvain algorithm to scan each node and its neighbor nodes, calculate the modularity for each node and the modularity gain obtained when the node joins a neighbor's community, move the node into the neighboring community with the maximum gain, and iterate repeatedly to finally form node clusters specifically comprises the following steps:
using the Louvain algorithm, scanning each node and its neighbor nodes and calculating the modularity for each node with the following formula, wherein m is the sum of the weights of the edges in the graph, i and j denote two nodes, A_ij denotes the weight between the two nodes (calculated by SNN), k_i and k_j denote the sums of the weights of all edges of nodes i and j, c_i and c_j denote the communities to which nodes i and j belong, and δ is the Kronecker delta function:
Figure FDA0003604930800000041
to maximize the Q value, the Louvain algorithm calculates the modularity gain and iterates repeatedly, as shown in the formula below, where Σ_in is the sum of the weights of the edges inside the community that node i is to join, Σ_tot is the sum of the weights of the edges attached to the nodes of that community, and k_i,in is the sum of the weights of the edges between node i and the nodes of the community into which it is to be moved:
Figure FDA0003604930800000042
8. The method for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis as claimed in claim 6, wherein in step S40, the step of identifying rare macrophage subtypes and their differentially expressed genes in said macrophages comprises the following steps:
identifying 3 groups of rare macrophage subpopulations composed of cells from IPF samples, the 3 groups of cells being designated Cluster 0, Cluster 1, and Cluster 2, and all 3 groups of cells expressing the IPF markers SPP1, CCL2, FABP4, and CHIT1, wherein: SPP1 is a gene encoding osteopontin, can significantly promote the migration and proliferation of fibroblasts and epithelial cells, and can be used as an IPF disease marker; CCL2 is chemokine 2, can recruit mononuclear macrophages, can promote fibrosis through various mechanisms involving inflammation, angiogenesis, and myofibroblast accumulation, and can serve as a marker for IPF; FABP4 encodes a fatty acid binding protein, a cytoplasmic fatty acid chaperone that is expressed in adipocytes and myeloid cells, can promote, through ATP production, the polarization of macrophages to the M1 type, and is involved in the development of IPF by promoting the activation of M1-type macrophages; and CHIT1 encodes chitotriosidase, has a profibrotic effect, and is significantly expressed in the lungs of IPF patients.
9. A system for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis, comprising:
a data set acquisition unit: acquiring a public data set of human lung tissue, and preprocessing the data set;
a preprocessing unit: performing data fusion on the preprocessed data set and removing batch effect;
screening unit: clustering the fused data set, marking the cell type of each group of cells and screening macrophages;
an identification unit: identifying rare macrophage subtypes and differentially expressed genes thereof in the macrophages.
10. A storage medium storing a program file capable of implementing the method for identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis according to any one of claims 1-8.
11. A processor configured to execute a program, wherein the program when executed performs the method of identifying rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis of any one of claims 1-8.
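The graph construction recited in steps S31 and S32 of claim 6 (K nearest neighbors by Euclidean distance, then Jaccard overlap of neighbor sets as the shared nearest neighbor weight) can be sketched in plain Python as follows. This is an illustrative sketch, not the claimed implementation: the brute-force neighbor search, the choice to exclude a node from its own neighbor set, and the toy coordinates are assumptions, and the function names are hypothetical.

```python
import math


def k_nearest_neighbors(points, k):
    """Brute-force K nearest neighbors by Euclidean distance (node itself excluded)."""
    neighbors = {}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors[i] = {j for _, j in dists[:k]}
    return neighbors


def snn_weights(neighbors):
    """Jaccard overlap J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|) for each KNN edge."""
    weights = {}
    for i, nbrs in neighbors.items():
        for j in nbrs:
            a, b = neighbors[i], neighbors[j]
            shared = len(a & b)
            jaccard = shared / (len(a) + len(b) - shared)
            if jaccard > 0:
                weights[(min(i, j), max(i, j))] = jaccard
    return weights


# Hypothetical cells in a 2-D reduced space: two well-separated groups of three.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
neighbors = k_nearest_neighbors(points, k=2)
weights = snn_weights(neighbors)
```

The resulting edge weights play the role of A_ij in the modularity formula of claim 7. On this toy input every weighted edge stays inside one of the two groups, which is the structure community detection then recovers.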
CN202210414444.4A 2022-04-20 2022-04-20 Identification method and system for rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis Pending CN114708918A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210414444.4A CN114708918A (en) 2022-04-20 2022-04-20 Identification method and system for rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210414444.4A CN114708918A (en) 2022-04-20 2022-04-20 Identification method and system for rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis

Publications (1)

Publication Number Publication Date
CN114708918A true CN114708918A (en) 2022-07-05

Family

ID=82174585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210414444.4A Pending CN114708918A (en) 2022-04-20 2022-04-20 Identification method and system for rare macrophage subpopulations and disease markers in idiopathic pulmonary fibrosis

Country Status (1)

Country Link
CN (1) CN114708918A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110160070A1 (en) * 2008-03-10 2011-06-30 Lineagen, Inc. Copd biomarker signatures
US20190263912A1 (en) * 2016-11-11 2019-08-29 The Broad Institute, Inc. Modulation of intestinal epithelial cell differentiation, maintenance and/or function through t cell action
CN113862351A (en) * 2020-06-30 2021-12-31 清华大学 Kit and method for identifying extracellular RNA biomarkers in body fluid sample

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHRISTINA MORSE: "Proliferating SPP1/MERTK-expressing macrophages in idiopathic pulmonary fibrosis", EUR RESPIR J, 22 August 2019 (2019-08-22), pages 1 - 25 *
YANYU: "Single-cell RNA sequencing identifies diverse roles of epithelial cells in idiopathic pulmonary fibrosis", JCI INSIGHT, 8 December 2016 (2016-12-08), pages 1 - 19 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination