WO2020198942A1

WO2020198942A1 - Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering

Info

Publication number: WO2020198942A1
Application number: PCT/CN2019/080443
Authority: WO
Inventors: 瞿昆; 方靖文; 黎斌; 李杨
Original assignee: 中国科学技术大学
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2020-10-08

Abstract

A single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering. The method comprises: aligning single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain alignment results, calling peaks on the basis of the alignment results, and counting the reads within each peak to obtain a cell*peak read-count matrix; and calculating a mathematical distance between the peaks in the cell*peak read-count matrix, clustering the peaks, and merging the cell*peak read-count matrix into a cell*accesson read-count matrix, wherein an accesson is a cluster of peaks. The invention provides the first scATAC-seq data analysis method and system that runs from fastq input through clustering, visualization and developmental trajectory reconstruction, and significantly improves clustering performance.

Description

Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering

Technical field

The invention belongs to the technical field of biological sequencing data analysis, and specifically relates to a single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering.

Background

Since its invention in 2012, ATAC-seq has been widely adopted in biological research because it is simple, inexpensive, and requires few cells, and it has contributed to breakthroughs in the study of embryonic development, stem cell differentiation, and cancer mechanisms and subtyping. For example, a 2017 paper in Cancer Cell (IF = 24) found that ATAC-seq could explain the pathogenesis of T-cell lymphoma and support subtyping for precision medication, and in 2018 ATAC-seq data entered the TCGA database. To study cell heterogeneity further, scATAC-seq sequencing was proposed in 2015 and has since been realized in several different technical schemes, which in turn created the need to analyze and interpret scATAC-seq sequencing data.

The main purpose of scATAC-seq data analysis is to recover, from the sequencing results, the main cell populations or the developmental differentiation trajectories in a mixed biological sample. However, scATAC-seq is still a frontier technology and its data have a low signal-to-noise ratio. scATAC-seq data analysis therefore needs an easy-to-use method that recovers as much cell heterogeneity information as possible. The published scATAC-seq analysis methods fall short in two respects. First, none offers a complete, easy-to-use workflow that starts from fastq files and proceeds through clustering, visualization, and developmental trajectory reconstruction. Second, when evaluated on gold-standard test datasets (datasets in which the subpopulation of each cell, or its position along the differentiation trajectory, is known), existing methods still recover this information poorly, as measured by ARI, and need improvement. For these reasons there is currently no unified, industry-wide analysis method for scATAC-seq data.

The prior art includes the following three analysis methods: ChromVAR, LSI and Cicero.

The ChromVAR method takes as input the cell*peak read-count matrix and the sequence of each peak. Using known transcription factor motif information, it calculates a transcription factor preference for each peak, builds from these a cell*transcription-factor preference score matrix, and uses that matrix to recover the information.

The LSI method takes as input the cell*peak read-count matrix. It transforms the matrix with the TF-IDF algorithm (TF stands for term frequency and IDF for inverse document frequency) and then uses the new matrix to recover the information.
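As a rough illustration of the TF-IDF transformation mentioned for LSI, the following sketch re-weights a tiny cell*peak read-count matrix. The function name `tf_idf`, the toy matrix, and the particular IDF formula `log(1 + N/df)` are assumptions made for this example; published LSI pipelines use several different TF-IDF variants, and the source does not say which one is meant.

```python
import math

def tf_idf(counts):
    """Re-weight a cell*peak matrix given as a list of rows (one row per cell)."""
    n_cells = len(counts)
    n_peaks = len(counts[0])
    # Document frequency: number of cells in which each peak has at least one read.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_peaks)]
    # Inverse document frequency (assumed variant); peaks seen in no cell get weight 0.
    idf = [math.log(1 + n_cells / d) if d else 0.0 for d in df]
    weighted = []
    for row in counts:
        total = sum(row)
        # Term frequency: share of the cell's reads that fall in each peak.
        tf = [c / total if total else 0.0 for c in row]
        weighted.append([t * w for t, w in zip(tf, idf)])
    return weighted

# Toy matrix: 3 cells (rows) by 3 peaks (columns).
cells_by_peaks = [
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
]
weighted = tf_idf(cells_by_peaks)
```

Peaks that are open in few cells receive a larger weight than peaks that are open in many cells, which is the property the re-weighting relies on.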

The Cicero method takes as input the cell*peak read-count matrix and the positions of the peaks on the chromosomes. Based on those positions, it merges the read counts of peaks that lie within a given genomic distance of one another (for example, peaks within 250 kb), and then uses the merged matrix for downstream information recovery.

Summary of the invention

In view of this, the present invention proposes a complete, easy-to-use scATAC-seq data analysis method and system for biological samples that recovers cell heterogeneity information efficiently.

To achieve the above objective, in one aspect, the present invention proposes a single-cell chromatin accessibility sequencing data analysis method based on peak clustering, comprising:

aligning single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain alignment results, calling peaks on the basis of the alignment results, and counting the reads within each peak to obtain a cell*peak read-count matrix;

calculating the mathematical distance between peaks in the cell*peak read-count matrix, clustering the peaks, and merging the cell*peak read-count matrix into a cell*accesson read-count matrix, wherein an accesson is a cluster of peaks.

In some embodiments, the method further comprises reducing the cell*accesson read-count matrix to a two-dimensional visualization matrix; preferably, the dimensionality reduction method includes PCA, t-SNE or UMAP.

In some embodiments, the method further comprises clustering the cells according to the cell*accesson read-count matrix; preferably, the clustering algorithm includes KNN clustering, kernel clustering or Louvain clustering.

In some embodiments, the method further comprises using the cell*accesson read-count matrix to construct the pseudotime of the cell developmental trajectory; preferably, the algorithm used for this includes SPRING or Monocle.

In another aspect, the present invention proposes a single-cell chromatin accessibility sequencing data analysis system based on peak clustering, comprising a preprocessing module and an accesson construction module.

The preprocessing module comprises: a) an alignment unit, which aligns single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain alignment results; b) a peak calling unit, which merges the alignment results of all single cells and then calls peaks; c) a read counting unit, which counts the reads within each peak to obtain the cell*peak read-count matrix.

The accesson construction module comprises: a) a peak distance calculation unit, which calculates the mathematical distance between peaks in the cell*peak read-count matrix; b) a peak clustering unit, which clusters the peaks according to the mathematical distance between them; c) a matrix conversion unit, which merges the cell*peak read-count matrix into a cell*accesson read-count matrix, wherein an accesson is a cluster of peaks.

In some embodiments, the system further comprises a visualization module for reducing the cell*accesson read-count matrix to a two-dimensional visualization matrix; preferably, the dimensionality reduction method includes PCA, t-SNE or UMAP.

In some embodiments, the system further comprises a cell clustering module for clustering the cells according to the cell*accesson read-count matrix; preferably, the clustering algorithm includes KNN clustering, kernel clustering or Louvain clustering.

In some embodiments, the system further comprises a cell developmental trajectory reconstruction module for using the cell*accesson read-count matrix to construct the pseudotime of the cell developmental trajectory; preferably, the algorithm used for this includes SPRING or Monocle.

In some embodiments, the mathematical distance includes the Euclidean distance, the Pearson correlation coefficient, or the cityblock distance.

In some embodiments, the peak clustering method includes KNN, DBSCAN, or K-means.

In some embodiments, the method of merging the cell*peak read-count matrix into the cell*accesson read-count matrix includes taking the sum, the mean, the median, or the variance of the read counts of the peaks in each accesson.

In yet another aspect, the present invention also proposes a single-cell chromatin accessibility sequencing data analysis device based on peak clustering, comprising:

a processor;

a memory storing instructions that, when executed by the processor, cause the processor to perform the analysis method.

In yet another aspect, the present invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.

Compared with the prior art, the present invention has the following beneficial effects:

The present invention provides the first scATAC-seq data analysis method and system that runs from fastq input through clustering, visualization and developmental trajectory reconstruction;

The present invention proposes an accesson construction method based on peak clustering as the key module of scATAC-seq data analysis. The converted cell*accesson read-count matrix is used for subsequent clustering, visualization and developmental trajectory reconstruction. In tests on gold-standard datasets, its clustering performance, measured by ARI, is statistically significantly higher than that of existing methods.

Description of the drawings

Figure 1 is a schematic diagram of accesson construction based on peak clustering and of the downstream analysis in an embodiment of the present invention;

Figure 2 shows the relationship between the number of accessons and the clustering performance (ARI) in an embodiment of the present invention (gold-standard test dataset 1);

Figure 3 shows scATAC-seq data of human leukemia cells and related lineage cells in an embodiment of the present invention: A. clustering (hierarchical clustering) and B. visualization (tSNE);

Figure 4 shows scATAC-seq data for the developmental differentiation lineage of human hematopoietic stem cells in an embodiment of the present invention: developmental trajectory reconstruction (Monocle);

Figure 5 shows scATAC-seq data of mouse forebrain nerve cells in an embodiment of the present invention: clustering (KNN) and visualization (tSNE);

Figures 6A-6D show scATAC-seq data of mouse thymic T cells in an embodiment of the present invention: clustering (Louvain, hierarchical clustering), visualization (tSNE) and developmental trajectory reconstruction (Monocle);

Figure 7 compares the clustering performance and running time of an embodiment of the present invention with those of existing methods (gold-standard test dataset 1);

Figure 8 compares the clustering performance and running time of an embodiment of the present invention with those of existing methods (gold-standard test dataset 2).

Detailed description

To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

For ease of understanding, the domain-specific terms used in this document are explained together here and are not explained again later.

Cell: the basic unit through which mammals (such as humans and mice) carry out life activities, and often the seat of the pathogenesis of various diseases; examples include nerve cells, epithelial cells and tumor cells.

Cell heterogeneity: a biological tissue sample (such as tumor tissue or brain tissue) consists of a large number of cells whose physiological functions differ. Cell heterogeneity commonly takes one of two forms: 1) the constituent cells belong to several well-defined cell populations (discrete); 2) the constituent cells lie along a continuous differentiation trajectory (continuous).

Genome: the complete DNA sequence of an organism, an ordered arrangement of the four bases A, T, C and G. The genomes of major mammals such as humans and mice have been fully sequenced.

Gene: a gene (genetic factor) is the entire DNA sequence required to produce a polypeptide chain or a functional RNA. A gene is generally one or more segments of DNA in the genome.

Transcription factor: a protein that binds to DNA and initiates or regulates gene expression. It usually binds to DNA by recognizing a specific DNA sequence pattern (motif).

Chromatin: a linear composite structure in the cell nucleus composed of DNA, histones, non-histone proteins and a small amount of RNA. Its basic unit is the nucleosome, formed by DNA wound around histones.

Chromatin accessibility: a measure of whether a given segment of DNA is wound around histones. In general there are two states: 1) the DNA is tightly wound around nucleosomes and is called closed DNA; 2) the DNA is not wound around nucleosomes, is exposed, and is called open DNA.

Chromatin accessibility sequencing (ATAC-seq): a sequencing technology developed at Stanford University in 2012 for detecting chromatin accessibility in biological samples (> 500 cells).

TCGA: The Cancer Genome Atlas. It contains multi-omics sequencing data from cancer tissue and normal tissue for 33 different cancers and 11,000 patients.

Single-cell chromatin accessibility sequencing (scATAC-seq): a collective term for the existing sequencing methods that detect chromatin accessibility in single cells. These include single-nucleus chromatin accessibility sequencing (snATAC-seq), single-cell combinatorial indexing chromatin accessibility sequencing (sciATAC-seq), and flow-cytometry-based single-cell chromatin accessibility sequencing (FACS scATAC-seq).

Short sequences (sequence reads): the DNA fragments obtained in omics sequencing.

Alignment (mapping): comparing the short sequences with known genome information to find the position of each short sequence on the genome.

Peak calling: analyzing the alignment results to find the positions where the DNA is open. Each such position is called a peak and is assigned a number.

Read count: the number of short sequences in each peak for each sample.

Accesson: the name the present invention gives to a peak clustering result; one accesson is one cluster of peaks. For example, Accesson 1 = peak 2, peak 3, peak 5; Accesson 2 = peak 1, peak 4.

ARI (adjusted Rand index): an evaluation index commonly used for clustering algorithms, which measures the agreement between the clusters produced by an algorithm and the true clusters.
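Because ARI is the evaluation metric used throughout this document, a minimal standard-library sketch of the usual pair-counting formula may help. The function name `adjusted_rand_index` and the handling of degenerate inputs (returning 1.0 when the index is undefined) are choices made for this example and are not taken from the source.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: agreement is trivially perfect
    # Pairs of cells that share a cluster in both labelings.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs of cells that share a cluster in each labeling separately.
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / total_pairs
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in one cluster in both labelings
    return (both - expected) / (max_index - expected)

known = ["T", "T", "B", "NK"]        # known cell types (toy example)
predicted = [0, 0, 1, 1]             # cluster ids from some algorithm
score = adjusted_rand_index(known, predicted)
```

The index is 1 for identical partitions regardless of how the clusters are named, and close to 0 for a random assignment.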

An embodiment of the present invention proposes a single-cell chromatin accessibility (scATAC-seq) sequencing data analysis system based on peak clustering (hereinafter APEC), which comprises the following modules:

1) Preprocessing module, comprising: a) an alignment unit, which aligns the fastq files (the single-cell chromatin accessibility sequencing data) to the genome sequence to produce bam files; b) a peak calling unit, which merges the bam files of all single-cell alignment results into a merge_bam file and calls peaks on it; c) a read counting unit, which counts the reads within each peak and outputs the cell*peak read-count matrix.

2) Accesson construction module, comprising: a) a peak distance calculation unit, which uses the cell*peak read-count matrix to calculate the mathematical distance between peaks (including but not limited to the Euclidean distance, the Pearson correlation coefficient, and the cityblock distance); b) a peak clustering unit, which clusters the peaks according to the mathematical distance between them, the resulting clusters being called accessons, with clustering methods including but not limited to KNN and DBSCAN; c) a matrix conversion unit, which uses the accesson assignments to merge the cell*peak read-count matrix into a cell*accesson matrix, with merging methods including but not limited to taking the sum, mean, median, or variance of the read counts of the peaks in each accesson.

3) Visualization module: reduces the cell*accesson read-count matrix to a two-dimensional visualization matrix, using dimensionality reduction methods including but not limited to PCA, t-SNE and UMAP.

4) Cell clustering module: clusters the cells using the cell*accesson read-count matrix, with clustering algorithms including but not limited to KNN clustering, kernel clustering and Louvain clustering.

5) Cell developmental trajectory reconstruction module: uses the cell*accesson read-count matrix to construct the pseudotime of the cell developmental trajectory, with algorithms including but not limited to SPRING and Monocle.

The following describes the use of APEC, in embodiments of the present invention, on four different gold-standard test datasets, to illustrate that APEC applies generally to scATAC-seq data from different biological samples. The datasets are: 1) scATAC-seq data of human leukemia cells and related lineage cells; 2) scATAC-seq data for the developmental differentiation lineage of human hematopoietic stem cells; 3) scATAC-seq data of mouse forebrain nerve cells; 4) scATAC-seq data of mouse thymic T cells.

The analysis workflow of the peak-clustering-based scATAC-seq analysis system (APEC) of the present invention comprises the following steps:

1) Data input:

The input data are fastq files, in one of two layouts: a) one fastq file per cell; b) a single mixed fastq file that can be split into per-cell data according to a splitting rule given by the data provider, for example an index sequence (splitting on differences in the first 5-10 bases of each read).
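Layout b) can be sketched as follows: each read is assigned to a cell by the index bases at the start of its sequence. This is only an illustration. The function name `split_fastq_by_barcode`, the default index length, and the decision to trim the index from the read are assumptions; in practice the splitting rule comes from the data provider.

```python
def split_fastq_by_barcode(fastq_text, barcode_len=8):
    """Group the records of a mixed fastq by the first `barcode_len` bases of each read."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("a fastq record must have exactly four lines")
    per_cell = {}
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        barcode = seq[:barcode_len]
        # Trim the index from both the sequence and its quality string (assumed rule).
        record = (header, seq[barcode_len:], plus, qual[barcode_len:])
        per_cell.setdefault(barcode, []).append(record)
    return per_cell

# Three toy reads carrying a 6-base index at the start of each sequence.
mixed = "\n".join([
    "@r1", "AAAAAACGTACG", "+", "IIIIIIIIIIII",
    "@r2", "CCCCCCTTGACC", "+", "IIIIIIIIIIII",
    "@r3", "AAAAAAGGGTTT", "+", "IIIIIIIIIIII",
])
per_cell = split_fastq_by_barcode(mixed, barcode_len=6)
```

Each key of `per_cell` then plays the role of one per-cell fastq file in layout a).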

2) Data preprocessing:

The alignment unit aligns the input data to the genome of the relevant biological sample: for example, datasets 1 and 2 are aligned to the human genome and datasets 3 and 4 to the mouse genome, or to a genome specified by the data provider. Alignment produces bam files, which record the genomic position to which each read in each fastq file aligns. The peak calling unit processes the bam files to define the open chromatin sites in the biological sample, and the read counting unit then yields the read-count matrix (m×n) of m cells by n peaks.
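The read counting step can be illustrated with a small sketch that tallies aligned reads per cell and per peak. The tuple layouts for reads and peaks and the function name `count_matrix` are assumptions made for the example; a real pipeline would read positions from bam files and use an indexed interval lookup rather than the linear scan shown here.

```python
def count_matrix(reads, peaks, cells):
    """Build the m x n read-count matrix (m cells, n peaks).

    reads: iterable of (cell_id, chrom, position) for aligned reads.
    peaks: list of (chrom, start, end), treated as half-open intervals [start, end).
    cells: list of cell ids fixing the row order.
    """
    row_of = {cell: i for i, cell in enumerate(cells)}
    matrix = [[0] * len(peaks) for _ in cells]
    for cell, chrom, pos in reads:
        if cell not in row_of:
            continue  # read from a cell that is not being analysed
        for j, (p_chrom, start, end) in enumerate(peaks):
            if chrom == p_chrom and start <= pos < end:
                matrix[row_of[cell]][j] += 1
    return matrix

peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
reads = [
    ("cellA", "chr1", 120), ("cellA", "chr1", 180), ("cellA", "chr2", 60),
    ("cellB", "chr1", 550), ("cellB", "chr1", 900), ("cellB", "chr1", 199),
]
matrix = count_matrix(reads, peaks, cells=["cellA", "cellB"])
```

Reads that fall outside every peak, such as the one at position 900, contribute to no column.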

3) Accesson construction:

Figure 1 is a schematic diagram of accesson construction based on peak clustering and of the downstream analysis in an embodiment of the present invention. In accesson construction, the m×n read-count matrix is first passed to the accesson construction module.

In the peak distance calculation unit, the Euclidean distance can be used to calculate the distance between peaks (datasets 1, 2, 3, 4); other common vector distance measures, such as the Pearson correlation coefficient or the cityblock distance, can also be used.
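The three distance measures named above can be sketched for a pair of peaks, where each peak is represented by its column of the cell*peak matrix (its read counts across all cells). The helper names are illustrative, and expressing the Pearson correlation as the distance `1 - r` is an assumption made here so that all three measures grow as peaks become less alike.

```python
import math

def column(matrix, j):
    """Read counts of peak j across all cells."""
    return [row[j] for row in matrix]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cityblock(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def pearson_distance(u, v):
    """1 - Pearson correlation; constant vectors are treated as uncorrelated."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - cov / (norm_u * norm_v)

# Toy matrix: 3 cells (rows) by 3 peaks (columns).
cell_by_peak = [
    [2, 4, 0],
    [0, 0, 3],
    [1, 2, 0],
]
peak0, peak1, peak2 = (column(cell_by_peak, j) for j in range(3))
```

Here `peak0` and `peak1` rise and fall together across cells, so their Pearson distance is close to 0 even though their Euclidean distance is not, while `peak2` follows the opposite pattern.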

In the peak clustering unit, the KNN algorithm can be used to cluster the peaks into a specified number of accessons (datasets 1, 2, 3, 4). Other common vector clustering algorithms, such as DBSCAN or K-means, can be used instead. The specified number of accessons does not affect the results over a wide range (Figure 2); it therefore defaults to 2000 and can be adjusted for the data at hand.
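As a sketch of clustering peaks into a specified number of accessons, the example below uses plain K-means, one of the alternatives named above, on the peak vectors. It is not the KNN-based procedure used for the datasets, whose details the text does not give. The function name `kmeans_peaks` and the deterministic farthest-point initialisation are assumptions chosen to keep the example reproducible.

```python
def kmeans_peaks(peak_vectors, k, n_iter=50):
    """Assign each peak vector to one of k clusters (accessons) with Lloyd's algorithm."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Deterministic initialisation: start from the first peak, then repeatedly
    # add the peak that is farthest from all centroids chosen so far.
    centroids = [list(peak_vectors[0])]
    while len(centroids) < k:
        farthest = max(peak_vectors,
                       key=lambda p: min(sqdist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sqdist(p, centroids[c]))
                      for p in peak_vectors]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(peak_vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Five peaks, each described by its read counts in four cells.
peak_vectors = [
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 6, 5],
    [1, 0, 5, 6],
    [5, 5, 1, 0],
]
accesson_of_peak = kmeans_peaks(peak_vectors, k=2)
```

Peaks 0, 1 and 4 are open in the same cells and end up in one accesson; peaks 2 and 3 form the other.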

In the matrix conversion unit, the accessons are first filtered on their basic properties, for example by discarding accessons that contain fewer peaks than a specified number, or accessons whose internal Gini coefficient is below a specified value. The accesson assignments are then used to merge the cell*peak read-count matrix into a cell*accesson matrix by summing the read counts of the peaks in each accesson (datasets 1, 2, 3, 4). Other simple summaries of a vector can also be used, such as the mean, the median, or the variance of the read counts.
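The filtering and merging step can be sketched as follows, using the sum as the merging rule. Only the minimum-peak-count filter is shown; the Gini coefficient filter is omitted because the text does not define how it is computed. The function name `build_accesson_matrix` and the threshold `min_peaks=2` are illustrative assumptions.

```python
def build_accesson_matrix(cell_by_peak, peak_labels, min_peaks=2):
    """Merge a cell*peak matrix into a cell*accesson matrix by summing read counts.

    peak_labels[j] is the accesson id of peak j. Accessons with fewer than
    `min_peaks` member peaks are discarded before merging.
    """
    members = {}
    for j, label in enumerate(peak_labels):
        members.setdefault(label, []).append(j)
    kept = sorted(label for label, idx in members.items() if len(idx) >= min_peaks)
    matrix = [[sum(row[j] for j in members[label]) for label in kept]
              for row in cell_by_peak]
    return kept, matrix

# 3 cells by 6 peaks; the last peak is alone in accesson 2 and is filtered out.
cell_by_peak = [
    [5, 4, 0, 1, 5, 7],
    [4, 5, 0, 0, 5, 0],
    [0, 0, 6, 5, 1, 2],
]
peak_labels = [0, 0, 1, 1, 0, 2]
kept, cell_by_accesson = build_accesson_matrix(cell_by_peak, peak_labels)
```

Replacing `sum` with a mean, median, or variance over the member peaks gives the other merging rules mentioned above.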

4) Data clustering and visualization

In this step, the visualization module can be used to reduce the cell*accesson read-count matrix to a two-dimensional visualization matrix, and/or the cell clustering module can be used to cluster the cells, and/or the cell developmental trajectory reconstruction module can be used to construct the pseudotime of the cell developmental trajectory.

Figure 3 shows scATAC-seq data of human leukemia cells and related lineage cells: A. clustering (hierarchical clustering) and B. visualization (tSNE);

Figure 4 shows scATAC-seq data for the developmental differentiation lineage of human hematopoietic stem cells: developmental trajectory reconstruction (Monocle);

Figure 5 shows scATAC-seq data of mouse forebrain nerve cells: clustering (KNN) and visualization (tSNE);

Figures 6A-6D show scATAC-seq data of mouse thymic T cells: Figure 6A is Louvain clustering, Figure 6B is hierarchical clustering, Figure 6C is visualization (tSNE), and Figure 6D is developmental trajectory reconstruction (Monocle).

As can be seen, the present invention covers the whole path from fastq to clustering, visualization and developmental trajectory reconstruction. In tests on gold-standard datasets, its clustering performance (ARI) is statistically significantly higher than that of existing methods, as shown in Figures 7 and 8. It recovers cell heterogeneity information efficiently because the proposed accesson construction acts as a filtering process that reduces noise and amplifies signal. Specifically: 1) compared with LSI and ChromVAR, the present invention converts the originally sparse cell*peak matrix into a denser cell*accesson matrix, which reduces noise in the subsequent analysis; 2) compared with the Cicero method, which merges peaks according to their chromatin position, the present invention clusters peaks by mathematical distance with a clustering algorithm and then merges them. Peaks clustered together in this way have similar expression patterns, so the resulting accessons are more biologically meaningful: for example, the peaks within one accesson may be regulated by the same transcription factor, or may lie closer together in the three-dimensional structure of chromatin. The converted cell*accesson matrix therefore further amplifies the cell heterogeneity signal.
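The densification argument in point 1) can be checked on a toy example by comparing the fraction of non-zero entries before and after merging peaks into accessons. The helper names and the grouping of peaks are hypothetical, and real data would of course be far larger and noisier.

```python
def density(matrix):
    """Fraction of entries in the matrix that are non-zero."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for x in row if x != 0)
    return nonzero / total if total else 0.0

def merge_by_groups(matrix, groups):
    """Sum the columns listed in each group to form one merged column per group."""
    return [[sum(row[j] for j in group) for group in groups] for row in matrix]

# A sparse 2 x 6 cell*peak matrix and two hypothetical accessons of three peaks each.
sparse_counts = [
    [1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
]
groups = [[0, 1, 2], [3, 4, 5]]
merged_counts = merge_by_groups(sparse_counts, groups)
before, after = density(sparse_counts), density(merged_counts)
```

In this example one third of the cell*peak entries are non-zero, while every entry of the merged matrix is non-zero.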

The present invention also proposes a single-cell chromatin accessibility sequencing data analysis device based on peak clustering, comprising:

a processor;

a memory storing instructions that, when executed by the processor, cause the processor to perform the analysis method.

The present invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.

It should be noted that each functional module or unit in the present invention may be hardware; for example, the hardware may be a circuit, including a digital circuit, an analog circuit, and so on. Physical implementations of the hardware structure include but are not limited to physical devices, and physical devices include but are not limited to transistors, memristors, and so on. The data processing module may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP or ASIC. The storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM or HMC.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional modules above is only an example. In practice, the functions described above may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above.

The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

A single-cell chromatin accessibility sequencing data analysis method based on peak clustering, comprising:

aligning single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result, calling peaks on the basis of the alignment result, and counting the reads within each peak to obtain a cell*peak read-count matrix; and

calculating the mathematical distance between peaks in the cell*peak read-count matrix, clustering the peaks, and merging the cell*peak read-count matrix into a cell*accesson read-count matrix, wherein an accesson is a cluster of peaks.
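As an illustration only, the second step of this claim (distance between peaks, peak clustering, merging into a cell*accesson matrix) can be sketched in plain Python. The sketch assumes Euclidean distance, K-means clustering, and summation as the merge rule, each of which is one of several listed options. The function names, the deterministic initialisation, and the toy matrix are hypothetical and are not taken from the patent.

```python
from math import dist

def peak_profiles(cell_by_peak):
    """Transpose a cell x peak matrix so each row is one peak's profile across cells."""
    return [list(col) for col in zip(*cell_by_peak)]

def kmeans_peaks(profiles, k, n_iter=20):
    """Cluster peak profiles with plain K-means; returns one accesson label per peak."""
    # Assumption: deterministic initialisation with the first k profiles as centroids.
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in profiles]
        # Update step: mean profile of each cluster (an empty cluster keeps its centroid).
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels

def merge_to_accessons(cell_by_peak, labels, k):
    """Sum the reads of all peaks belonging to each accesson, separately for every cell."""
    merged = [[0] * k for _ in cell_by_peak]
    for i, row in enumerate(cell_by_peak):
        for j, count in enumerate(row):
            merged[i][labels[j]] += count
    return merged

# Toy data (hypothetical): 4 cells x 4 peaks; peaks 0 and 2 co-vary, as do peaks 1 and 3.
cell_by_peak = [
    [5, 0, 4, 0],
    [6, 0, 5, 1],
    [0, 7, 0, 6],
    [1, 8, 0, 7],
]
labels = kmeans_peaks(peak_profiles(cell_by_peak), k=2)
cell_by_accesson = merge_to_accessons(cell_by_peak, labels, k=2)
print(labels)            # accesson index of each peak
print(cell_by_accesson)  # cell x accesson read counts
```

The point of the merge is that the resulting matrix has one column per accesson rather than one per peak, so downstream steps work on far fewer, less sparse features.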
The analysis method according to claim 1, wherein the method further comprises reducing the cell*accesson read-count matrix to a two-dimensional visualization matrix, preferably by a dimensionality reduction method comprising PCA, t-SNE, or UMAP.
The analysis method according to claim 1 or 2, wherein the method further comprises clustering the cells according to the cell*accesson read-count matrix, preferably with a clustering algorithm comprising KNN clustering, kernel clustering, or Louvain clustering.
The analysis method according to any one of claims 1 to 3, wherein the method further comprises using the cell*accesson read-count matrix to construct the pseudotime of a cell developmental path, preferably with an algorithm comprising SPRING or Monocle.
The analysis method according to any one of claims 1 to 4, wherein the mathematical distance comprises the Euclidean distance, the Pearson correlation coefficient, or the cityblock distance.
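For reference, the three measures named in this claim can be written out for two peak profiles (one value per cell). This is a minimal sketch: the function names are hypothetical, and returning 0.0 for a constant profile is an assumption made here to avoid division by zero. The Pearson coefficient is a similarity, not a distance in the strict sense; using 1 - r as the dissimilarity is a common convention but is not stated in the source.

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance between two peak profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cityblock(a, b):
    """City-block (Manhattan) distance: sum of absolute per-cell differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient between two peak profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: treat a constant profile as uncorrelated
    return cov / sqrt(var_a * var_b)

# Two hypothetical peak profiles measured across three cells.
peak_a = [1, 2, 3]
peak_b = [2, 4, 6]
print(euclidean(peak_a, peak_b), cityblock(peak_a, peak_b), pearson(peak_a, peak_b))
```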
The analysis method according to any one of claims 1 to 5, wherein the peak clustering method comprises KNN, DBSCAN, or K-Means.
The analysis method according to any one of claims 1 to 6, wherein the method of merging the cell*peak read-count matrix into the cell*accesson read-count matrix comprises taking, over the peaks in each accesson, the sum of the peak read counts, the mean of the peak read counts, the median of the peak read counts, or the variance of the peak read counts.
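The four merge rules listed in this claim differ only in the statistic applied to the peaks of each accesson within a cell. A minimal sketch using Python's `statistics` module follows. The names `AGGREGATORS` and `merge_peaks` are hypothetical, the population variance is an assumption (the source does not say which variance is meant), and an accesson with no peaks is arbitrarily given the value 0.

```python
from statistics import mean, median, pvariance

# Assumption: "variance" is taken as the population variance, which is also
# defined for an accesson that contains a single peak.
AGGREGATORS = {"sum": sum, "mean": mean, "median": median, "variance": pvariance}

def merge_peaks(cell_by_peak, labels, n_accessons, how="sum"):
    """Collapse a cell x peak matrix into a cell x accesson matrix.

    labels[j] is the accesson index of peak j; `how` selects the statistic.
    """
    aggregate = AGGREGATORS[how]
    merged = []
    for row in cell_by_peak:
        # Collect this cell's read counts per accesson.
        groups = [[] for _ in range(n_accessons)]
        for count, label in zip(row, labels):
            groups[label].append(count)
        merged.append([aggregate(g) if g else 0 for g in groups])
    return merged

# One hypothetical cell with four peaks; the first three peaks form accesson 0.
counts = [[2, 4, 9, 1]]
labels = [0, 0, 0, 1]
for how in AGGREGATORS:
    print(how, merge_peaks(counts, labels, n_accessons=2, how=how))
```

Summation preserves total read depth per cell, whereas the mean and median make accessons of different sizes comparable; which behaviour is preferable depends on the downstream normalisation.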
A single-cell chromatin accessibility sequencing data analysis system based on peak clustering, comprising a preprocessing module and an accesson construction module;

wherein the preprocessing module comprises a) an alignment unit for aligning single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result; b) a peak calling unit for merging the alignment results of all single cells and then calling peaks; and c) a read counting unit for counting the reads within each peak to obtain a cell*peak read-count matrix; and

the accesson construction module comprises a) a peak distance calculation unit for calculating the mathematical distance between peaks in the cell*peak read-count matrix; b) a peak clustering unit for clustering the peaks according to the mathematical distance between them; and c) a matrix conversion unit for merging the cell*peak read-count matrix into a cell*accesson read-count matrix, wherein an accesson is a cluster of peaks.
The analysis system according to claim 8, wherein the system further comprises a visualization module for reducing the cell*accesson read-count matrix to a two-dimensional visualization matrix, preferably by a dimensionality reduction method comprising PCA, t-SNE, or UMAP.
The analysis system according to claim 8 or 9, wherein the system further comprises a cell clustering module for clustering the cells according to the cell*accesson read-count matrix, preferably with a clustering algorithm comprising KNN clustering, kernel clustering, or Louvain clustering.
The analysis system according to any one of claims 8 to 10, wherein the system further comprises a cell developmental path reconstruction module for using the cell*accesson read-count matrix to construct the pseudotime of a cell developmental path, preferably with an algorithm comprising SPRING or Monocle.
The analysis system according to any one of claims 8 to 11, wherein the mathematical distance comprises the Euclidean distance, the Pearson correlation coefficient, or the cityblock distance.
The analysis system according to any one of claims 8 to 12, wherein the peak clustering method comprises KNN, DBSCAN, or K-Means.
The analysis system according to any one of claims 8 to 13, wherein the method of merging the cell*peak read-count matrix into the cell*accesson read-count matrix comprises taking, over the peaks in each accesson, the sum of the peak read counts, the mean of the peak read counts, the median of the peak read counts, or the variance of the peak read counts.
A single-cell chromatin accessibility sequencing data analysis device based on peak clustering, comprising:

a processor; and

a memory having instructions stored thereon which, when executed by the processor, cause the processor to perform the analysis method according to any one of claims 1 to 7.
A computer-readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the analysis method according to any one of claims 1 to 7.