WO2023029044A1 - Single-cell sequencing method and apparatus, and device, medium and program product

Publication number
WO2023029044A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
threshold
subset
clusters
nucleotide sequences
Prior art date
Application number
PCT/CN2021/116704
Other languages
French (fr)
Chinese (zh)
Inventor
韩仁敏
高欣
祁俊海
Original Assignee
百图生科(北京)智能技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百图生科(北京)智能技术有限公司
Priority to PCT/CN2021/116704 (WO2023029044A1)
Priority to CN202111481203.3A (CN114171117B)
Publication of WO2023029044A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Definitions

  • the present disclosure relates to the technical field of single-cell sequencing, and in particular to methods, devices, electronic equipment, computer-readable storage media and computer program products for single-cell sequencing.
  • Single-cell sequencing technology refers to a new technology for high-throughput sequencing analysis of genome, transcriptome, and epigenome at the level of a single cell. It can reveal the gene structure and gene expression state of a single cell, and reflect the heterogeneity among cells. It plays an important role in the fields of tumor, developmental biology, microbiology, neuroscience, etc., and is becoming the focus of life science research. In related technologies, there is still a lot of room for improvement in the study of single-cell sequencing.
  • a method for single-cell sequencing comprising:
  • determining a merging threshold based on the nanopore sequencing signals; performing the first clustering on the multiple nucleotide sequences based on a second similarity threshold to obtain a second plurality of clusters, wherein the first similarity threshold is greater than the second similarity threshold; and performing clustering optimization on the second plurality of clusters based on the merging threshold to obtain a third plurality of clusters.
  • an apparatus for single-cell sequencing including a module for implementing the above method.
  • an electronic device including: at least one processor; and at least one memory communicatively connected to the at least one processor. At least one memory stores instructions, and the instructions, when executed by at least one processor, cause at least one processor to perform the above method.
  • a non-transitory computer-readable storage medium storing instructions. When executed by at least one processor of a computer, the instructions cause the computer to execute the above method.
  • a computer program product including a computer program, and the computer program implements the above method when executed by a processor.
  • FIG. 1 is a flow chart of a method for single-cell sequencing according to an embodiment of the present disclosure
  • FIG. 2 is a flowchart of an example process for performing a first clustering of sequences in the method of FIG. 1 according to an embodiment of the disclosure
  • FIG. 3 is a schematic diagram of an example process of performing a first clustering of sequences in the method of FIG. 1 according to an embodiment of the present disclosure
  • FIG. 4 is a flowchart of an example process of determining a merge threshold in the method of FIG. 1 according to an embodiment of the present disclosure
  • FIG. 5 is a flowchart of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure
  • FIGS. 6A-6B are schematic diagrams of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure
  • FIG. 7 is a flowchart of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure
  • FIG. 8 is a schematic diagram of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure
  • FIG. 9 is a schematic diagram of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure
  • FIG. 10 is a flowchart of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure
  • FIG. 11 is a block diagram of an apparatus for single-cell sequencing according to an embodiment of the present disclosure.
  • FIG. 12 is a block diagram of an electronic device for single cell sequencing according to an embodiment of the disclosure.
  • the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, temporal relationship or importance relationship of these elements; such terms are only used to distinguish one element from another.
  • first element and the second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on contextual description.
  • the single-cell identification strategy based on the label requires "demultiplexing" of the sequences obtained by sequencing, that is, putting each sequence into the corresponding "box" according to its label. The labels of the DNA sequences in each box are the same, which means these DNA sequences come from the same single cell.
  • Some demultiplexing methods have appeared now, and these methods mainly rely on deep learning. Deep learning methods are greatly dependent on data sets, and require users to train based on samples in advance, which is not universal.
  • Third-generation sequencing can provide two kinds of information: nanopore sequencing signals and their corresponding nucleotide sequences after translation.
  • DTW pairwise dynamic time warping
  • spectral clustering, hierarchical clustering, k-means clustering
  • using nucleotide sequence information in combination with related clustering tools such as using CD-HIT to complete clustering is very fast, but the clustering accuracy is particularly poor.
  • the embodiments of the present disclosure simultaneously use the two kinds of information (nanopore sequencing signals and nucleotide sequence information) provided by third-generation sequencing to perform hybrid clustering on nucleotide sequences, which improves the clustering accuracy and extends the generality of clustering.
  • FIG. 1 is a flowchart of a method 100 for single-cell sequencing according to an embodiment of the present disclosure. As shown in FIG. 1 , the method 100 includes step 110 to step 150 .
  • a sequencing library can include multiple nucleotide sequences from multiple single cells.
  • a nanopore sequencing signal corresponding to a sequence can be obtained by using nanopore sequencing technology, and then a nucleotide sequence translated from the nanopore sequencing signal can be obtained.
  • the plurality of nucleotide sequences are first clustered to obtain a first plurality of clusters, the first plurality of clusters including the largest cluster with the largest cluster size.
  • the information of the multiple nucleotide sequences can be used to perform the first clustering on the multiple nucleotide sequences through a greedy clustering algorithm.
  • each cluster in the first plurality of clusters includes one or several nucleotide sequences. It should be known that the "cluster size" in this application refers to the number of nucleotide sequences included in a cluster.
  • the first plurality of clusters can be sorted by cluster size, and the largest cluster with the largest cluster size can be obtained.
  • a merge threshold is determined based on the average signal length of nanopore sequencing signals corresponding to multiple nucleotide sequences and the nanopore sequencing signals corresponding to each nucleotide sequence in the largest cluster.
  • a first clustering is performed on the plurality of nucleotide sequences based on a second similarity threshold to obtain a second plurality of clusters, wherein the first similarity threshold is greater than the second similarity threshold.
  • a higher first similarity threshold (for example, 90%)
  • the clustering result is used to calculate the merging threshold.
  • perform the first clustering again on the plurality of nucleotide sequences with a second similarity threshold (for example, 85%)
  • the information of the nanopore sequencing signal can be further used to refine the second plurality of clusters.
  • the sequence read lengths of nanopore sequencing signals corresponding to all the sequences can be obtained, and the average signal length is the average value of all sequence read lengths.
  • the merging threshold can be determined by using the nanopore sequencing signals corresponding to all the nucleotide sequences in the largest cluster.
  • the merging threshold can be a function of the nanopore sequencing signal corresponding to all nucleotide sequences in the largest cluster.
  • step 150 cluster optimization is performed on the second plurality of clusters based on the merging threshold to obtain a third plurality of clusters.
  • some clusters of the second plurality of clusters may be merged based on a merge threshold.
  • multiple refinements may also be performed on the merged clusters.
  • the third plurality of clusters obtained after optimization is the final clustering result.
  • the method 100 performs hybrid clustering on nucleotide sequences by using two kinds of information (nanopore sequencing signals and nucleotide sequences) obtained by nanopore sequencing technology. First, the nucleotide sequences are clustered with the first similarity threshold; then the result of that clustering and the nanopore sequencing signals are combined to determine the merging threshold; then the first clustering is performed on the nucleotide sequences again with the second similarity threshold; finally, the merging threshold is used to merge and refine the result of the clustering performed with the second similarity threshold. Therefore, compared with clustering that uses only nucleotide sequence information, the method 100 improves clustering accuracy; and compared with clustering that uses only nanopore sequencing signals, the method 100 improves clustering efficiency. In addition, the method 100 does not need a large number of samples for training, but only needs the two kinds of corresponding information as input, so it has high versatility.
  • FIG. 2 is a flowchart of an example process of first clustering sequences in method 100 of FIG. 1 according to an embodiment of the disclosure.
  • the first clustering of multiple nucleotide sequences includes steps 210 to 270 .
  • Steps 210 to 270 are an iterative process as a whole, that is, continue to execute steps 210 to 270 until the set of nucleotide sequences to be clustered among the multiple nucleotide sequences is empty.
  • a representative sequence of the set of nucleotide sequences to be clustered is determined.
  • the nucleotide sequences to be clustered are multiple nucleotide sequences themselves.
  • determining a representative sequence of the set of nucleotide sequences to be clustered includes determining a nucleotide sequence having the longest length in the set of nucleotide sequences to be clustered as the representative sequence.
  • the nucleotide sequences to be clustered can be sorted in descending order of sequence length, and then the nucleotide sequence with the longest length is selected as a representative sequence.
  • step 220 the set of nucleotide sequences to be clustered is filtered by a short word filter.
  • a short word filter can filter out sequences with a shorter length. By filtering with the short word filter first, the number of subsequent pairwise sequence alignments can be reduced.
  • step 230 it is judged whether the filtered set of nucleotide sequences to be clustered is an empty set.
  • step 240 in response to the filtered set of nucleotide sequences to be clustered being non-empty, for each nucleotide sequence in the filtered set of nucleotide sequences to be clustered: determine the similarity between the nucleotide sequence and the representative sequence.
  • the similarity between two different sequences can be determined by sequence alignment.
  • sequence consensus algorithm of BLAST or the gap compression algorithm can be used.
  • the nucleotide sequence is added to a similarity cluster comprising representative sequences.
  • the nucleotide sequences whose similarity is greater than the preset first similarity threshold are added to the similarity cluster where the representative sequence is located .
  • all the sequences greater than the first similarity threshold are added to the similarity cluster where the representative sequence is located, and then the generation of a cluster is completed.
  • step 260 in response to the filtered set of nucleotide sequences to be clustered being an empty set, the representative sequence is added to a short word cluster.
  • the representative sequence itself is regarded as a short word cluster.
  • step 270 the nucleotide sequences in the similarity cluster and the short word cluster are removed from the set of nucleotide sequences to be clustered, so as to update the set of nucleotide sequences to be clustered.
  • the generated similarity clusters that is, clustered
  • the updated set of nucleotide sequences to be clustered will return to the initial step 210 of the iteration to redefine representative sequences and generate another similarity cluster.
  • each similarity cluster and each short word cluster obtained in the iterative process form the first plurality of clusters. It should be noted that steps 210 to 270 also apply to the first clustering with the second similarity threshold, which yields the second plurality of clusters; it is only necessary to replace the first similarity threshold in step 250 with the second similarity threshold.
  • the embodiments of the present disclosure can use the information of nucleotide sequence to perform the first clustering according to different similarity thresholds, and use the short word filter to improve the clustering efficiency during the clustering process.
  • steps 210 to 270 may be implemented by Algorithm 1 as follows:
  • N represents the set of nucleotide sequences to be clustered
  • NS represents the set of nucleotide sequences filtered out by the short word filter
  • S represents the set of nucleotide sequences to be clustered after filtering
  • center now represents the representative sequence
  • identity represents the similarity threshold (for example, it can represent the first similarity threshold or the second similarity threshold)
  • c represents the distance between a sequence x in the filtered nucleotide sequence set to be clustered and the representative sequence center now
  • Clusters represents the first plurality of clusters.
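  • As a minimal sketch of the iteration in steps 210 to 270, the following Python code picks the longest remaining sequence as the representative, applies a filter, compares the surviving candidates with the representative, and removes the finished cluster from the set to be clustered. Algorithm 1 itself is not reproduced in this text, and neither the short word filter nor the similarity measure is specified, so `shares_short_word` (a test for at least one shared k-length word, which also rejects sequences shorter than k), `similarity` (based on `difflib.SequenceMatcher.ratio`) and the name `greedy_cluster` are illustrative assumptions rather than the method of the disclosure.

```python
from difflib import SequenceMatcher


def shares_short_word(a, b, k=3):
    """Assumed short word filter: keep b only if it shares a k-mer with a."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return bool(words_a & words_b)


def similarity(a, b):
    """Assumed similarity measure in [0, 1]; a real pipeline would align sequences."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, identity):
    """Greedy clustering of sequences with a similarity threshold `identity`."""
    # N: indices of the sequences still to be clustered, longest first.
    remaining = sorted(range(len(sequences)),
                       key=lambda i: len(sequences[i]), reverse=True)
    clusters = []
    while remaining:
        center_now = sequences[remaining[0]]  # representative sequence
        cluster = [center_now]
        rest = []
        for i in remaining[1:]:
            x = sequences[i]
            # Only candidates that pass the filter are compared in full.
            if shares_short_word(center_now, x) and similarity(x, center_now) >= identity:
                cluster.append(x)
            else:
                rest.append(i)  # stays in the set for a later iteration
        # If nothing passed the filter, the representative forms a cluster alone.
        clusters.append(cluster)
        remaining = rest
    return clusters
```

  • Running the same function once with the first similarity threshold and once with the second similarity threshold would give the first and second plurality of clusters, respectively.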
  • FIG. 3 is a schematic diagram of an example process of first clustering sequences in the method 100 of FIG. 1 according to an embodiment of the disclosure.
  • for the set of nucleotide sequences to be clustered 310, the sequences are first sorted in descending order of length to obtain the sorted set 320. Then the nucleotide sequence with the longest length in the set 320, that is, the first nucleotide sequence, is selected as the representative sequence 321.
  • the set of nucleotide sequences to be clustered 310 or the sorted set 320 is input to a short word filter 330 to filter out short-length nucleotide sequences, such as the sequence 326 . Further, through the sequence comparison module 340, the similarity between each nucleotide sequence in the filtered nucleotide sequence set to be clustered and the representative sequence 321 is calculated. For example, the similarity between the sequence 322 and the representative sequence 321 is greater than or equal to the first similarity threshold, while the similarities between the sequences 323, 324 and 325 and the representative sequence 321 are all smaller than the first similarity threshold.
  • sequence 322 is added to the similarity cluster 350 that includes the representative sequence 321, and the sequence 322 is removed from the set of nucleotide sequences to be clustered 310 to obtain an updated set 360 of nucleotide sequences to be clustered.
  • the updated set 360 of nucleotide sequences to be clustered is iterated 370 continuously until the updated set 360 is an empty set.
  • each similarity cluster 380-1, 380-2 through 380-k and each short word cluster 390-1, 390-2 through 390-t form a first plurality of clusters.
  • FIG. 4 is a flowchart of an example process of determining a merge threshold in method 100 of FIG. 1 according to an embodiment of the disclosure. As shown in FIG. 4 , determining the merge threshold (step 130 ) includes steps 410 to 430 .
  • a first threshold number of nanopore sequencing signals are randomly selected from the nanopore sequencing signals corresponding to each nucleotide sequence in the largest cluster.
  • the number of nanopore sequencing signals to be selected from the largest cluster can be determined according to the size of the largest cluster.
  • in one example, in response to determining that the maximum cluster size is greater than a second threshold, the first threshold is equal to the second threshold. In another example, in response to determining that the maximum cluster size is less than or equal to the second threshold, the first threshold is equal to the maximum cluster size.
  • step 420 a first dynamic time warping distance between every two nanopore sequencing signals among the first threshold number of nanopore sequencing signals is calculated.
  • the dynamic time warping (DTW) distance between the two signals can be calculated.
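  • The DTW distance between two signals can be sketched with the classic dynamic programming recurrence. The disclosure does not state the local cost or any window constraint, so this sketch assumes signals are sequences of numbers, uses the absolute difference as local cost, and applies no window; the name `dtw_distance` is illustrative.

```python
def dtw_distance(a, b):
    """Unconstrained DTW distance between two numeric signals a and b."""
    inf = float("inf")
    # prev[j] holds the best cumulative cost of aligning the prefix of a
    # processed so far with the first j samples of b.
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            step = min(prev[j],      # advance in a only
                       prev[j - 1],  # advance in both signals
                       cur[j - 1])   # advance in b only
            cur.append(abs(x - y) + step)
        prev = cur
    return prev[-1]
```

  • Because the recurrence allows one sample to be matched against several samples of the other signal, two signals that differ only in local stretching have distance zero, which is why DTW suits nanopore signals of varying length.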
  • a merging threshold is determined based on the sum of the first dynamic time warping distances, the average signal length of the nanopore sequencing signals corresponding to the multiple nucleotide sequences, and the maximum cluster size.
  • the method for calculating the merging threshold may be determined according to the average signal length.
  • the merging threshold is a function of the mean of all DTW distances between any pair of signals.
  • step 410 to step 430 may be implemented by Algorithm 2 as follows:
  • a high similarity threshold (for example, 90%) can be set to run the first clustering method shown in Algorithm 1 to obtain the first plurality of clusters.
  • MaxCluster represents the largest cluster
  • MaxLength represents the maximum cluster size
  • AveLenSig represents the average signal length of nanopore sequencing signals corresponding to multiple nucleotide sequences
  • sum represents the sum of the first dynamic time warping distances
  • sum/MaxLength denotes the mean of all first dynamic time warping distances
  • Threshold denotes the merge threshold.
  • the second threshold is denoted as 10. When the maximum cluster size MaxLength is greater than the second threshold (10), then the first threshold is equal to the second threshold (10).
  • the first threshold (10) sequences are randomly selected from the MaxCluster.
  • the maximum cluster size MaxLength is less than or equal to the second threshold (10)
  • the first threshold is equal to MaxLength.
  • the first threshold (MaxLength) sequence is randomly selected from the maximum cluster MaxCluster.
  • the merge threshold Threshold can also be expressed as:
  • Threshold = sum/MaxLength + c
  • c is a constant and can be determined by constructing a large number of simulation data sets for testing.
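  • The threshold computation described above can be sketched as follows. Algorithm 2 is not reproduced in this text, and how the average signal length AveLenSig selects the calculation method is not stated, so the sketch implements only the expression Threshold = sum/MaxLength + c. The function name `merge_threshold`, the `distance` callable (any DTW implementation), the `seed` parameter and the default value of `c` are assumptions for illustration.

```python
import random
from itertools import combinations


def merge_threshold(max_cluster_signals, distance, second_threshold=10, c=0.0, seed=0):
    """Estimate the merging threshold from the signals of the largest cluster."""
    max_length = len(max_cluster_signals)  # MaxLength: size of MaxCluster
    if max_length == 0:
        raise ValueError("the largest cluster must contain at least one signal")
    # First threshold: the second threshold if the cluster is larger than it,
    # otherwise the cluster size itself.
    first_threshold = min(max_length, second_threshold)
    rng = random.Random(seed)
    sample = rng.sample(list(max_cluster_signals), first_threshold)
    # Sum of the first DTW distances over every pair of sampled signals.
    total = sum(distance(a, b) for a, b in combinations(sample, 2))
    return total / max_length + c
```

  • Fixing the seed only makes the sketch reproducible; the random selection itself follows the description above.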
  • in the embodiments of the present disclosure, after the first clustering is performed on the multiple nucleotide sequences using a high similarity threshold, the nanopore sequencing signals corresponding to the multiple nucleotide sequences are used to determine the merging threshold.
  • the similarity between every two signals is measured by the first dynamic time warping distance. Therefore, the embodiments of the present disclosure combine two kinds of information, which can be used to improve the accuracy of clustering.
  • FIG. 5 is a flowchart of an example process of cluster optimization in the method 100 of FIG. 1 according to an embodiment of the disclosure. As shown in FIG. 5 , performing clustering optimization on the second plurality of clusters based on the merging threshold (step 150 ) includes steps 510 to 540 .
  • a first subset of the second plurality of clusters is determined, each cluster in the first subset having a cluster size greater than a third threshold.
  • the second plurality of clusters may be classified into good clusters (first subset) and bad clusters based on a cluster size of each cluster and a third threshold. For example, if the cluster size of the cluster is greater than a third threshold (for example, it may be set to 5), then the cluster belongs to the first subset.
  • the third threshold for classification may be determined according to the maximum cluster size.
  • determining the first subset includes: in response to determining that the maximum cluster size is greater than the third threshold, determining the clusters of the second plurality of clusters whose cluster size is greater than the third threshold to form the first subset; and in response to determining that the maximum cluster size is less than or equal to the third threshold, determining the clusters of the second plurality of clusters whose cluster size is equal to the maximum cluster size to form the first subset.
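  • The selection rule for the first subset can be written as a short sketch. Clusters are assumed to be plain Python lists, and the name `select_first_subset` and the default value 5 for the third threshold (taken from the example above) are illustrative.

```python
def select_first_subset(clusters, third_threshold=5):
    """Return the clusters that form the first subset ("good" clusters)."""
    if not clusters:
        return []
    max_length = max(len(cluster) for cluster in clusters)
    if max_length > third_threshold:
        # Normal case: keep every cluster larger than the third threshold.
        return [c for c in clusters if len(c) > third_threshold]
    # All clusters are small: keep only those as large as the largest one.
    return [c for c in clusters if len(c) == max_length]
```

  • The fallback branch ensures the first subset is never empty when the input contains at least one cluster.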
  • step 520 for each cluster in the first subset: randomly select a fourth threshold number of nanopore sequencing signals from the nanopore sequencing signals corresponding to each nucleotide sequence in the cluster.
  • step 530 the respective second dynamic time warping distances between each of the fourth threshold number of nanopore sequencing signals randomly selected from the cluster and each of the fourth threshold number of nanopore sequencing signals randomly selected from another cluster in the first subset are calculated.
  • step 540 in response to determining that the respective second dynamic time warping distances are all less than the merge threshold, the cluster and the other cluster are merged to obtain a merged first subset.
  • nanopore sequencing signals corresponding to the fourth threshold sequence may be randomly selected from each cluster in the first subset.
  • the fourth threshold can be set to three, for example.
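  • Steps 520 to 540 can be sketched as below. The order in which cluster pairs are examined is not stated in this text, so the sketch assumes a single greedy pass in which each cluster is merged into the first already-merged cluster that passes the test. Each cluster is assumed to be a list of signals, `distance` stands for any DTW implementation, and the names `merge_first_subset` and `seed` are illustrative.

```python
import random


def merge_first_subset(clusters, threshold, distance, fourth_threshold=3, seed=0):
    """Merge clusters whose sampled signals are all closer than the threshold."""
    rng = random.Random(seed)
    merged = []
    for cluster in clusters:
        # Step 520: sample up to fourth_threshold signals from this cluster.
        sample = rng.sample(list(cluster), min(fourth_threshold, len(cluster)))
        for target in merged:
            other = rng.sample(target, min(fourth_threshold, len(target)))
            # Steps 530 and 540: merge only if every cross pair is below the threshold.
            if all(distance(a, b) < threshold for a in sample for b in other):
                target.extend(cluster)
                break
        else:
            merged.append(list(cluster))
    return merged
```

  • With a fourth threshold of three, at most nine distances are computed per cluster pair, which keeps the number of DTW evaluations small compared with comparing all signals.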
  • step 510 to step 540 may be implemented through Algorithm 3 as follows:
  • the function GETMERTHCFSIGNAL can implement step 510 to step 530 .
  • MaxLength represents the maximum cluster size
  • the third threshold is set to 5. It can be seen that when the maximum cluster size MaxLength is greater than the third threshold (5), those clusters whose cluster size is greater than the third threshold (5) are selected from the second plurality of clusters to form the first subset GoodCluster. On the other hand, when the maximum cluster size MaxLength is less than or equal to the third threshold (5), those clusters whose cluster size is equal to the maximum cluster size MaxLength are selected from the second plurality of clusters to form the first subset GoodCluster. Further, after the first subset GoodCluster is merged through step 530 and step 540, the merged first subset RefineGoodCluster can be obtained.
  • FIG. 6A-6B are schematic diagrams of an example process of cluster optimization in the method 100 of FIG. 1 , according to an embodiment of the present disclosure.
  • FIG. 6A shows the second plurality of clusters 610 , 620 , 630 , 640 , 650 and 660 obtained after the first clustering of 30 nucleotide sequences with the second similarity threshold. Sequences with the same texture in each sequence in the figure are located in the same cluster. For example, nucleotide sequences 641 , 642 and 643 are located in cluster 640 .
  • a first subset 670 of the second plurality of clusters 610 to 660 is determined.
  • the clusters 610 , 630 and 660 with a cluster size larger than 5 form the first subset 670 .
  • clusters 620 , 640 and 650 with a cluster size smaller than 5 do not belong to the first subset 670 .
  • FIG. 6B illustrates example operations for merging optimization on each of the clusters 610 , 630 , and 660 in the first subset 670 .
  • the fourth threshold set to 3 in FIG. 6B
  • sequences 611 , 612 and 613 are first randomly selected from the cluster 610 .
  • three sequences 631 , 632 and 633 are randomly selected from cluster 630 .
  • the second dynamic time warping distances between the nanopore sequencing signals corresponding to the sequences from different clusters are calculated respectively.
  • the second dynamic time warping distances between the sequence 611 and the sequence 631, between the sequence 611 and the sequence 632, and between the sequence 611 and the sequence 633 are respectively calculated. Similar operations are also performed for sequence 612 and sequence 613 .
  • it is then judged (680) whether all the second dynamic time warping distances are smaller than the merging threshold. For example, if all the second dynamic time warping distances between the sequences 611 , 612 and 613 and the sequences 631 , 632 and 633 are smaller than the merging threshold, the cluster 610 and the cluster 630 are merged to obtain the cluster 690 .
  • sequences 661 , 662 and 663 are randomly picked from cluster 660 .
  • the embodiments of the present disclosure can use the nanopore sequencing signals to merge some of the clusters in the second plurality of clusters, thereby improving the clustering accuracy.
  • FIG. 7 is a flowchart of an example process of cluster optimization in the method 100 of FIG. 1 according to an embodiment of the disclosure. As shown in FIG. 7 , cluster optimization (step 150 ) further includes steps 710 to 740 .
  • a consensus sequence signal corresponding to each cluster in the merged first subset is determined.
  • the consensus sequence of a cluster can be determined first, and then the nanopore sequencing signal of the corresponding consensus sequence can be determined to obtain the consensus sequence signal.
  • step 720 for each nucleotide sequence included in the second subset and for each consensus sequence signal: calculate the third dynamic time warping distance between the nanopore sequencing signal corresponding to the nucleotide sequence and the consensus sequence signal. That is, for each sequence in the second subset, the respective third dynamic time warping distances between that sequence and all consensus sequence signals are calculated.
  • step 730 in response to determining that the third dynamic time warping distance is less than the merging threshold, the nucleotide sequence is added to the cluster in the merged first subset corresponding to the consensus sequence signal, so as to update the merged first subset.
  • when the third dynamic time warping distance between a sequence in the second subset and a consensus sequence signal is smaller than the merging threshold, the sequence is added to the cluster corresponding to the consensus sequence signal.
  • the nucleotide sequences added to the merged first subset are removed from the second subset to update the second subset.
  • step 710 to step 740 may be implemented by Algorithm 4 as follows:
  • RefineGoodCluster represents the merged first subset
  • OSS represents the second subset
  • InitialCFSignalSet represents the set of consensus sequence signals of each cluster in the first subset RefineGoodCluster
  • threshold represents the merge threshold
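  • Steps 720 to 740 can be sketched as follows. Algorithm 4 is not reproduced in this text, and it does not say what happens when a signal is within the threshold of several consensus sequence signals, so the sketch assumes the signal joins the cluster of the nearest one. The function name `assign_to_consensus` and the `distance` callable (any DTW implementation) are illustrative; the parameters mirror RefineGoodCluster, InitialCFSignalSet, OSS and threshold.

```python
def assign_to_consensus(refine_good_cluster, consensus_signals, oss, threshold, distance):
    """Move signals from the second subset into clusters of the merged first subset.

    refine_good_cluster[i] is the cluster whose consensus sequence signal
    is consensus_signals[i]; oss is the second subset.
    """
    updated = [list(cluster) for cluster in refine_good_cluster]
    remaining = []  # updated second subset
    for signal in oss:
        # Third DTW distances between this signal and every consensus signal.
        dists = [distance(signal, cf) for cf in consensus_signals]
        if dists:
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] < threshold:
                updated[best].append(signal)
                continue
        remaining.append(signal)
    return updated, remaining
```

  • Comparing each leftover signal with one consensus signal per cluster, instead of with every member, keeps the number of distance computations proportional to the number of clusters.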
  • FIG. 8 is a schematic diagram of an example process of cluster optimization in the method 100 of FIG. 1 according to an embodiment of the present disclosure.
  • the merged first subset includes cluster 810 and cluster 820 .
  • the consensus sequence signal 811 for cluster 810 and the consensus sequence signal 821 for cluster 820 are respectively determined.
  • the second subset 830 includes 9 nucleotide sequences. Taking the sequence 831 as an example, calculate the third dynamic time warping distances (denoted by d1 and d2 respectively) between the nanopore sequencing signal corresponding to the sequence 831 and the consensus sequence signal 811 and the consensus sequence signal 821 . Then the comparator 840 judges whether d1 and d2 are smaller than the merge threshold.
  • the sequence 831 is added to the cluster 810 corresponding to the consensus sequence signal 811, so as to update the cluster 810 to be the cluster 810'.
  • the sequence 833 is added to the cluster 820 corresponding to the consensus sequence signal 821, so as to update the cluster 820 to be the cluster 820'. Accordingly, the sequences 831 and 833, for example, are removed from the second subset 830.
  • the second subset 830 retains the sequence 832 . Finally, the updated second subset 830' is obtained.
  • the embodiment of the present disclosure utilizes the consensus sequence signal to further optimize the merged second plurality of clusters, thereby improving the clustering accuracy.
  • further cluster optimization may be performed in response to the updated second subset being non-empty.
  • clustering is performed on the updated second subset based on nanopore sequencing signals corresponding to each nucleotide sequence in the updated second subset to obtain at least one cluster.
  • the fourth dynamic time warping distance between the nanopore sequencing signals corresponding to every two nucleotide sequences in each cluster of the at least one cluster is smaller than the merge threshold, and the updated merged first subset and the at least one cluster form the third plurality of clusters.
  • Algorithm 4 can be referred to to implement the above steps.
  • G represents at least one cluster. For each sequence in at least one cluster G, the fourth dynamic time warping distance between any pair of them is smaller than the merging threshold.
  • FIG. 9 is a schematic diagram of an example process of cluster optimization in the method 100 of FIG. 1 according to an embodiment of the present disclosure.
  • The updated second subset includes nucleotide sequences 910, 920, 930 and 940.
  • For the nucleotide sequence 910, the fourth dynamic time warping distances 921, 931 and 941 between the nanopore sequencing signal corresponding to the sequence 910 and the nanopore sequencing signals corresponding to the sequences 920, 930 and 940, respectively, are calculated first.
  • The fourth dynamic time warping distances 921, 931 and 941 are each compared with the merge threshold by a comparator 950.
  • When the distance 921 is smaller than the merge threshold, a new cluster 960 is generated. Cluster 960 includes sequences 910 and 920. Since the distances 931 and 941 are still greater than or equal to the merge threshold, a fourth dynamic time warping distance 943 between the nanopore sequencing signal corresponding to the sequence 930 and the nanopore sequencing signal corresponding to the sequence 940 is calculated next. The comparator 950 is then used to determine whether the fourth dynamic time warping distance 943 is smaller than the merge threshold. When the fourth dynamic time warping distance 943 is smaller than the merge threshold, another new cluster 970 is generated. Cluster 970 includes sequences 930 and 940.
  • The embodiments of the present disclosure can thus further cluster the nucleotide sequences that were not assigned to an existing cluster, thereby improving the clustering accuracy.
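The greedy grouping walked through for FIG. 9 might look like the following Python sketch. It is an illustration under stated assumptions, not a reproduction of Algorithm 4: the name `cluster_leftovers`, the `dist` callable (a stand-in for the fourth dynamic time warping distance) and the return format are hypothetical. To honour the requirement that every pair inside a cluster is closer than the merge threshold, a candidate is only admitted if it is within the threshold of every current member; sequences that find no partner are returned separately, which is an interpretation of the text rather than something it states.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Signal = Sequence[float]


def cluster_leftovers(
    signals: Dict[str, Signal],
    dist: Callable[[Signal, Signal], float],
    merge_threshold: float,
) -> Tuple[List[List[str]], List[str]]:
    """Greedily group reads so that all in-cluster pairs pass the threshold."""
    pending = list(signals)  # read ids, in insertion order
    clusters: List[List[str]] = []
    unclustered: List[str] = []
    while pending:
        seed = pending.pop(0)
        group = [seed]
        for other in list(pending):
            # Admit `other` only if it is close to every current member.
            if all(
                dist(signals[other], signals[member]) < merge_threshold
                for member in group
            ):
                group.append(other)
                pending.remove(other)
        if len(group) > 1:
            clusters.append(group)
        else:
            # No partner found; left for a later checking step.
            unclustered.append(seed)
    return clusters, unclustered
```

With four reads laid out as in FIG. 9, the sketch yields one cluster for 910 and 920 and another for 930 and 940.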
  • FIG. 10 is a flowchart of an example process of cluster optimization in the method 100 of FIG. 1 according to an embodiment of the disclosure.
  • In response to a third subset of the second plurality of clusters, other than the third plurality of clusters, being non-empty, the cluster optimization may further include steps 1010 to 1050.
  • Steps 1010 to 1050 can serve as a checking mechanism for the clustering results, used to find nucleotide sequences that have not been added to any cluster. For example, some nucleotide sequences are very short because of translation (basecalling) errors. Steps 1010 to 1050 can add such nucleotide sequences to the corresponding clusters.
  • In step 1010, for each nucleotide sequence in the third subset, a fifth dynamic time warping distance is calculated between the nanopore sequencing signal corresponding to that nucleotide sequence and the nanopore sequencing signal corresponding to a nucleotide sequence randomly selected from the third plurality of clusters.
  • In step 1020, in response to determining that the fifth dynamic time warping distance is smaller than the merge threshold, the nucleotide sequence is added to the cluster of the third plurality of clusters that contains the randomly selected nucleotide sequence, so as to update the third plurality of clusters.
  • In step 1030, each nucleotide sequence added to the third plurality of clusters is removed from the third subset, so as to update the third subset.
  • Steps 1010 to 1030 may be implemented by Algorithm 5, in which:
  • NN denotes the third subset;
  • ss denotes a sequence randomly selected from the current third plurality of clusters, Clusters_now.
  • Algorithm 5 calculates the fifth dynamic time warping distance between the nanopore sequencing signal corresponding to each sequence nn in the third subset NN and that of ss, and determines whether the distance is smaller than the merge threshold Threshold. If it is, nn is added to the cluster in which ss is located.
  • In response to the updated third subset being non-empty, the cluster optimization further includes step 1040 and step 1050.
  • In step 1040, each nucleotide sequence in the updated third subset is placed into its own individual cluster.
  • In step 1050, each such individual cluster is added to the updated third plurality of clusters.
  • The embodiments of the present disclosure thus also introduce a checking mechanism that clusters the nucleotide sequences not yet added to any cluster, thereby ensuring the completeness of the clustering results.
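Steps 1010 to 1050 can be sketched as follows. This is a hedged illustration, not Algorithm 5 itself: the function name `rescue_unassigned`, the list-of-lists cluster layout and the `dist` callable (standing in for the fifth dynamic time warping distance) are assumptions. The text says that ss is randomly selected from the current clusters; the sketch reads this as drawing one random representative from each cluster in turn, which is one possible interpretation.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence

Signal = Sequence[float]


def rescue_unassigned(
    clusters: List[List[str]],
    signals: Dict[str, Signal],
    third_subset: List[str],
    dist: Callable[[Signal, Signal], float],
    merge_threshold: float,
    rng: Optional[random.Random] = None,
) -> List[List[str]]:
    """Attach leftover reads to existing clusters, else make singletons."""
    rng = rng or random.Random()
    current = [list(members) for members in clusters]  # Clusters_now
    still_unassigned: List[str] = []
    for nn in third_subset:
        placed = False
        for members in current:
            ss = rng.choice(members)  # random representative of this cluster
            if dist(signals[nn], signals[ss]) < merge_threshold:
                members.append(nn)  # steps 1020 and 1030
                placed = True
                break
        if not placed:
            still_unassigned.append(nn)
    # Steps 1040 and 1050: every remaining read becomes its own cluster.
    current.extend([nn] for nn in still_unassigned)
    return current
```

Because every read ends up either in an existing cluster or in a singleton, the returned clustering covers all input reads, which is the completeness property the checking mechanism is meant to provide.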
  • The plurality of nucleotide sequences come from a plurality of single cells; nucleotide sequences from the same single cell carry the same tag, and nucleotide sequences from different single cells carry different tags.
  • The embodiments of the present disclosure can be used not only for clustering nucleotide sequences without tags, but also for clustering nucleotide sequences with tags, directly associating the clustering results with the tags.
  • The third plurality of clusters are associated with respective corresponding tags, and the method 100 further includes: based on the third plurality of clusters and the corresponding tags associated with them, separating the nucleotide sequences from each of the plurality of single cells from the plurality of nucleotide sequences.
  • The sequences in the sequencing library incorporate tags, and each tag reflects the cell of origin of its sequence.
  • Tagged sequences can be clustered. After the clustering is completed, the nucleotide sequences of each single cell can be separated from the plurality of nucleotide sequences according to the clustering result. The method 100 can therefore separate, from a large number of nucleotide sequences in which multiple single cells are mixed, the nucleotide sequences of each single cell according to their cell of origin, thereby improving the accuracy of single-cell sequencing.
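Once each cluster is associated with a tag, separating the reads by cell of origin reduces to a grouping operation. The sketch below assumes that a mapping from cluster identifier to tag is already available; the names `split_by_cell`, `cluster_tags` and the dictionary layout are hypothetical and are not taken from the disclosure.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def split_by_cell(
    clusters: Dict[Hashable, List[str]],
    cluster_tags: Dict[Hashable, str],
    sequences: Dict[str, str],
) -> Dict[str, List[str]]:
    """Group nucleotide sequences by the tag of the cluster they belong to."""
    per_cell: Dict[str, List[str]] = defaultdict(list)
    for cluster_id, members in clusters.items():
        tag = cluster_tags[cluster_id]  # tag identifies the cell of origin
        per_cell[tag].extend(sequences[read_id] for read_id in members)
    return dict(per_cell)
```

Clusters that share a tag are pooled, so the output has one entry per single cell.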
  • FIG. 11 is a block diagram of an apparatus 1100 for single-cell sequencing according to an embodiment of the present disclosure.
  • The single-cell sequencing apparatus 1100 includes an acquisition module 1110, a first similarity clustering module 1120, a determination module 1130, a second similarity clustering module 1140 and a cluster optimization module 1150.
  • The acquisition module 1110 is configured to acquire a plurality of nucleotide sequences in the sequencing library and the nanopore sequencing signals corresponding to the plurality of nucleotide sequences.
  • The first similarity clustering module 1120 is configured to perform first clustering on the plurality of nucleotide sequences based on a first similarity threshold to obtain a first plurality of clusters, the first plurality of clusters including a largest cluster having the largest cluster size.
  • The determination module 1130 is configured to determine the merge threshold based on the average signal length of the nanopore sequencing signals corresponding to the plurality of nucleotide sequences and on the nanopore sequencing signals corresponding to the nucleotide sequences in the largest cluster.
  • The second similarity clustering module 1140 is configured to perform first clustering on the plurality of nucleotide sequences based on a second similarity threshold to obtain a second plurality of clusters, the first similarity threshold being greater than the second similarity threshold.
  • The cluster optimization module 1150 is configured to perform cluster optimization on the second plurality of clusters based on the merge threshold to obtain a third plurality of clusters.
  • The determination module 1130 includes a first selection submodule 1131, a first calculation submodule 1132 and a first determination submodule 1133.
  • The first selection submodule 1131 is configured to randomly select a first threshold number of nanopore sequencing signals from the nanopore sequencing signals corresponding to the nucleotide sequences in the largest cluster.
  • The first calculation submodule 1132 is configured to calculate a first dynamic time warping distance between every two nanopore sequencing signals among the first threshold number of nanopore sequencing signals.
  • The first determination submodule 1133 is configured to determine the merge threshold based on the sum of the first dynamic time warping distances, the average signal length of the nanopore sequencing signals corresponding to the plurality of nucleotide sequences, and the size of the largest cluster.
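The three quantities that feed the merge threshold can be gathered as in the sketch below. This passage does not give the formula that combines them, so the sketch deliberately stops at computing the inputs and does not invent one. The function name `merge_threshold_inputs`, the `dist` callable (a stand-in for the first dynamic time warping distance) and the data layout are assumptions made for illustration.

```python
import itertools
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Signal = Sequence[float]


def merge_threshold_inputs(
    all_signals: Dict[str, Signal],
    largest_cluster: List[str],
    sample_size: int,
    dist: Callable[[Signal, Signal], float],
    rng: Optional[random.Random] = None,
) -> Tuple[float, float, int]:
    """Return (sum of pairwise distances, mean signal length, cluster size)."""
    rng = rng or random.Random()
    # Randomly pick up to `sample_size` reads from the largest cluster.
    k = min(sample_size, len(largest_cluster))
    sample = rng.sample(list(largest_cluster), k)
    # Sum the distance over every unordered pair of sampled signals.
    distance_sum = sum(
        dist(all_signals[a], all_signals[b])
        for a, b in itertools.combinations(sample, 2)
    )
    # The average signal length is taken over all reads, not only the sample.
    mean_length = sum(len(s) for s in all_signals.values()) / len(all_signals)
    return distance_sum, mean_length, len(largest_cluster)
```

Sampling a fixed number of signals bounds the number of pairwise distance computations regardless of how large the largest cluster is.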
  • The cluster optimization module 1150 includes a second determination submodule 1151, a second selection submodule 1152, a second calculation submodule 1153 and a merging submodule 1154.
  • The second determination submodule 1151 is configured to determine a first subset of the second plurality of clusters based on the size of the largest cluster and the third threshold.
  • The second selection submodule 1152 is configured to, for each cluster in the first subset, randomly select a fourth threshold number of nanopore sequencing signals from the nanopore sequencing signals corresponding to the nucleotide sequences in the cluster.
  • The second calculation submodule 1153 is configured to calculate the respective second dynamic time warping distances between each of the fourth threshold number of nanopore sequencing signals randomly selected from the cluster and the fourth threshold number of nanopore sequencing signals randomly selected from another cluster in the first subset.
  • The merging submodule 1154 is configured to, in response to determining that the respective second dynamic time warping distances are all smaller than the merge threshold, merge the cluster and the other cluster to obtain a merged first subset.
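The sampling-based merge performed by submodules 1152 to 1154 can be sketched as follows. It is a minimal illustration under assumptions: `merge_similar_clusters`, the list-of-lists layout and the `dist` callable (standing in for the second dynamic time warping distance) are hypothetical names, and the order in which cluster pairs are visited is a choice made for the example rather than something the text prescribes.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence

Signal = Sequence[float]


def merge_similar_clusters(
    first_subset: List[List[str]],
    signals: Dict[str, Signal],
    sample_size: int,
    dist: Callable[[Signal, Signal], float],
    merge_threshold: float,
    rng: Optional[random.Random] = None,
) -> List[List[str]]:
    """Merge clusters whose sampled signals are all mutually close."""
    rng = rng or random.Random()
    merged: List[List[str]] = []
    for members in first_subset:
        sample = rng.sample(members, min(sample_size, len(members)))
        target = None
        for existing in merged:
            reference = rng.sample(existing, min(sample_size, len(existing)))
            # Merge only if every sampled cross-cluster pair passes.
            if all(
                dist(signals[a], signals[b]) < merge_threshold
                for a in sample
                for b in reference
            ):
                target = existing
                break
        if target is None:
            merged.append(list(members))
        else:
            target.extend(members)
    return merged
```

Comparing a bounded sample from each cluster, rather than all members, keeps the number of distance evaluations per cluster pair at most `sample_size` squared.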
  • In response to a second subset of the second plurality of clusters, excluding the merged first subset, being non-empty, the cluster optimization module 1150 further includes a third determination submodule 1155, a third calculation submodule 1156, a first update submodule 1157 and a second update submodule 1158.
  • The third determination submodule 1155 is configured to determine the consensus sequence signal corresponding to each cluster in the merged first subset.
  • The third calculation submodule 1156 is configured to, for each nucleotide sequence included in the second subset and for each consensus sequence signal, calculate the third dynamic time warping distance between the nanopore sequencing signal corresponding to the nucleotide sequence and the consensus sequence signal.
  • The first update submodule 1157 is configured to, in response to determining that the third dynamic time warping distance is smaller than the merge threshold, add the nucleotide sequence to the cluster in the merged first subset that corresponds to the consensus sequence signal, so as to update the merged first subset.
  • The second update submodule 1158 is configured to remove from the second subset the nucleotide sequences added to the merged first subset, so as to update the second subset.
  • Each module of the apparatus 1100 shown in FIG. 11 may correspond to a respective step of the method 100 described above with reference to FIGS. 1-10.
  • The operations, features and advantages described above with respect to the method 100 therefore also apply to the apparatus 1100 and the modules it includes. For the sake of brevity, some operations, features and advantages are not described again here.
  • A discussion herein of a particular module performing an action includes the particular module itself performing the action, or alternatively the particular module invoking or otherwise accessing another component or module that performs the action (or performs the action in conjunction with the particular module). Accordingly, a particular module that performs an action may include the particular module that performs the action itself and/or another module that the particular module calls or otherwise accesses that performs the action.
  • According to an aspect of the present disclosure, an electronic device is provided, including: at least one processor; and at least one memory communicatively connected to the at least one processor. The at least one memory stores instructions that, when executed by the at least one processor, cause the at least one processor to execute the above method.
  • According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing instructions is provided. When executed by at least one processor of a computer, the instructions cause the computer to execute the above method.
  • According to another aspect of the present disclosure, a computer program product is provided, including a computer program that implements the above method when executed by a processor.
  • FIG. 12 shows an example configuration of an electronic device 1200 that may be used to implement the methods described herein.
  • Electronic device 1200 may be any of various types of devices. Examples of electronic device 1200 include, but are not limited to: desktop computers, server computers, notebook or netbook computers, mobile devices (e.g., tablet computers, cellular or other wireless telephones (e.g., smartphones), notepad computers, mobile stations), wearable devices (e.g., glasses, watches), entertainment devices (e.g., entertainment appliances, set-top boxes communicatively coupled to display devices, game consoles), televisions or other display devices, automotive computers, and the like.
  • Electronic device 1200 may include at least one processor 1202, memory 1204, communication interface(s) 1206, display device 1208, other input/output (I/O) devices 1210 and one or more mass storage devices 1212, which are capable of communicating with each other, such as through a system bus 1214 or other suitable connection.
  • The processor 1202 may be a single processing unit or multiple processing units, and each processing unit may include one or more computing units or cores.
  • Processor 1202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuits, and/or any device that manipulates signals based on operational instructions.
  • Processor 1202 may be configured to retrieve and execute computer-readable instructions stored in memory 1204, mass storage device 1212, or other computer-readable media, such as program code of the operating system 1216, program code of the application programs 1218, program code of other programs 1220, and so on.
  • Memory 1204 and mass storage device 1212 are examples of computer-readable storage media for storing instructions that are executed by processor 1202 to implement the various functions described above.
  • Memory 1204 may generally include both volatile and non-volatile memory (e.g., RAM, ROM, etc.).
  • Mass storage devices 1212 may generally include hard drives, solid state drives, removable media including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), storage arrays, network attached storage, storage area networks, and so on.
  • The memory 1204 and the mass storage device 1212 may be collectively referred to herein as memory or computer-readable storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by the processor 1202 as a specific machine configured to implement the operations and functions described in the examples herein.
  • Programs may be stored on mass storage device 1212. These programs include the operating system 1216, one or more application programs 1218, other programs 1220 and program data 1222, and they may be loaded into memory 1204 for execution. Examples of such application programs or program modules may include computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: the acquisition module 1110, the first similarity clustering module 1120, the determination module 1130, the second similarity clustering module 1140, the cluster optimization module 1150, the method 100 (including any suitable steps of the method 100), and/or additional embodiments described herein.
  • Modules 1216, 1218, 1220 and 1222 may be implemented using any form of computer-readable media that is accessible by electronic device 1200.
  • "Computer-readable media" includes at least two types of computer-readable media, namely computer-readable storage media and communication media.
  • Computer-readable storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data.
  • Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage devices, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by an electronic device.
  • In contrast, communication media may embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism.
  • Computer-readable storage media as defined herein do not include communication media.
  • One or more communication interfaces 1206 are used to exchange data with other devices, such as over a network, a direct connection, and the like.
  • Such communication interfaces may be one or more of the following: any type of network interface (e.g., a network interface card (NIC)), a wired or wireless interface (such as an IEEE 802.11 wireless LAN (WLAN) interface), a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, etc.
  • The communication interface 1206 can facilitate communication within a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and the like. Communication interface 1206 may also provide for communication with external storage devices (not shown), such as in storage arrays, network attached storage, storage area networks, and the like.
  • A display device 1208, such as a monitor, may be included for displaying information and images to a user.
  • Other I/O devices 1210 may be devices that receive various inputs from the user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, etc.
  • A cloud includes and/or represents a platform for resources.
  • The platform abstracts the underlying functionality of the cloud's hardware (e.g., servers) and software resources.
  • Resources may include applications and/or data that may be used when computing processing is performed on a server remote from the electronic device 1200.
  • Resources may also include services provided over the Internet and/or over a subscriber network, such as a cellular or Wi-Fi network.
  • The platform can abstract resources and functions to connect the electronic device 1200 with other electronic devices. Accordingly, implementation of the functionality described herein may be distributed throughout the cloud. For example, the functions may be implemented partly on the electronic device 1200 and partly through a platform that abstracts the functions of the cloud.

Abstract

The present invention provides a single-cell sequencing method. The method comprises: acquiring a plurality of nucleotide sequences from a sequencing library, together with the nanopore sequencing signals corresponding to the plurality of nucleotide sequences; performing first clustering on the plurality of nucleotide sequences on the basis of a first similarity threshold to obtain a first plurality of clusters, wherein the first plurality of clusters comprises a largest cluster having the largest cluster size; determining a merge threshold on the basis of the average signal length of the nanopore sequencing signals corresponding to the plurality of nucleotide sequences and of the nanopore sequencing signal corresponding to each nucleotide sequence in the largest cluster; performing first clustering on the plurality of nucleotide sequences on the basis of a second similarity threshold to obtain a second plurality of clusters, wherein the first similarity threshold is greater than the second similarity threshold; and performing cluster optimization on the second plurality of clusters on the basis of the merge threshold to obtain a third plurality of clusters.

Description

Method, apparatus, device, medium and program product for single-cell sequencing
Technical field
The present disclosure relates to the technical field of single-cell sequencing, and in particular to a method, an apparatus, an electronic device, a computer-readable storage medium and a computer program product for single-cell sequencing.
Background
The continuous development of second- and third-generation sequencing technologies has brought about great changes in biology. Initially, researchers extracted a sufficient quantity of DNA from a large number of cells and then sequenced it. Such a sequencing result is an "average" over those DNA samples. Because of cellular heterogeneity, cells of the same phenotype may differ significantly in their genetic information, and much low-abundance information is lost in the bulk characterization. Single-cell sequencing emerged to overcome these limitations of traditional high-throughput sequencing.
Single-cell sequencing refers to a technology for high-throughput sequencing analysis of the genome, transcriptome and epigenome at the level of a single cell. It can reveal the gene structure and gene expression state of individual cells and reflect the heterogeneity between cells. It plays an important role in fields such as oncology, developmental biology, microbiology and neuroscience, and is becoming a focus of life science research. In the related art, there is still considerable room for improvement in single-cell sequencing.
Summary
It would be advantageous to provide a mechanism that alleviates, mitigates or even eliminates one or more of the above-mentioned problems.
According to an aspect of the present disclosure, a method for single-cell sequencing is provided, including:
acquiring a plurality of nucleotide sequences in a sequencing library and the nanopore sequencing signals corresponding to the plurality of nucleotide sequences; performing first clustering on the plurality of nucleotide sequences based on a first similarity threshold to obtain a first plurality of clusters, the first plurality of clusters including a largest cluster having the largest cluster size; determining a merge threshold based on the average signal length of the nanopore sequencing signals corresponding to the plurality of nucleotide sequences and on the nanopore sequencing signals corresponding to the nucleotide sequences in the largest cluster; performing first clustering on the plurality of nucleotide sequences based on a second similarity threshold to obtain a second plurality of clusters, the first similarity threshold being greater than the second similarity threshold; and performing cluster optimization on the second plurality of clusters based on the merge threshold to obtain a third plurality of clusters.
According to another aspect of the present disclosure, an apparatus for single-cell sequencing is provided, including modules for implementing the above method.
According to an aspect of the present disclosure, an electronic device is provided, including: at least one processor; and at least one memory communicatively connected to the at least one processor. The at least one memory stores instructions that, when executed by the at least one processor, cause the at least one processor to execute the above method.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing instructions is provided. When executed by at least one processor of a computer, the instructions cause the computer to execute the above method.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program that implements the above method when executed by a processor.
These and other aspects of the present disclosure will be apparent from, and elucidated with reference to, the embodiments described hereinafter.
Brief description of the drawings
Further details, features and advantages of the present disclosure are disclosed in the following description of example embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of a method for single-cell sequencing according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of an example process of performing first clustering on sequences in the method of FIG. 1 according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an example process of performing first clustering on sequences in the method of FIG. 1 according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of an example process of determining a merge threshold in the method of FIG. 1 according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure;
FIGS. 6A-6B are schematic diagrams of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure;
FIG. 7 is a flowchart of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure;
FIG. 10 is a flowchart of an example process of cluster optimization in the method of FIG. 1 according to an embodiment of the present disclosure;
FIG. 11 is a block diagram of an apparatus for single-cell sequencing according to an embodiment of the present disclosure; and
FIG. 12 is a block diagram of an electronic device for single-cell sequencing according to an embodiment of the present disclosure.
Detailed description
In the present disclosure, unless otherwise stated, the use of the terms "first", "second" and the like to describe various elements is not intended to limit the positional, temporal or importance relationship of those elements; such terms are only used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, while in some cases, based on the context, they may refer to different instances.
The terminology used in the description of the various examples in the present disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of an element is not specifically limited, there may be one or more of that element. As used herein, the term "plurality" means two or more, and the term "based on" should be interpreted as "based at least in part on". In addition, the terms "and/or" and "at least one of" cover any one of, and all possible combinations of, the listed items.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present disclosure belongs. It will be further understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the relevant art and/or in the context of this specification, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the related art, there are two main approaches to single-cell sequencing. The first isolates individual cells, constructs a sequencing library for each cell independently, and then sequences each library. However, isolating single cells one by one and building and sequencing a library for each gives very low throughput, mainly because of cost.
To overcome this difficulty, a second strategy has been widely adopted in recent years: barcode-based single-cell identification. Its main idea is to attach a unique DNA sequence to each cell, so that during sequencing, sequences carrying the same barcode are regarded as coming from the same cell.
The barcode-based single-cell identification strategy requires that the sequenced reads be "demultiplexed", that is, placed into corresponding "bins" according to their barcodes. The DNA sequences in each bin carry the same barcode and come from the same single cell. Some demultiplexing methods already exist, but they rely mainly on deep learning. Deep learning methods depend heavily on the data set and require users to train on samples in advance, so they are not general-purpose.
Third-generation sequencing provides two kinds of information: nanopore sequencing signals and the nucleotide sequences obtained by translating them. In the related art, one approach directly computes the pairwise dynamic time warping (DTW) distance matrix between nanopore sequencing signals; based on this distance matrix, machine learning methods (spectral clustering, hierarchical clustering, k-means clustering) can produce perfect clustering results, but the clustering efficiency is very low. Another approach uses the nucleotide sequence information with clustering tools (for example, CD-HIT); it is very fast, but its clustering accuracy is particularly poor.
As will become clearer from the following description, the embodiments of the present disclosure use both kinds of information provided by third-generation sequencing (nanopore sequencing signals and nucleotide sequence information) to perform hybrid clustering on the nucleotide sequences, which improves clustering accuracy and broadens the generality of the clustering.
FIG. 1 is a flowchart of a method 100 for single-cell sequencing according to an embodiment of the present disclosure. As shown in FIG. 1, the method 100 includes steps 110 to 150.
In step 110, a plurality of nucleotide sequences in a sequencing library and the nanopore sequencing signals corresponding to the plurality of nucleotide sequences are obtained. In one example, the sequencing library may include a plurality of nucleotide sequences from a plurality of single cells. In one example, nanopore sequencing technology may be used to obtain the nanopore sequencing signal corresponding to a sequence, and the nucleotide sequence is then obtained by translating that signal.
In step 120, based on a first similarity threshold, first clustering is performed on the plurality of nucleotide sequences to obtain a first plurality of clusters, the first plurality of clusters including a largest cluster having the largest cluster size. In some exemplary embodiments, the first clustering may be performed with a greedy clustering algorithm that uses only the nucleotide sequence information. In one example, each cluster in the first plurality of clusters includes one or several nucleotide sequences. It should be understood that "cluster size" in this application refers to the number of nucleotide sequences included in a cluster. In one example, the first plurality of clusters may be sorted by cluster size to obtain the largest cluster.
In step 130, a merge threshold is determined based on the mean signal length of the nanopore sequencing signals corresponding to the plurality of nucleotide sequences and on the nanopore sequencing signals corresponding to the nucleotide sequences in the largest cluster.
In step 140, based on a second similarity threshold, the first clustering is performed on the plurality of nucleotide sequences to obtain a second plurality of clusters, where the first similarity threshold is greater than the second similarity threshold. In one example, the first clustering is first performed with the higher first similarity threshold (for example, 90%), and its result is used to compute the merge threshold. The first clustering is then performed again with the lower second similarity threshold (for example, 85%) to obtain the second plurality of clusters. After the second plurality of clusters is obtained, the nanopore sequencing signal information can be used to refine it. In some exemplary embodiments, the read lengths of the nanopore sequencing signals corresponding to all sequences can be obtained, and the mean signal length is the mean of all these read lengths. Depending on the mean signal length, the merge threshold can be determined from the nanopore sequencing signals corresponding to all nucleotide sequences in the largest cluster. For example, the merge threshold may be a function of those signals.
In step 150, clustering optimization is performed on the second plurality of clusters based on the merge threshold to obtain a third plurality of clusters. In some exemplary embodiments, some clusters in the second plurality of clusters may be merged based on the merge threshold. In some examples, the merged clusters may be further refined multiple times. The third plurality of clusters obtained after optimization is the final clustering result.
In summary, the method 100 performs hybrid clustering on nucleotide sequences using the two kinds of information obtained by nanopore sequencing (nanopore sequencing signals and nucleotide sequences). The nucleotide sequences are first clustered with the first similarity threshold; the merge threshold is then determined from the result of that clustering together with the nanopore sequencing signals; next, the nucleotide sequences are clustered with the second similarity threshold; finally, the merge threshold is used to merge and refine the result of the clustering performed with the second similarity threshold. Compared with clustering on nucleotide sequence information alone, the method 100 improves clustering accuracy; compared with clustering on nanopore sequencing signals alone, it improves clustering efficiency. In addition, the method 100 does not require training on a large number of samples and only needs the two kinds of information as input, so it is highly general.
FIG. 2 is a flowchart of an example process of performing the first clustering on sequences in the method 100 of FIG. 1 according to an embodiment of the present disclosure. As shown in FIG. 2, performing the first clustering on the plurality of nucleotide sequences (step 120) includes steps 210 to 270. Steps 210 to 270 as a whole form an iterative process: they are executed repeatedly until the set of nucleotide sequences to be clustered is empty.
In step 210, a representative sequence of the set of nucleotide sequences to be clustered is determined. In the first iteration, the set of nucleotide sequences to be clustered is the plurality of nucleotide sequences itself. In some exemplary embodiments, determining the representative sequence includes selecting the nucleotide sequence with the longest length in the set as the representative sequence. In one example, the nucleotide sequences to be clustered may be sorted in descending order of sequence length, and the longest nucleotide sequence is then selected as the representative sequence.
In step 220, the set of nucleotide sequences to be clustered is filtered with a short word filter. In one example, the short word filter may filter out sequences of shorter length. Filtering with the short word filter first reduces the number of subsequent pairwise sequence alignments.
In step 230, it is determined whether the filtered set of nucleotide sequences to be clustered is empty.
In step 240, in response to the filtered set of nucleotide sequences to be clustered being non-empty, for each nucleotide sequence in the filtered set, the similarity between that nucleotide sequence and the representative sequence is determined. In one example, the similarity between two different sequences can be determined by sequence alignment, for example with the sequence identity algorithm of BLAST or with a gap compression algorithm.
In step 250, in response to determining that the similarity is greater than or equal to the first similarity threshold, the nucleotide sequence is added to the similarity cluster that includes the representative sequence. In one example, after the similarity between each nucleotide sequence and the representative sequence has been determined, the nucleotide sequences whose similarity exceeds the preset first similarity threshold are added to the similarity cluster containing the representative sequence. Once all sequences in the filtered set have been checked, every sequence exceeding the first similarity threshold has been added to that similarity cluster, which completes the generation of one cluster.
In step 260, in response to the filtered set of nucleotide sequences to be clustered being empty, the representative sequence is added to a short word cluster. In one example, if all sequences in the set to be clustered are filtered out by the short word filter, the representative sequence by itself forms a short word cluster.
In step 270, the nucleotide sequences in the similarity cluster and the short word cluster are removed from the set of nucleotide sequences to be clustered, so as to update that set. Within one iteration, after a similarity cluster has been generated through steps 240 and 250, the nucleotide sequences in that cluster (that is, the sequences already clustered) are removed from the set to be clustered. The updated set then returns to the initial step 210 of the iteration, where a new representative sequence is determined and another similarity cluster is generated. After multiple iterations, the set to be clustered may still contain sequences that are filtered out by the short word filter and cannot take part in similarity clustering. For these sequences, a representative sequence can be selected to generate a short word cluster that contains only the representative sequence itself. Likewise, sequences that have formed short word clusters are removed from the set to be clustered. The iteration stops when the set to be clustered is empty. At that point, the similarity clusters and short word clusters obtained during the iteration form the first plurality of clusters. It should be understood that steps 210 to 270 apply equally to the first clustering performed with the second similarity threshold, which yields the second plurality of clusters; it is only necessary to replace the first similarity threshold in step 250 with the second similarity threshold.
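Steps 210 to 270 can be illustrated with a minimal Python sketch. It is not the algorithm of the disclosure itself: `difflib.SequenceMatcher` is swapped in as a stand-in for an alignment-based identity score, and the length-based `passes_short_word_filter` helper, the `min_len` parameter, and all function names are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; an alignment-based identity would be used in practice."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def passes_short_word_filter(seq: str, min_len: int) -> bool:
    """Hypothetical filter: keep only sequences with at least min_len bases."""
    return len(seq) >= min_len


def greedy_cluster(sequences, identity, min_len=4):
    """Greedy clustering: the longest pending sequence becomes the representative."""
    pending = sorted(sequences, key=len, reverse=True)
    clusters = []
    while pending:  # iterate until the set to be clustered is empty
        center, rest = pending[0], pending[1:]  # step 210
        cluster, remaining = [center], []
        for seq in rest:
            # step 220 (filter) and steps 240-250 (similarity test against the representative)
            if passes_short_word_filter(seq, min_len) and similarity(seq, center) >= identity:
                cluster.append(seq)
            else:
                remaining.append(seq)
        # If nothing passed the filter, the cluster holds only the representative (step 260).
        clusters.append(cluster)
        pending = remaining  # step 270: remove clustered sequences
    return clusters


reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "AC"]
clusters = greedy_cluster(reads, identity=0.85)
```

Running the same function with a lower `identity` value corresponds to repeating the first clustering with the second similarity threshold.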
In summary, the embodiments of the present disclosure can use the nucleotide sequence information to perform the first clustering with different similarity thresholds, and use a short word filter during clustering to improve clustering efficiency.
In some exemplary embodiments, steps 210 to 270 may be implemented by Algorithm 1 shown below:
[Figure PCTCN2021116704-appb-000001: Algorithm 1]
In Algorithm 1, N denotes the set of nucleotide sequences to be clustered, NS denotes the set of nucleotide sequences filtered out by the short word filter, S denotes the filtered set of nucleotide sequences to be clustered, center_now denotes the representative sequence, identity denotes the similarity threshold (for example, the first similarity threshold or the second similarity threshold), c denotes the similarity between a sequence x in the filtered set and the representative sequence center_now, cluster_now denotes a similarity cluster or a short word cluster, and Clusters denotes the first plurality of clusters.
FIG. 3 is a schematic diagram of an example process of performing the first clustering on sequences in the method 100 of FIG. 1 according to an embodiment of the present disclosure. As shown in FIG. 3, the set 310 of nucleotide sequences to be clustered is first sorted in descending order of nucleotide sequence length to obtain the sorted set 320. The nucleotide sequence with the longest length in the set 320, that is, the first nucleotide sequence, is then selected as the representative sequence 321. The set 310 or the sorted set 320 is input to a short word filter 330 to filter out nucleotide sequences of shorter length, such as sequence 326. The sequence alignment module 340 then computes the similarity between each nucleotide sequence in the filtered set and the representative sequence 321. For example, the similarity between sequence 322 and the representative sequence 321 is greater than or equal to the first similarity threshold, while the similarities between sequences 323, 324, and 325 and the representative sequence 321 are all smaller than the first similarity threshold. Therefore, sequence 322 is added to the similarity cluster 350 that includes the representative sequence 321, and sequence 322 is removed from the set 310 to obtain the updated set 360 of nucleotide sequences to be clustered. The updated set 360 then undergoes repeated iterations 370 until it is empty. After the iterations are complete, the similarity clusters 380-1, 380-2 through 380-k and the short word clusters 390-1, 390-2 through 390-t form the first plurality of clusters.
FIG. 4 is a flowchart of an example process of determining the merge threshold in the method 100 of FIG. 1 according to an embodiment of the present disclosure. As shown in FIG. 4, determining the merge threshold (step 130) includes steps 410 to 430.
In step 410, a first threshold number of nanopore sequencing signals are randomly selected from the nanopore sequencing signals corresponding to the nucleotide sequences in the largest cluster. In some exemplary embodiments, the number of nanopore sequencing signals to select from the largest cluster may be determined according to the largest cluster size. In one example, in response to determining that the largest cluster size is greater than a second threshold, the first threshold equals the second threshold. In another example, in response to determining that the largest cluster size is less than or equal to the second threshold, the first threshold equals the largest cluster size.
In step 420, a first dynamic time warping distance is computed between every two nanopore sequencing signals among the first threshold number of nanopore sequencing signals. Given two nanopore sequencing signals, the dynamic time warping (DTW) distance between them can be computed.
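The DTW distance between two signals can be sketched with the classic dynamic programming recurrence. The sketch below assumes one-dimensional numeric signals and an absolute-difference local cost; the disclosure does not state which local cost or DTW variant is used, and the function name `dtw_distance` is hypothetical.

```python
def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) DTW with absolute-difference local cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    # prev holds row i-1 of the cumulative cost matrix; row 0 is [0, inf, inf, ...].
    prev = [0.0] + [inf] * m
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


d_same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
d_offset = dtw_distance([0, 0], [1, 1])
```

Because DTW allows one sample to align with several samples of the other signal, two signals with the same shape but different lengths can still have a distance of zero, which is why it suits nanopore signals whose dwell times vary.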
In step 430, the merge threshold is determined based on the sum of the first dynamic time warping distances, the mean signal length of the nanopore sequencing signals corresponding to the plurality of nucleotide sequences, and the largest cluster size. In some exemplary embodiments, the method of computing the merge threshold may be chosen according to the mean signal length. In one example, the merge threshold is a function of the mean of all pairwise DTW distances between the signals.
In some exemplary embodiments, steps 410 to 430 may be implemented by Algorithm 2 shown below:
[Figure PCTCN2021116704-appb-000002: Algorithm 2]
In Algorithm 2, a high similarity threshold (for example, 90%) can be set to run the first clustering method shown in Algorithm 1 and obtain the first plurality of clusters. Here, MaxCluster denotes the largest cluster, MaxLength denotes the largest cluster size, AveLenSig denotes the mean signal length of the nanopore sequencing signals corresponding to the plurality of nucleotide sequences, sum denotes the sum of the first dynamic time warping distances, sum/MaxLength denotes the mean of all first dynamic time warping distances, and Threshold denotes the merge threshold. In Algorithm 2, the second threshold is 10. When the largest cluster size MaxLength is greater than the second threshold (10), the first threshold equals the second threshold (10), so 10 sequences are randomly selected from the largest cluster MaxCluster. When MaxLength is less than or equal to the second threshold (10), the first threshold equals MaxLength, so MaxLength sequences are randomly selected from MaxCluster.
Further, the merge threshold Threshold is computed differently depending on the mean signal length AveLenSig. In one example, the merge threshold Threshold can also be expressed as:
Threshold = sum / MaxLength + c
where c is a constant that can be determined by testing on a large number of constructed simulation data sets.
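The threshold formula can be sketched in Python as follows. This is an illustration under stated assumptions: the constant `c` is supplied by the caller because the text does not give its value or how it depends on AveLenSig, the sum is divided by the largest cluster size exactly as the formula is written, and the names `merge_threshold`, `sample_cap`, and `dtw_distance` are hypothetical.

```python
import random
from itertools import combinations


def dtw_distance(x, y):
    """Classic DTW with absolute-difference local cost (an assumed variant)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)
    for xi in x:
        curr = [inf] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            curr[j] = abs(xi - yj) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(y)]


def merge_threshold(max_cluster_signals, c, sample_cap=10, rng=None):
    """Threshold = sum / MaxLength + c over randomly sampled signals of the largest cluster."""
    rng = rng or random.Random()
    max_length = len(max_cluster_signals)  # MaxLength: size of the largest cluster
    if max_length == 0:
        raise ValueError("the largest cluster must contain at least one signal")
    # First threshold: sample_cap if the cluster is larger than it, else the cluster size.
    k = sample_cap if max_length > sample_cap else max_length
    sampled = rng.sample(max_cluster_signals, k)
    # Sum of the pairwise (first) DTW distances among the sampled signals.
    total = sum(dtw_distance(a, b) for a, b in combinations(sampled, 2))
    return total / max_length + c


signals = [[0, 0], [1, 1], [0, 0]]
threshold = merge_threshold(signals, c=1.0, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when tuning `c` on simulated data.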
In summary, the embodiments of the present disclosure first cluster the nucleotide sequence information with a high similarity threshold and then use the nanopore sequencing signals corresponding to the nucleotide sequences to determine the merge threshold. In determining the merge threshold, the similarity between every two signals is measured by the first dynamic time warping distance. The embodiments therefore combine both kinds of information, which can be used to improve clustering accuracy.
FIG. 5 is a flowchart of an example process of clustering optimization in the method 100 of FIG. 1 according to an embodiment of the present disclosure. As shown in FIG. 5, performing clustering optimization on the second plurality of clusters based on the merge threshold (step 150) includes steps 510 to 540.
In step 510, a first subset of the second plurality of clusters is determined, where every cluster in the first subset has a cluster size greater than a third threshold. In one example, the second plurality of clusters may be classified into good clusters (the first subset) and bad clusters according to the cluster size of each cluster and the third threshold. For example, if the cluster size of a cluster is greater than the third threshold (which may be set to 5, for example), the cluster belongs to the first subset. In some exemplary embodiments, the criterion used for the classification may depend on the largest cluster size. In one example, determining the first subset includes: in response to determining that the largest cluster size is greater than the third threshold, determining the clusters in the second plurality of clusters whose size is greater than the third threshold to form the first subset; and in response to determining that the largest cluster size is less than or equal to the third threshold, determining the clusters in the second plurality of clusters whose size equals the largest cluster size to form the first subset.
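The selection rule of step 510 can be sketched as a simple partition. The function name `select_good_clusters` and the representation of a cluster as a Python list are assumptions made for illustration; the default `size_threshold=5` follows the example value given for the third threshold.

```python
def select_good_clusters(clusters, size_threshold=5):
    """Split clusters into the first subset (good) and the rest (bad) by cluster size."""
    if not clusters:
        return [], []
    max_size = max(len(cluster) for cluster in clusters)

    def is_good(cluster):
        if max_size > size_threshold:
            # Normal case: keep every cluster larger than the third threshold.
            return len(cluster) > size_threshold
        # Fallback: no cluster exceeds the threshold, so keep those of maximal size.
        return len(cluster) == max_size

    good = [cluster for cluster in clusters if is_good(cluster)]
    bad = [cluster for cluster in clusters if not is_good(cluster)]
    return good, bad


clusters = [list(range(7)), list(range(3)), list(range(6)), list(range(2))]
good, bad = select_good_clusters(clusters)
```

The fallback branch guarantees that the first subset is never empty for non-empty input, even when every cluster is small.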
In step 520, for each cluster in the first subset, a fourth threshold number of nanopore sequencing signals are randomly selected from the nanopore sequencing signals corresponding to the nucleotide sequences in that cluster.
In step 530, the respective second dynamic time warping distances are computed between each of the fourth threshold number of nanopore sequencing signals randomly selected from that cluster and the fourth threshold number of nanopore sequencing signals randomly selected from another cluster in the first subset.
In step 540, in response to determining that the respective second dynamic time warping distances are all smaller than the merge threshold, that cluster and the other cluster are merged to obtain the merged first subset.
In some exemplary embodiments, after the good clusters (the first subset) are obtained, the nanopore sequencing signals corresponding to a fourth threshold number of sequences can be randomly selected from each cluster in the first subset. The fourth threshold may be set to 3, for example. The second dynamic time warping distances between the signals selected from every two clusters are then computed. If the second dynamic time warping distances between the fourth threshold number of signals from one cluster and the fourth threshold number of signals from another cluster are all smaller than the merge threshold, the two clusters are merged. Performing this operation on all clusters in the first subset yields the merged first subset.
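The merging rule can be sketched as follows. Several details are assumptions not fixed by the text: each cluster is represented directly as a list of numeric signals, clusters are visited in a single greedy pass (the text does not specify the order in which pairs are compared), and the names `should_merge`, `merge_good_clusters`, and `dtw_distance` are hypothetical.

```python
import random


def dtw_distance(x, y):
    """Classic DTW with absolute-difference local cost (an assumed variant)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)
    for xi in x:
        curr = [inf] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            curr[j] = abs(xi - yj) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(y)]


def should_merge(signals_a, signals_b, threshold, k=3, rng=None):
    """True if all DTW distances between k sampled signals of each cluster are below threshold."""
    rng = rng or random.Random()
    sample_a = rng.sample(signals_a, min(k, len(signals_a)))
    sample_b = rng.sample(signals_b, min(k, len(signals_b)))
    return all(dtw_distance(a, b) < threshold for a in sample_a for b in sample_b)


def merge_good_clusters(good_clusters, threshold, k=3, rng=None):
    """Single greedy pass: fold each cluster into the first merged cluster it matches."""
    merged = []
    for cluster in good_clusters:
        for target in merged:
            if should_merge(target, cluster, threshold, k, rng):
                target.extend(cluster)
                break
        else:
            # No existing merged cluster matched, so this cluster starts a new one.
            merged.append(list(cluster))
    return merged


good_clusters = [
    [[0, 0], [0, 0], [0, 0]],
    [[0, 0], [0, 1], [0, 0]],
    [[9, 9], [9, 9], [9, 9]],
]
merged = merge_good_clusters(good_clusters, threshold=2, rng=random.Random(0))
```

Comparing only k sampled signals per cluster keeps the number of DTW computations per cluster pair at k × k, rather than growing with the product of the cluster sizes.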
In some exemplary embodiments, steps 510 to 540 may be implemented by Algorithm 3 shown below:
[Figure PCTCN2021116704-appb-000003: Algorithm 3]
As shown in Algorithm 3, the function GetMerThCFSignal can implement steps 510 to 530. In the function GetMerThCFSignal, MaxLength denotes the largest cluster size, and the third threshold is set to 5. When the largest cluster size MaxLength is greater than the third threshold (5), the clusters whose size is greater than the third threshold (5) are selected from the second plurality of clusters to form the first subset GoodCluster. When MaxLength is less than or equal to the third threshold (5), the clusters whose size equals MaxLength are selected from the second plurality of clusters to form the first subset GoodCluster. Further, after the first subset GoodCluster is merged through steps 530 and 540, the merged first subset RefineGoodCluster is obtained.
FIGS. 6A-6B are schematic diagrams of an example process of clustering optimization in the method 100 of FIG. 1 according to an embodiment of the present disclosure. Referring first to FIG. 6A, it shows the second plurality of clusters 610, 620, 630, 640, 650, and 660 obtained after 30 nucleotide sequences undergo the first clustering with the second similarity threshold. Sequences drawn with the same texture in the figure belong to the same cluster. For example, nucleotide sequences 641, 642, and 643 belong to cluster 640. Next, through step 510, the first subset 670 of the second plurality of clusters 610 to 660 is determined. For example, with the third threshold set to 5, clusters 610, 630, and 660, whose cluster sizes are greater than 5, form the first subset 670, while clusters 620, 640, and 650, whose cluster sizes are smaller than 5, do not belong to the first subset 670.
Referring next to FIG. 6B, it shows an example operation of merging the clusters 610, 630, and 660 in the first subset 670. Taking cluster 610 and cluster 630 as an example, a fourth threshold number (set to 3 in FIG. 6B) of sequences 611, 612, and 613 are first randomly selected from cluster 610. Likewise, 3 sequences 631, 632, and 633 are randomly selected from cluster 630. The second dynamic time warping distances between the nanopore sequencing signals corresponding to the sequences from the different clusters are then computed. For example, for sequence 611, the second dynamic time warping distances between sequence 611 and sequence 631, between sequence 611 and sequence 632, and between sequence 611 and sequence 633 are computed. The same is done for sequences 612 and 613. Next, it is determined 680 whether all the second dynamic time warping distances are smaller than the merge threshold. For example, if all the second dynamic time warping distances between sequences 611, 612, and 613 and sequences 631, 632, and 633 are smaller than the merge threshold, cluster 610 and cluster 630 are merged to obtain cluster 690. Similarly, sequences 661, 662, and 663 are randomly selected from cluster 660 and compared in the same way with sequences 611, 612, and 613 from cluster 610 and with sequences 631, 632, and 633 from cluster 630. If cluster 660 satisfies the merging condition with neither cluster 610 nor cluster 630, no merging is performed.
Therefore, embodiments of the present disclosure can use nanopore sequencing signals to merge some of the clusters obtained by the first clustering, thereby improving clustering accuracy.
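The merge test between two clusters can be sketched as below. It is a minimal illustration under stated assumptions, not the disclosed implementation: signals are plain lists of numbers, `dtw_distance` is a textbook dynamic time warping routine using the absolute difference as local cost (the text does not fix a DTW variant), and the names `should_merge`, `k` and `rng` are hypothetical.

```python
import random


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping distance.

    Assumption: the local cost is the absolute difference of two samples.
    """
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row 0 of the cost table
    for x in a:
        cur = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cur[j] = abs(x - y) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]


def should_merge(signals_a, signals_b, k, merge_threshold, rng=random):
    """Return True if every sampled cross-cluster DTW distance is below the threshold."""
    picked_a = rng.sample(signals_a, min(k, len(signals_a)))
    picked_b = rng.sample(signals_b, min(k, len(signals_b)))
    return all(
        dtw_distance(sa, sb) < merge_threshold
        for sa in picked_a
        for sb in picked_b
    )
```

With `k` set to 3, as in FIG. 6B, nine distances are evaluated per pair of clusters, and a single distance at or above the merge threshold is enough to prevent the merge.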
In some exemplary embodiments, if a second subset, consisting of the second plurality of clusters excluding the merged first subset, is non-empty, the clustering can be further optimized. FIG. 7 is a flowchart of an example process of cluster optimization in the method 100 of FIG. 1 according to an embodiment of the present disclosure. As shown in FIG. 7, the cluster optimization (step 140) further includes steps 710 to 740.
In step 710, a consensus sequence signal corresponding to each cluster in the merged first subset is determined. In one example, the consensus sequence of a cluster may be determined first, and the nanopore sequencing signal of that consensus sequence is then determined to obtain the consensus sequence signal.
In step 720, for each nucleotide sequence included in the second subset and for each consensus sequence signal, a third dynamic time warping distance between the nanopore sequencing signal corresponding to the nucleotide sequence and the consensus sequence signal is calculated. That is, for each sequence in the second subset, the third dynamic time warping distances between that sequence and all the consensus sequence signals are calculated.
In step 730, in response to determining that the third dynamic time warping distance is smaller than the merge threshold, the nucleotide sequence is added to the cluster in the merged first subset that corresponds to the consensus sequence signal, so as to update the merged first subset. In one example, if the third dynamic time warping distance between a sequence in the second subset and a consensus sequence signal is smaller than the merge threshold, the sequence is added to the cluster corresponding to that consensus sequence signal.
In step 740, the nucleotide sequences that have been added to the merged first subset are removed from the second subset, so as to update the second subset.
In some exemplary embodiments, steps 710 to 740 may be implemented by Algorithm 4 shown below:
Figure PCTCN2021116704-appb-000004
As shown in Algorithm 4, RefineGoodCluster denotes the merged first subset, OSS denotes the second subset, InitialCFSignalSet denotes the set of consensus sequence signals of the clusters in the merged first subset RefineGoodCluster, and threshold denotes the merge threshold.
FIG. 8 is a schematic diagram of an example process of cluster optimization in the method 100 of FIG. 1 according to an embodiment of the present disclosure. As shown in FIG. 8, the merged first subset includes cluster 810 and cluster 820. The consensus sequence signal 811 of cluster 810 and the consensus sequence signal 821 of cluster 820 are determined respectively. The second subset 830 includes 9 nucleotide sequences. Taking sequence 831 as an example, the third dynamic time warping distances between the nanopore sequencing signal corresponding to sequence 831 and the consensus sequence signals 811 and 821 are calculated (denoted d1 and d2, respectively). A comparator 840 then judges whether d1 and d2 are smaller than the merge threshold. For example, when d1 is smaller than the merge threshold, sequence 831 is added to cluster 810, which corresponds to the consensus sequence signal 811, so that cluster 810 is updated to cluster 810'. Similarly, when the third dynamic time warping distance between the nanopore sequencing signal corresponding to sequence 833 and the consensus sequence signal 821 is smaller than the merge threshold, sequence 833 is added to cluster 820, which corresponds to the consensus sequence signal 821, so that cluster 820 is updated to cluster 820'. Accordingly, sequences 831 and 833, for example, are removed from the second subset 830. On the other hand, when the third dynamic time warping distances between the nanopore sequencing signal corresponding to sequence 832 and both the consensus sequence signal 811 and the consensus sequence signal 821 are greater than or equal to the merge threshold, the second subset 830 retains sequence 832. Finally, the updated second subset 830' is obtained.
Therefore, embodiments of the present disclosure use the consensus sequence signals to further optimize the merged second plurality of clusters, thereby improving clustering accuracy.
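Steps 720 to 740 can be sketched as a single pass over the second subset. This is an illustrative sketch only and is not a transcription of Algorithm 4: the function name `assign_to_consensus` is hypothetical, `dist` stands for any dynamic time warping implementation supplied by the caller, and the choice of the closest consensus signal when several fall below the threshold is an assumption that the text does not specify.

```python
def assign_to_consensus(leftover, consensus_signals, clusters, merge_threshold, dist):
    """Move leftover signals into the cluster of a sufficiently close consensus signal.

    clusters[i] is the cluster whose consensus signal is consensus_signals[i].
    Returns the signals that stay in the (updated) second subset.
    Assumption: if several consensus signals qualify, the closest one wins.
    """
    remaining = []
    for signal in leftover:
        scored = [(dist(signal, c), i) for i, c in enumerate(consensus_signals)]
        best_distance, best_index = min(scored, default=(float("inf"), -1))
        if best_distance < merge_threshold:
            clusters[best_index].append(signal)  # step 730: update merged first subset
        else:
            remaining.append(signal)  # stays in the second subset
    return remaining  # step 740: the updated second subset
```

The `clusters` list is modified in place, so after the call it holds the updated merged first subset while the return value plays the role of the updated second subset.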
In some exemplary embodiments, in response to the updated second subset being non-empty, further cluster optimization may be performed.
In one example, the updated second subset is clustered based on the nanopore sequencing signals corresponding to the nucleotide sequences in the updated second subset, so as to obtain at least one cluster. The fourth dynamic time warping distance between the nanopore sequencing signals corresponding to every two nucleotide sequences in each of the at least one cluster is smaller than the merge threshold, and the updated merged first subset and the at least one cluster together form the third plurality of clusters. In one example, the above steps may be implemented with reference to Algorithm 4, in which G denotes the at least one cluster. For the sequences in each cluster of G, the pairwise fourth dynamic time warping distances are all smaller than the merge threshold.
FIG. 9 is a schematic diagram of an example process of cluster optimization in the method 100 of FIG. 1 according to an embodiment of the present disclosure. As shown in FIG. 9, the updated second subset includes nucleotide sequences 910, 920, 930 and 940. Taking nucleotide sequence 910 as an example, the fourth dynamic time warping distances 921, 931 and 941 between the nanopore sequencing signal corresponding to sequence 910 and the nanopore sequencing signals corresponding to sequences 920, 930 and 940 are first calculated. A comparator 950 compares the fourth dynamic time warping distances 921, 931 and 941 with the merge threshold. For example, when the fourth dynamic time warping distance 921 is smaller than the merge threshold, a new cluster 960 including sequences 910 and 920 is generated. Since distances 931 and 941 are greater than or equal to the merge threshold, the fourth dynamic time warping distance 943 between the nanopore sequencing signal corresponding to sequence 930 and the nanopore sequencing signal corresponding to sequence 940 is calculated next. The comparator 950 then judges whether the fourth dynamic time warping distance 943 is smaller than the merge threshold. When the fourth dynamic time warping distance 943 is smaller than the merge threshold, another new cluster 970 including sequences 930 and 940 is generated.
Therefore, embodiments of the present disclosure can further optimize the nucleotide sequences that were not assigned to the updated merged first subset, thereby improving clustering accuracy.
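The grouping illustrated in FIG. 9 can be sketched as a greedy procedure. The sketch below is one possible reading, not the disclosed algorithm: the function name `cluster_leftover` is hypothetical, `dist` stands for any dynamic time warping implementation supplied by the caller, and the greedy seed order is an assumption. A signal joins a group only if it is closer than the merge threshold to every current member, which preserves the stated property that all pairwise distances within a cluster are below the threshold.

```python
def cluster_leftover(signals, merge_threshold, dist):
    """Greedily group signals so that all pairs in a group are below the threshold.

    Returns (groups, singles): groups are lists of indices into `signals` with at
    least two members; singles are indices that could not be grouped.
    """
    unassigned = list(range(len(signals)))
    groups, singles = [], []
    while unassigned:
        seed = unassigned.pop(0)
        group = [seed]
        for idx in list(unassigned):
            # Require closeness to every member, not only to the seed.
            if all(dist(signals[idx], signals[m]) < merge_threshold for m in group):
                group.append(idx)
                unassigned.remove(idx)
        if len(group) > 1:
            groups.append(group)
        else:
            singles.append(seed)
    return groups, singles
```

For four signals arranged like sequences 910 to 940 in FIG. 9, the sketch would return two groups corresponding to clusters 960 and 970 and no ungrouped signals.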
FIG. 10 is a flowchart of an example process of cluster optimization in the method 100 of FIG. 1 according to an embodiment of the present disclosure. In such an embodiment, in response to a third subset, consisting of the second plurality of clusters excluding the third plurality of clusters, being non-empty, the cluster optimization (step 140) may further include steps 1010 to 1050. In one example, steps 1010 to 1050 serve as a checking mechanism for the clustering result, used to find nucleotide sequences that have not yet been added to any cluster. For example, some nucleotide sequences are very short because of errors in translating the signal into a sequence. Steps 1010 to 1050 can add such nucleotide sequences to the corresponding clusters.
In step 1010, for each nucleotide sequence in the third subset, a fifth dynamic time warping distance is calculated between the nanopore sequencing signal corresponding to that nucleotide sequence and the nanopore sequencing signal corresponding to a nucleotide sequence randomly selected from the third plurality of clusters.
In step 1020, in response to determining that the fifth dynamic time warping distance is smaller than the merge threshold, the nucleotide sequence is added to the cluster, among the third plurality of clusters, that includes the randomly selected nucleotide sequence, so as to update the third plurality of clusters.
In step 1030, the nucleotide sequences that have been added to the third plurality of clusters are removed from the third subset, so as to update the third subset.
In some exemplary embodiments, steps 1010 to 1030 may be implemented by Algorithm 5 shown below:
Figure PCTCN2021116704-appb-000005
As shown in Algorithm 5, Clusters_now denotes the current third plurality of clusters, NN denotes the third subset, and ss denotes a sequence randomly selected from the current third plurality of clusters Clusters_now. Algorithm 5 calculates the fifth dynamic time warping distance between the nanopore sequencing signal corresponding to each sequence nn in the third subset NN and that of ss, and judges whether the distance is smaller than the merge threshold Threshold. If it is, nn is added to the cluster Cluster that contains ss.
In some exemplary embodiments, in response to the updated third subset being non-empty, the cluster optimization further includes steps 1040 and 1050.
In step 1040, each nucleotide sequence in the updated third subset is classified as its own individual cluster.
In step 1050, each such individual cluster is added to the updated third plurality of clusters.
Therefore, embodiments of the present disclosure also introduce a checking mechanism that clusters the nucleotide sequences not yet added to any cluster, thereby ensuring the completeness of the clustering result.
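Steps 1010 to 1050 can be sketched together as follows. This is a hedged illustration and not a transcription of Algorithm 5: the function name `rescue_leftover` is hypothetical, `dist` stands for any dynamic time warping implementation supplied by the caller, and drawing one random representative ss per cluster and stopping at the first cluster that matches is an assumption about how the random selection is applied.

```python
import random


def rescue_leftover(leftover, clusters, merge_threshold, dist, rng=random):
    """Attach leftover signals to existing clusters, else make singleton clusters.

    Assumption: for each leftover signal, one random representative per cluster
    is compared, and the first cluster below the threshold receives the signal.
    """
    still_left = []
    for nn in leftover:
        for cluster in clusters:
            ss = rng.choice(cluster)  # randomly selected representative
            if dist(nn, ss) < merge_threshold:
                cluster.append(nn)  # step 1020
                break
        else:
            still_left.append(nn)  # remains in the third subset (step 1030)
    # Steps 1040 and 1050: every remaining signal becomes its own cluster.
    clusters.extend([nn] for nn in still_left)
    return clusters
```

Because every leftover signal ends up either in an existing cluster or in a singleton cluster, no input sequence is dropped from the final result.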
In some exemplary embodiments, the plurality of nucleotide sequences come from a plurality of single cells, nucleotide sequences from the same single cell carry the same tag, and nucleotide sequences from different single cells carry different tags. In one example, embodiments of the present disclosure can be used not only for clustering untagged sequences but also for clustering tagged nucleotide sequences, with the clustering result directly associated with the tags.
In some exemplary embodiments, the third plurality of clusters are associated with respective tags, and the method 100 further includes: separating, from the plurality of nucleotide sequences, the nucleotide sequences from each of the plurality of single cells based on the third plurality of clusters and the respective tags associated with them. In one example, the sequences in the sequencing library have tags integrated into them, and a tag reflects the cell of origin of a sequence. With the method 100, the tagged sequences can be clustered. After clustering is complete, the nucleotide sequences of each single cell can be separated from the plurality of nucleotide sequences according to the clustering result. Therefore, the method 100 can separate the nucleotide sequences of each single cell, based on their single-cell origin, from a large number of nucleotide sequences in which multiple single cells are mixed, thereby improving the accuracy of single-cell sequencing.
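The tag-based separation can be sketched as a simple grouping step. The sketch assumes, for illustration only, that each cluster is a list of sequences and that exactly one tag per cluster is already known; the function name `split_by_cell` and the parallel-list layout are hypothetical.

```python
from collections import defaultdict


def split_by_cell(clusters, cluster_tags):
    """Collect the sequences of all clusters that share a tag (cell of origin).

    Assumption: cluster_tags[i] is the single tag associated with clusters[i].
    """
    per_cell = defaultdict(list)
    for cluster, tag in zip(clusters, cluster_tags):
        per_cell[tag].extend(cluster)
    return dict(per_cell)
```

Clusters that carry the same tag are pooled, so each key of the returned dictionary maps to the sequences attributed to one single cell.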
FIG. 11 is a block diagram of an apparatus 1100 for single-cell sequencing according to an embodiment of the present disclosure. As shown in FIG. 11, the single-cell sequencing apparatus 1100 includes an acquisition module 1110, a first similarity clustering module 1120, a determination module 1130, a second similarity clustering module 1140 and a cluster optimization module 1150.
The acquisition module 1110 is configured to acquire a plurality of nucleotide sequences in a sequencing library and the nanopore sequencing signals corresponding to the plurality of nucleotide sequences.
The first similarity clustering module 1120 is configured to perform the first clustering on the plurality of nucleotide sequences based on a first similarity threshold to obtain a first plurality of clusters, the first plurality of clusters including a largest cluster having a maximum cluster size.
The determination module 1130 is configured to determine a merge threshold based on the mean signal length of the nanopore sequencing signals corresponding to the plurality of nucleotide sequences and on the nanopore sequencing signals corresponding to the nucleotide sequences in the largest cluster.
The second similarity clustering module 1140 is configured to perform the first clustering on the plurality of nucleotide sequences based on a second similarity threshold to obtain a second plurality of clusters, the first similarity threshold being greater than the second similarity threshold.
The cluster optimization module 1150 is configured to perform cluster optimization on the second plurality of clusters based on the merge threshold to obtain a third plurality of clusters.
In some exemplary embodiments, the determination module 1130 includes a first selection submodule 1131, a first calculation submodule 1132 and a first determination submodule 1133.
The first selection submodule 1131 is configured to randomly select a first-threshold number of nanopore sequencing signals from the nanopore sequencing signals corresponding to the nucleotide sequences in the largest cluster.
The first calculation submodule 1132 is configured to calculate a first dynamic time warping distance between every two nanopore sequencing signals among the first-threshold number of nanopore sequencing signals.
The first determination submodule 1133 is configured to determine the merge threshold based on the sum of the first dynamic time warping distances, the mean signal length of the nanopore sequencing signals corresponding to the plurality of nucleotide sequences, and the maximum cluster size.
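One input to the merge threshold, the sum of the first dynamic time warping distances over randomly sampled signals of the largest cluster, could be computed as sketched below. The function name `pairwise_distance_sum`, the `dist` callable (any dynamic time warping implementation) and the `rng` parameter are assumptions for illustration. How this sum is combined with the mean signal length and the maximum cluster size is not restated in this passage, so that step is deliberately left out.

```python
import random
from itertools import combinations


def pairwise_distance_sum(signals, k, dist, rng=random):
    """Sum dist over every unordered pair among k randomly sampled signals."""
    picked = rng.sample(signals, min(k, len(signals)))
    return sum(dist(a, b) for a, b in combinations(picked, 2))
```

Sampling k signals yields k(k-1)/2 pairwise distances, which keeps the cost bounded even when the largest cluster is very large.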
In some exemplary embodiments, the cluster optimization module 1150 includes a second determination submodule 1151, a second selection submodule 1152, a second calculation submodule 1153 and a merging submodule 1154.
The second determination submodule 1151 is configured to determine a first subset of the second plurality of clusters based on the maximum cluster size and a third threshold.
The second selection submodule 1152 is configured to, for each cluster in the first subset, randomly select a fourth-threshold number of nanopore sequencing signals from the nanopore sequencing signals corresponding to the nucleotide sequences in that cluster.
The second calculation submodule 1153 is configured to calculate the respective second dynamic time warping distances between each of the fourth-threshold number of nanopore sequencing signals randomly selected from that cluster and the fourth-threshold number of nanopore sequencing signals randomly selected from another cluster in the first subset.
The merging submodule 1154 is configured to, in response to determining that the respective second dynamic time warping distances are all smaller than the merge threshold, merge that cluster and the other cluster to obtain the merged first subset.
In some exemplary embodiments, in response to a second subset, consisting of the second plurality of clusters excluding the merged first subset, being non-empty, the cluster optimization module 1150 further includes a third determination submodule 1155, a third calculation submodule 1156, a first update submodule 1157 and a second update submodule 1158.
The third determination submodule 1155 is configured to determine the consensus sequence signal corresponding to each cluster in the merged first subset.
The third calculation submodule 1156 is configured to, for each nucleotide sequence included in the second subset and for each consensus sequence signal, calculate a third dynamic time warping distance between the nanopore sequencing signal corresponding to the nucleotide sequence and the consensus sequence signal.
The first update submodule 1157 is configured to, in response to determining that the third dynamic time warping distance is smaller than the merge threshold, add the nucleotide sequence to the cluster in the merged first subset that corresponds to the consensus sequence signal, so as to update the merged first subset.
The second update submodule 1158 is configured to remove from the second subset the nucleotide sequences that have been added to the merged first subset, so as to update the second subset.
It should be understood that the modules of the apparatus 1100 shown in FIG. 11 may correspond to the steps of the method 100 described above with reference to FIGS. 1-10. Accordingly, the operations, features and advantages described above for the method 100 apply equally to the apparatus 1100 and the modules it includes. For brevity, some operations, features and advantages are not repeated here.
Although particular functions are discussed above with reference to particular modules, it should be noted that the functions of each module discussed herein may be divided among multiple modules, and/or at least some functions of multiple modules may be combined into a single module. A statement herein that a particular module performs an action covers the particular module itself performing the action, or alternatively the particular module invoking or otherwise accessing another component or module that performs the action (or performs the action in conjunction with the particular module). Thus, a particular module that performs an action may include the particular module itself and/or another module that the particular module invokes or otherwise accesses and that performs the action.
It should also be understood that various techniques may be described herein in the general context of software, hardware elements or program modules. The modules described above with respect to FIG. 11 may be implemented in hardware or in hardware combined with software and/or firmware. For example, these modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these modules may be implemented as hardware logic/circuitry.
According to one aspect of the present disclosure, an electronic device is provided, including: at least one processor; and at least one memory communicatively connected to the at least one processor, the at least one memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method described above.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing instructions is provided, the instructions, when executed by at least one processor of a computer, causing the computer to perform the method described above.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method described above.
FIG. 12 shows an example configuration of an electronic device 1200 that may be used to implement the methods described herein.
The electronic device 1200 may be any of various types of devices. Examples of the electronic device 1200 include, but are not limited to: desktop computers, server computers, notebook or netbook computers, mobile devices (e.g., tablet computers, cellular or other wireless telephones (e.g., smartphones), notepad computers, mobile stations), wearable devices (e.g., glasses, watches), entertainment devices (e.g., entertainment appliances, set-top boxes communicatively coupled to display devices, game consoles), televisions or other display devices, automotive computers, and the like.
The electronic device 1200 may include at least one processor 1202, a memory 1204, communication interface(s) 1206, a display device 1208, other input/output (I/O) devices 1210 and one or more mass storage devices 1212, which can communicate with one another, for example through a system bus 1214 or another suitable connection.
The processor 1202 may be a single processing unit or multiple processing units, each of which may include a single or multiple computing units or multiple cores. The processor 1202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuits, and/or any device that manipulates signals based on operational instructions. Among other capabilities, the processor 1202 may be configured to fetch and execute computer-readable instructions stored in the memory 1204, the mass storage device 1212 or other computer-readable media, such as program code of an operating system 1216, program code of application programs 1218, program code of other programs 1220, and the like.
The memory 1204 and the mass storage device 1212 are examples of computer-readable storage media for storing instructions that are executed by the processor 1202 to implement the various functions described above. For example, the memory 1204 may generally include both volatile and non-volatile memory (e.g., RAM, ROM, etc.). In addition, the mass storage device 1212 may generally include hard disk drives, solid-state drives, removable media including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), storage arrays, network-attached storage, storage area networks, and the like. The memory 1204 and the mass storage device 1212 may both be referred to collectively herein as memory or computer-readable storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that can be executed by the processor 1202 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of programs may be stored on the mass storage device 1212. These programs include the operating system 1216, one or more application programs 1218, other programs 1220 and program data 1222, and they may be loaded into the memory 1204 for execution. Examples of such application programs or program modules may include, for instance, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: the acquisition module 1110, the first similarity clustering module 1120, the determination module 1130, the second similarity clustering module 1140 and the cluster optimization module 1150, the method 100 (including any suitable steps of the method 100), and/or further embodiments described herein.
Although illustrated in FIG. 12 as being stored in the memory 1204 of the electronic device 1200, the modules 1216, 1218, 1220 and 1222, or portions thereof, may be implemented using any form of computer-readable media accessible by the electronic device 1200. As used herein, "computer-readable media" includes at least two types of computer-readable media, namely computer-readable storage media and communication media.
Computer-readable storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by an electronic device. In contrast, communication media may embody computer-readable instructions, data structures, program modules or other data in a modulated data signal, such as a carrier wave or other transport mechanism. Computer-readable storage media as defined herein do not include communication media.
The one or more communication interfaces 1206 are used to exchange data with other devices, for example over a network or a direct connection. Such a communication interface may be one or more of the following: any type of network interface (e.g., a network interface card (NIC)), a wired or wireless interface (such as an IEEE 802.11 wireless LAN (WLAN) interface), a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a near field communication (NFC) interface, and the like. The communication interface 1206 can facilitate communication within a variety of network and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and the like. The communication interface 1206 may also provide communication with external storage devices (not shown), such as those in storage arrays, network-attached storage, storage area networks, and the like.
In some examples, a display device 1208, such as a monitor, may be included for displaying information and images to a user. Other I/O devices 1210 may be devices that receive various inputs from the user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and the like.
The techniques described herein may be supported by these various configurations of the electronic device 1200 and are not limited to the specific examples described herein. For example, the functionality may also be implemented in whole or in part on a "cloud" by using a distributed system. The cloud includes and/or represents a platform for resources. The platform abstracts the underlying functionality of the cloud's hardware (e.g., servers) and software resources. The resources may include applications and/or data that can be used when computing is performed on a server remote from the electronic device 1200. The resources may also include services provided over the Internet and/or over a subscriber network, such as a cellular or Wi-Fi network. The platform may abstract resources and functions to connect the electronic device 1200 with other electronic devices. Accordingly, implementation of the functionality described herein may be distributed throughout the cloud. For example, the functionality may be implemented partly on the electronic device 1200 and partly through the platform that abstracts the functionality of the cloud.
While the present disclosure has been illustrated and described in detail in the drawings and the foregoing description, such illustration and description are to be considered illustrative and exemplary rather than restrictive; the present disclosure is not limited to the disclosed embodiments. Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps not listed, the indefinite article "a" or "an" does not exclude a plurality, and the term "plurality" means two or more. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (17)

  1. A method for single-cell sequencing, comprising:
    obtaining a plurality of nucleotide sequences in a sequencing library and nanopore sequencing signals corresponding to the plurality of nucleotide sequences;
    performing a first clustering on the plurality of nucleotide sequences based on a first similarity threshold to obtain a first plurality of clusters, wherein the first plurality of clusters includes a largest cluster having a maximum cluster size;
    determining a merging threshold based on a mean signal length of the nanopore sequencing signals corresponding to the plurality of nucleotide sequences and on the nanopore sequencing signals corresponding to the nucleotide sequences in the largest cluster;
    performing the first clustering on the plurality of nucleotide sequences based on a second similarity threshold to obtain a second plurality of clusters, wherein the first similarity threshold is greater than the second similarity threshold; and
    performing cluster optimization on the second plurality of clusters based on the merging threshold to obtain a third plurality of clusters.
  2. The method according to claim 1, wherein performing the first clustering on the plurality of nucleotide sequences based on the first similarity threshold comprises:
    performing an iterative process until a set of nucleotide sequences to be clustered among the plurality of nucleotide sequences is empty, the iterative process comprising:
    determining a representative sequence of the set of nucleotide sequences to be clustered;
    filtering the set of nucleotide sequences to be clustered using a short-word filter;
    in response to the filtered set of nucleotide sequences to be clustered being non-empty, for each nucleotide sequence in the filtered set:
    determining a similarity between the nucleotide sequence and the representative sequence;
    in response to determining that the similarity is greater than or equal to the first similarity threshold, adding the nucleotide sequence to a similarity cluster that includes the representative sequence;
    in response to the filtered set of nucleotide sequences to be clustered being empty, adding the representative sequence to a short-word cluster; and
    removing the nucleotide sequences in the similarity cluster and the short-word cluster from the set of nucleotide sequences to be clustered, so as to update the set of nucleotide sequences to be clustered, wherein the similarity clusters and short-word clusters obtained during the iterative process form the first plurality of clusters.
  3. The method according to claim 2, wherein determining the representative sequence of the set of nucleotide sequences to be clustered comprises:
    determining the longest nucleotide sequence in the set of nucleotide sequences to be clustered as the representative sequence.
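By way of non-limiting illustration, the iterative process of claims 2 and 3 may be sketched in Python as follows. The sketch rests on several assumptions that the claims do not fix: the short-word filter is taken to keep sequences sharing at least `min_shared` k-mers with the representative, the filter is applied to the sequences other than the representative itself, and `difflib.SequenceMatcher.ratio()` stands in for whatever sequence similarity measure an actual implementation uses. The names `kmers`, `greedy_cluster`, `k`, and `min_shared` are hypothetical.

```python
from difflib import SequenceMatcher


def kmers(seq, k):
    """Return the set of all length-k substrings (short words) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_cluster(seqs, sim_threshold, k=3, min_shared=1):
    """Greedy clustering sketch; returns (similarity_clusters, short_word_clusters)."""
    remaining = list(seqs)
    similarity_clusters, short_word_clusters = [], []
    while remaining:
        # Claim 3: the representative is the longest sequence still unclustered.
        rep = max(remaining, key=len)
        remaining.remove(rep)
        rep_words = kmers(rep, k)
        # Assumed short-word filter: keep sequences sharing enough k-mers with rep.
        candidates = [s for s in remaining
                      if len(kmers(s, k) & rep_words) >= min_shared]
        if not candidates:
            # Nothing passed the filter: the representative forms a short-word cluster.
            short_word_clusters.append([rep])
            continue
        cluster = [rep]
        for s in candidates:
            # Stand-in similarity measure (assumption, not the claimed measure).
            if SequenceMatcher(None, rep, s).ratio() >= sim_threshold:
                cluster.append(s)
        # Remove the newly clustered sequences from the set to be clustered.
        for s in cluster[1:]:
            remaining.remove(s)
        similarity_clusters.append(cluster)
    return similarity_clusters, short_word_clusters
```

Running the same function twice, first with the higher first similarity threshold and then with the lower second similarity threshold, would yield the first and second pluralities of clusters of claim 1.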
  4. The method according to claim 1, wherein determining the merging threshold comprises:
    randomly selecting a first threshold number of nanopore sequencing signals from the nanopore sequencing signals corresponding to the nucleotide sequences in the largest cluster;
    calculating a first dynamic time warping distance between every two nanopore sequencing signals among the first threshold number of nanopore sequencing signals; and
    determining the merging threshold based on a sum of the first dynamic time warping distances, the mean signal length of the nanopore sequencing signals corresponding to the plurality of nucleotide sequences, and the maximum cluster size.
  5. The method according to claim 4,
    wherein, in response to determining that the maximum cluster size is greater than a second threshold, the first threshold is the second threshold; and
    wherein, in response to determining that the maximum cluster size is less than or equal to the second threshold, the first threshold is the maximum cluster size.
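By way of non-limiting illustration, the quantities named in claims 4 and 5 may be computed as in the following Python sketch. It assumes signals are lists of numbers and uses a textbook dynamic time warping recurrence with an absolute-difference cost. The claims do not state how the three quantities are combined into the merging threshold, so the sketch returns them without combining them. The function names are hypothetical.

```python
import random
from itertools import combinations


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping with |x - y| cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]


def sample_count(max_cluster_size, second_threshold):
    """Claim 5: the first threshold is the smaller of the two values."""
    return min(max_cluster_size, second_threshold)


def merge_threshold_inputs(largest_cluster_signals, all_signals,
                           second_threshold, rng=None):
    """Return (sum of pairwise DTW distances, mean signal length, max cluster size)."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    max_size = len(largest_cluster_signals)
    n = sample_count(max_size, second_threshold)
    sampled = rng.sample(largest_cluster_signals, n)
    dtw_sum = sum(dtw_distance(a, b) for a, b in combinations(sampled, 2))
    mean_len = sum(len(s) for s in all_signals) / len(all_signals)
    return dtw_sum, mean_len, max_size
```

Capping the sample at the second threshold bounds the number of pairwise distance computations, which grows quadratically with the number of sampled signals.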
  6. The method according to claim 1, wherein the cluster optimization comprises:
    determining a first subset of the second plurality of clusters based on the maximum cluster size and a third threshold; and
    for each cluster in the first subset:
    randomly selecting a fourth threshold number of nanopore sequencing signals from the nanopore sequencing signals corresponding to the nucleotide sequences in the cluster;
    calculating a respective second dynamic time warping distance between each of the fourth threshold number of nanopore sequencing signals randomly selected from the cluster and each of a fourth threshold number of nanopore sequencing signals randomly selected from another cluster in the first subset; and
    in response to determining that the respective second dynamic time warping distances are all less than the merging threshold, merging the cluster and the other cluster to obtain a merged first subset.
  7. The method according to claim 6, wherein determining the first subset comprises:
    in response to determining that the maximum cluster size is greater than the third threshold, determining the clusters of the second plurality of clusters whose sizes are greater than the third threshold to form the first subset; and
    in response to determining that the maximum cluster size is less than or equal to the third threshold, determining the clusters of the second plurality of clusters whose sizes are equal to the maximum cluster size to form the first subset.
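By way of non-limiting illustration, the subset selection of claim 7 and the pairwise merging of claim 6 may be sketched in Python as follows. Each cluster is represented here simply as a list of its signals, and `dist` is any distance function supplied by the caller (a dynamic time warping distance in the claimed method). Two points are assumptions of the sketch rather than statements of the claims: a cluster with fewer than `k` signals contributes all of its signals, and clusters are merged greedily in input order. All function names are hypothetical.

```python
import random


def select_first_subset(clusters, max_cluster_size, third_threshold):
    """Claim 7: pick the clusters that take part in the merging step."""
    if max_cluster_size > third_threshold:
        return [c for c in clusters if len(c) > third_threshold]
    return [c for c in clusters if len(c) == max_cluster_size]


def all_pairs_below(signals_a, signals_b, k, merge_threshold, dist, rng):
    """True if every sampled cross-cluster pair is closer than merge_threshold."""
    pick_a = rng.sample(signals_a, min(k, len(signals_a)))
    pick_b = rng.sample(signals_b, min(k, len(signals_b)))
    return all(dist(a, b) < merge_threshold for a in pick_a for b in pick_b)


def merge_first_subset(subset, k, merge_threshold, dist, rng=None):
    """Greedily merge clusters whose sampled signals are mutually close."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    merged = []
    for cluster in subset:
        for target in merged:
            if all_pairs_below(cluster, target, k, merge_threshold, dist, rng):
                target.extend(cluster)
                break
        else:
            # No existing cluster was close enough: keep this one separate.
            merged.append(list(cluster))
    return merged
```

Comparing only `k` sampled signals per cluster keeps the cost of each merge decision at most `k * k` distance evaluations, regardless of cluster size.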
  8. The method according to claim 6, wherein, in response to a second subset, consisting of the second plurality of clusters excluding the merged first subset, being non-empty, the cluster optimization further comprises:
    determining a consensus sequence signal corresponding to each cluster in the merged first subset;
    for each nucleotide sequence included in the second subset:
    for each consensus sequence signal:
    calculating a third dynamic time warping distance between the nanopore sequencing signal corresponding to the nucleotide sequence and the consensus sequence signal;
    in response to determining that the third dynamic time warping distance is less than the merging threshold, adding the nucleotide sequence to the cluster in the merged first subset that corresponds to the consensus sequence signal, so as to update the merged first subset; and
    removing, from the second subset, the nucleotide sequences added to the merged first subset, so as to update the second subset.
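By way of non-limiting illustration, the assignment step of claim 8 may be sketched in Python as follows. The consensus sequence signals are taken as given inputs, since the claim does not say how they are derived. Where a signal is within the merging threshold of more than one consensus signal, the sketch assigns it to the nearest one; that tie-breaking rule is an assumption. The names `assign_to_consensus`, `second_subset`, and `consensus_signals` are hypothetical, and `dist` is any caller-supplied distance (a dynamic time warping distance in the claimed method).

```python
def assign_to_consensus(second_subset, consensus_signals, merge_threshold, dist):
    """Assign leftover reads to merged clusters via their consensus signals.

    second_subset:     dict mapping read id -> signal
    consensus_signals: dict mapping cluster id -> consensus signal
    Returns (assignments, leftover): read id -> cluster id, and the reads
    that stayed in the (updated) second subset.
    """
    assignments, leftover = {}, {}
    for read_id, signal in second_subset.items():
        best_cluster, best_d = None, merge_threshold
        for cluster_id, consensus in consensus_signals.items():
            d = dist(signal, consensus)
            # Strictly below the merging threshold, and nearest so far.
            if d < best_d:
                best_cluster, best_d = cluster_id, d
        if best_cluster is None:
            leftover[read_id] = signal
        else:
            assignments[read_id] = best_cluster
    return assignments, leftover
```

The `leftover` dictionary corresponds to the updated second subset, which claim 9 then clusters on its own.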
  9. The method according to claim 8, wherein, in response to the updated second subset being non-empty, the cluster optimization further comprises:
    clustering the updated second subset based on the nanopore sequencing signals corresponding to the nucleotide sequences in the updated second subset to obtain at least one cluster, wherein, within each of the at least one cluster, a fourth dynamic time warping distance between the nanopore sequencing signals corresponding to every two nucleotide sequences is less than the merging threshold, and
    wherein the updated merged first subset and the at least one cluster form the third plurality of clusters.
  10. The method according to claim 9, wherein, in response to a third subset, consisting of the second plurality of clusters excluding the third plurality of clusters, being non-empty, the cluster optimization further comprises:
    for each nucleotide sequence in the third subset:
    calculating a fifth dynamic time warping distance between the nanopore sequencing signal corresponding to the nucleotide sequence and the nanopore sequencing signal corresponding to a nucleotide sequence randomly selected from the third plurality of clusters;
    in response to determining that the fifth dynamic time warping distance is less than the merging threshold, adding the nucleotide sequence to the cluster of the third plurality of clusters that includes the randomly selected nucleotide sequence, so as to update the third plurality of clusters; and
    removing, from the third subset, the nucleotide sequences added to the third plurality of clusters, so as to update the third subset.
  11. The method according to claim 10, wherein, in response to the updated third subset being non-empty, the cluster optimization further comprises:
    classifying each nucleotide sequence in the updated third subset as a respective individual cluster; and
    adding each respective individual cluster to the updated third plurality of clusters.
  12. The method according to any one of claims 1-11, wherein the plurality of nucleotide sequences come from a plurality of single cells, wherein nucleotide sequences from the same single cell have the same label, and nucleotide sequences from different single cells have different labels.
  13. The method according to claim 12, wherein the third plurality of clusters are associated with respective labels, the method further comprising:
    separating, from the plurality of nucleotide sequences, the nucleotide sequences from each of the plurality of single cells based on the third plurality of clusters and the respective labels associated with the third plurality of clusters.
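By way of non-limiting illustration, the separation step of claim 13 amounts to grouping clustered sequences by the label of their cluster, as in the following Python sketch. The representation of clusters as lists of read identifiers, the parallel list of labels, and the label strings used in the example are all hypothetical.

```python
from collections import defaultdict


def split_by_cell(clusters, cluster_labels):
    """Group reads by the label associated with their cluster.

    clusters:       list of clusters, each a list of reads
    cluster_labels: list of labels, one per cluster, in the same order
    """
    per_cell = defaultdict(list)
    for cluster, label in zip(clusters, cluster_labels):
        per_cell[label].extend(cluster)
    return dict(per_cell)
```

For example, `split_by_cell([["r1", "r2"], ["r3"], ["r4"]], ["BC01", "BC02", "BC01"])` groups `r1`, `r2`, and `r4` under one label and `r3` under the other.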
  14. An apparatus for single-cell sequencing, comprising modules for implementing the method according to any one of claims 1-13.
  15. An electronic device, comprising:
    at least one processor; and
    at least one memory communicatively coupled to the at least one processor,
    wherein the at least one memory stores instructions that, when executed by the at least one processor, cause the at least one processor to perform the method according to any one of claims 1-13.
  16. A non-transitory computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor of a computer, cause the computer to perform the method according to any one of claims 1-13.
  17. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-13.

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/116704 WO2023029044A1 (en) 2021-09-06 2021-09-06 Single-cell sequencing method and apparatus, and device, medium and program product
CN202111481203.3A CN114171117B (en) 2021-09-06 2021-12-06 Method, apparatus, device, medium and program product for sequencing of single cells

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/116704 WO2023029044A1 (en) 2021-09-06 2021-09-06 Single-cell sequencing method and apparatus, and device, medium and program product

Publications (1)

Publication Number Publication Date
WO2023029044A1 true WO2023029044A1 (en) 2023-03-09

Family

ID=80483518

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/116704 WO2023029044A1 (en) 2021-09-06 2021-09-06 Single-cell sequencing method and apparatus, and device, medium and program product

Country Status (2)

Country Link
CN (1) CN114171117B (en)
WO (1) WO2023029044A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496114B (en) * 2022-11-18 2023-04-07 成都戎星科技有限公司 TDMA burst length estimation method based on K-means clustering

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104321441A (en) * 2012-02-16 2015-01-28 牛津楠路珀尔科技有限公司 Analysis of measurements of a polymer
CN109415765A (en) * 2016-04-14 2019-03-01 昆塔波尔公司 With the hybrid optical signal in the polymer analysis of nano-pore
US20200035325A1 (en) * 2018-07-24 2020-01-30 King Abdullah University Of Science And Technology Continuous wavelet-based dynamic time warping method and system
WO2020084404A1 (en) * 2018-10-25 2020-04-30 King Abdullah University Of Science And Technology System and method for direct subsequence searching and mapping in nanopore raw signal
CN111292806A (en) * 2020-03-27 2020-06-16 武汉古奥基因科技有限公司 Transcriptome analysis method by using nanopore sequencing
US20210139977A1 (en) * 2019-11-07 2021-05-13 Hong Kong Baptist University Method for identifying RNA isoforms in transcriptome using Nanopore RNA reads

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140357497A1 (en) * 2011-04-27 2014-12-04 Kun Zhang Designing padlock probes for targeted genomic sequencing
GB201319779D0 (en) * 2013-11-08 2013-12-25 Cartagenia N V Genetic analysis method
US11495324B2 (en) * 2019-10-01 2022-11-08 Microsoft Technology Licensing, Llc Flexible decoding in DNA data storage based on redundancy codes
WO2017209891A1 (en) * 2016-05-31 2017-12-07 Quantapore, Inc. Two-color nanopore sequencing
CN110111843B (en) * 2018-01-05 2021-07-06 深圳华大基因科技服务有限公司 Method, apparatus and storage medium for clustering nucleic acid sequences
CN110232951B (en) * 2018-12-06 2023-08-01 苏州金唯智生物科技有限公司 Method, computer readable medium and application for judging saturation of sequencing data
CN110600078B (en) * 2019-08-23 2022-03-18 北京百迈客生物科技有限公司 Method for detecting genome structure variation based on nanopore sequencing
CN112750502B (en) * 2021-01-18 2022-04-15 中南大学 Single cell transcriptome sequencing data clustering recommendation method based on two-dimensional distribution structure judgment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAN RENMIN, LI YU, GAO XIN, WANG SHENG: "An accurate and rapid continuous wavelet dynamic time warping algorithm for end-to-end mapping in ultra-long nanopore sequencing", BIOINFORMATICS, OXFORD UNIVERSITY PRESS , SURREY, GB, vol. 34, no. 17, 1 September 2018 (2018-09-01), GB , pages i722 - i731, XP093041541, ISSN: 1367-4803, DOI: 10.1093/bioinformatics/bty555 *

Also Published As

Publication number Publication date
CN114171117A (en) 2022-03-11
CN114171117B (en) 2022-07-15

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21955575

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE