WO2023220621A1 - Long-range dna sequencing through concatenating chimeric amplicon reads - Google Patents

Long-range DNA sequencing through concatenating chimeric amplicon reads

Info

Publication number
WO2023220621A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
mixture
length
stopper
umi
Prior art date
Application number
PCT/US2023/066813
Other languages
French (fr)
Inventor
David Zhang
Kerou ZHANG
Ping Song
Alessandro Pinto
Original Assignee
William Marsh Rice University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by William Marsh Rice University
Publication of WO2023220621A1

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844Nucleic acid amplification reactions
    • C12Q1/6869Methods for sequencing

Definitions

  • the present invention relates generally to the field of molecular biology. More particularly, it concerns compositions and methods for performing long-range, high throughput DNA sequencing.
  • compositions comprising: a 1st Stopper oligonucleotide comprising, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a 1st C Sequence with a length between 6nt and 500nt, a 2nd Stopper oligonucleotide comprising, from 5'
  • a 2nd Identity Sequence with a length between 5nt and 200nt wherein the 2nd Identity Sequence comprises a 2nd UMI Sequence, a 2nd A Sequence with a length between 3nt and 50nt, a 2nd Loop Sequence with a length between 3nt and 70nt, a 2nd B Sequence with a length between 3nt and 50nt, wherein the 2nd B Sequence is the reverse complement of the 2nd A Sequence, and a 2nd C Sequence with a length between 6nt and 500nt, a 3rd Stopper oligonucleotide comprising, from 5' to 3', a 3rd Identity Sequence with a length between 5nt and 200nt, a 3rd A Sequence with a length between 3nt and 50nt, a 3rd Loop Sequence with a length between 3nt and 70nt, and a 3rd B Sequence with a length between 3nt and 50nt, wherein
  • compositions further comprise said Target nucleic acid.
  • a Sample comprising a Target molecule comprising, from 5' to 3', a 3rd Binding Region, a 3rd Match Region, a 2nd Binding Region, a 2nd Match Region and a 1st Binding Region with a template-dependent polymerase, reagents and buffers needed for enzymatic function, a 1st Stopper oligonucleotide, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a 1st C Sequence with a length between 6nt and 500nt, wherein the C Sequence is reverse complementary to the 1st Binding Region, a 2nd Stopper oligonucleotide, wherein the 2nd Stopper comprises, from 5' to 3', a 2nd Identity Sequence with a length between 5nt and 200nt, wherein the 2nd Identity Sequence comprises a 2nd UMI Sequence
  • a 3rd Identity Sequence with a length between 5nt and 200nt, a 3rd A Sequence with a length between 3nt and 50nt, a 3rd Loop Sequence with a length between 3nt and 70nt, and a 3rd B Sequence with a length between 3nt and 50nt
  • the 3rd B Sequence is the reverse complement of the 3rd A Sequence
  • the 3rd B Sequence is reverse complementary to the 3rd Match Region
  • a 3rd C Sequence with a length between 6nt and 500nt wherein the 3rd C Sequence is the reverse complement of the 3rd Binding Region
  • the 2nd UMI Sequence comprises a set of designed DNA sequences, wherein the 2nd UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mix
  • each Amplicon comprises, from 5' to 3', a sequence derived from an upstream Stopper, an Insert Sequence that is the reverse complement of a subsequence between the Match Region corresponding to the downstream Stopper and the Binding Region corresponding to an upstream Stopper, a Match-Complement Sequence that is reverse complementary to the Match Region corresponding to the upstream Stopper, and an Identity-Complement Sequence that is the reverse complement of the Identity Sequence of the downstream Stopper;
  • methods for high throughput DNA sequencing comprise
  • a Sample comprising a Target molecule comprising, from 5' to 3', a 3rd Binding Region, a 3rd Match Region, a 2nd Binding Region, a 2nd Match Region and a 1st Binding Region with an annealing buffer, a 1st Stopper oligonucleotide, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a 1st C Sequence with a length between 6nt and 500nt, wherein the C Sequence is reverse complementary to the 1st Binding Region, a 2nd Stopper oligonucleotide, wherein the 2nd Stopper comprises, from 5' to 3', a 2nd Identity Sequence with a length between 5nt and 200nt, wherein the 2nd Identity Sequence comprises a 2nd UMI Sequence, a 2nd A Sequence with a length between 3nt
  • the 2nd B Sequence is the reverse complement of the 2nd A Sequence
  • the 2nd B Sequence is reverse complementary to the 2nd Match Region
  • a 2nd C Sequence with a length between 6nt and 500nt wherein the 2nd C Sequence is the reverse complement of the 2nd Binding Region
  • a 3rd Stopper oligonucleotide wherein the 3rd Stopper comprises, from 5'
  • each Amplicon comprises, from 5' to 3', a sequence derived from an upstream Stopper, an Insert Sequence that is the reverse complement of a subsequence between the Match Region corresponding to the downstream Stopper and the Binding Region corresponding to an upstream Stopper, a Match-Complement Sequence that is reverse complementary to the Match Region corresponding to the upstream Stopper, and an Identity-Complement Sequence that is the reverse complement of the Identity Sequence of the downstream Stopper;
  • the methods further comprise:
  • the 1st Identity Sequence comprises a 1st UMI Sequence.
  • the 3rd Identity Sequence comprises a 3rd UMI Sequence.
  • each UMI Sequence comprises a set of designed DNA sequences.
  • the UMI Sequences comprise degenerate nucleotides, selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
  • the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Hamming distance of 2, 3, 4, or 5. In some aspects, the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Levenshtein distance of 2, 3, 4, or 5.
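The degenerate-nucleotide codes and the minimum pairwise Hamming distance criterion described above can be illustrated with a minimal Python sketch. The function names (`expand_degenerate`, `hamming`, `min_pairwise_hamming`) and the example sequences are hypothetical and chosen only for illustration; they do not come from the patent.

```python
from itertools import combinations, product

# Degenerate codes exactly as enumerated in the text above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "S": "CG", "W": "AT", "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
}

def expand_degenerate(pattern):
    """Return every concrete DNA sequence encoded by a degenerate pattern."""
    return ["".join(bases) for bases in product(*(IUPAC[c] for c in pattern))]

def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_hamming(seqs):
    """Smallest Hamming distance over all pairs in a set of defined UMIs."""
    return min(hamming(a, b) for a, b in combinations(seqs, 2))

# Illustrative inputs only (not sequences from the patent).
print(expand_degenerate("SW"))
print(min_pairwise_hamming(["AAAA", "AATT", "TTAA"]))
```

A defined UMI set would satisfy the stated criterion when `min_pairwise_hamming` returns at least the chosen threshold (2, 3, 4, or 5).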
  • the 1st Stopper further comprises, from 5' to 3', a 1st A Sequence of between 3nt and 50nt, a 1st Loop Sequence of between 3nt and 70nt, and a 1st B Sequence of between 3nt and 50nt, wherein the 1st B Sequence is the reverse complement of the 1st A Sequence.
  • the potential Target or the Target comprises a 1st Match Region to the 3' of the 1st Binding Region and is complementary to the 1st B Sequence.
  • the composition or the mixture further comprises at least one Inhibitor DNA oligonucleotide having a subsequence that is reverse complementary to a subsequence of the C Sequence of a corresponding Stopper.
  • the Inhibitor DNA oligonucleotide has a subsequence at the 3' end at least 3nt long that is not reverse complementary to the corresponding Stopper.
  • the subsequence at the 3' end of Inhibitor DNA oligonucleotide forms at least one hairpin structure.
  • the Inhibitor DNA oligonucleotide comprises non-natural nucleotides.
  • the Inhibitor DNA oligonucleotide has a chemical functionalization at the 3' end that prevents polymerase extension, including but not limited to a 3-carbon spacer, an inverted nucleotide, or a minor groove binder.
  • the annealing step comprises a thermocycling program cooling from a temperature not lower than 78 °C to a temperature not higher than 25 °C.
  • the thermocycling program comprises steps that cool from 78 °C to 25 °C, wherein the solution is held within each 5 °C temperature window for at least 5 minutes. In other words, the program cools no faster than 5 °C per 5 minutes, i.e., spending >5 minutes between 73 °C and 78 °C, spending >5 minutes between 68 °C and 73 °C, etc.
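The cooling constraint above can be made concrete with a short Python sketch that enumerates the 5 °C windows between 78 °C and 25 °C and derives a lower bound on the program duration. The helper names and the handling of the final, narrower window (28 °C to 25 °C) are assumptions made for illustration; the patent does not specify how a partial window is treated.

```python
def annealing_windows(start_c=78.0, end_c=25.0, window_c=5.0):
    """List (upper, lower) temperature windows from start_c down to end_c."""
    windows = []
    upper = start_c
    while upper > end_c:
        # Assumption: the last window is clipped at end_c if it is narrower.
        lower = max(upper - window_c, end_c)
        windows.append((upper, lower))
        upper = lower
    return windows

def min_program_minutes(hold_min=5.0, **kwargs):
    """Lower bound on duration if each window is held for hold_min minutes."""
    return hold_min * len(annealing_windows(**kwargs))

for upper, lower in annealing_windows():
    print(f"hold at least 5 min between {lower:.0f} and {upper:.0f} degrees C")
print(min_program_minutes())
```

Under these assumptions the 78 °C to 25 °C ramp spans eleven windows, so the program would run for at least 55 minutes.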
  • the annealing step comprises subjecting the mixture to a temperature for between 10 minutes and 24 hours. In some aspects, the annealing step comprises subjecting the mixture to room temperature for between 10 minutes and 24 hours.
  • the incubation occurs at a temperature between 10 °C and 74 °C for between 1 second and 20 hours. In some aspects, the incubation comprises thermal cycling alternating between a temperature higher than 78 °C for between 1 second and 30 minutes and a temperature not higher than 75 °C for between 1 second and 20 hours. In some aspects, the method comprises at least 6 thermal cycles.
  • the incubation occurs at a temperature between about 10 °C and about 74 °C, between about 15 °C and about 74 °C, between about 20 °C and about 74 °C, between about 25 °C and about 74 °C, between about 30 °C and about 74 °C, between about 35 °C and about 74 °C, between about 40 °C and about 74 °C, between about 45 °C and about 74 °C, between about 50 °C and about 74 °C, between about 55 °C and about 74 °C, between about 60 °C and about 74 °C, between about 25 °C and about 65 °C, between about 30 °C and about 65 °C, between about 35 °C and about 65 °C, or any range derivable therein.
  • the incubation occurs at a temperature of about 10 °C, 15 °C, 20 °C, 25 °C, 30 °C, 35 °C, 40 °C, 45 °C, 50 °C, 55 °C, 60 °C, 65 °C, 70 °C, or 74 °C, or any value derivable therein.
  • the incubation occurs for between 1 second and 20 hours, between 30 seconds and 20 hours, between 1 minute and 20 hours, between 2 minutes and 20 hours, between 5 minutes and 20 hours, between 10 minutes and 20 hours, between 30 minutes and 20 hours, between 60 minutes and 20 hours, between 2 hours and 20 hours, between 30 seconds and 2 hours, between 60 seconds and 2 hours, between 2 minutes and 2 hours, between 5 minutes and 2 hours, between 10 minutes and 2 hours, between 30 minutes and 2 hours, or any range derivable therein.
  • the incubation occurs for at least 1 second, 10 seconds, 20 seconds, 30 seconds, 45 seconds, 60 seconds, 2 minutes, 5 minutes, 10 minutes, 20 minutes, 30 minutes, 40 minutes, 50 minutes, or 60 minutes and at most 20 hours, 15 hours, 10 hours, 5 hours, 2 hours, 1 hour, 50 minutes, 40 minutes, 30 minutes, 20 minutes, or 10 minutes.
  • the incubation occurs for 1 second, 5 seconds, 10 seconds, 20 seconds, 30 seconds, 40 seconds, 50 seconds, 60 seconds, 2 minutes, 5 minutes, 10 minutes, 20 minutes, 30 minutes, 40 minutes, 50 minutes, 60 minutes, 2 hours, 3 hours, 4 hours, 5 hours, 6 hours, 7 hours, 8 hours, 9 hours, 10 hours, 12 hours, 14 hours, 16 hours, 18 hours, 20 hours, or any value derivable therein.
  • the incubation comprises thermal cycling alternating between a temperature higher than 78 °C (e.g., 78 °C, 79 °C, 80 °C, 81 °C, 82 °C, 83 °C, 84 °C, 85 °C, 86 °C, 87 °C, 88 °C, 89 °C, 90 °C, 91 °C, 92 °C, 93 °C, 94 °C, or 95 °C) for between 1 second and 30 minutes (e.g., 1 second, 5 seconds, 10 seconds, 20 seconds, 30 seconds, 40 seconds, 50 seconds, 60 seconds, 2 minutes, 5 minutes, 10 minutes, 20 minutes, 30 minutes, or any value derivable therein) and a temperature not higher than 75 °C (e.g., 75 °C, 74 °C, 73 °C, 72 °C, 71 °C, 70 °C, 69 °C)
  • the sequencing adapters and/or sequencing indexes are appended via ligation. In some aspects, the sequencing adapters and/or sequencing indexes are appended via PCR.
  • the high-throughput sequencing is performed via sequencing-by-synthesis. In some aspects, the high-throughput sequencing is performed via electrical current measurements in conjunction with a nanopore.
  • the C Sequence of the Stopper and the Target have a standard free energy of hybridization ΔG°1 between -7 kcal/mol and -50 kcal/mol
  • the A Sequence and the B Sequence of each Stopper have a standard free energy of hybridization ΔG°2 between -2 kcal/mol and -50 kcal/mol
  • the Inhibitor and the C Sequence of the Stopper have a standard free energy of hybridization ΔG°3 between -7 kcal/mol and -50 kcal/mol.
  • the polymerase is a thermostable polymerase. In some aspects, the polymerase is not a thermostable polymerase. In some aspects, the polymerase is selected from the group consisting of Taq DNA polymerase, Bst DNA Polymerase, DNA Polymerase I, Hemo KlenTaq, Phusion, Q5, T7 DNA polymerase, and KAPA HiFi.
  • the Target is a biological DNA or RNA molecule.
  • the Target is obtained from a sample of cells, a biofluid, or a tissue.
  • the biofluid is selected from the group consisting of blood, urine, saliva, cerebrospinal fluid, interstitial fluid, and synovial fluid.
  • the tissue is a biopsy tissue or a surgically resected tissue.
  • the Target is a complementary DNA molecule generated through the reverse transcription of an RNA sample.
  • the RNA sample is a biological RNA sample.
  • the biological RNA sample is obtained from a human, animal, plant, or environmental specimen.
  • the Target is an amplicon DNA molecule generated through a DNA polymerase acting on a single-stranded DNA template. In some aspects, the amplicon DNA molecule is generated through multiple displacement amplification of a single cell DNA molecule. In some aspects, the Target is a physically, chemically, or enzymatically generated product of a biological DNA molecule. In some aspects, the Target is the product of a fragmentation process. In some aspects, the fragmentation process is ultrasonication or enzymatic fragmentation.
  • the Target is the product of a bisulfite conversion reaction, an APOBEC (“apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like”) reaction, a TAPS (TET-assisted pyridine borane sequencing) reaction, or other chemical or enzymatic reaction in which cytosine nucleotides are selectively converted to uracils based on methylation status.
  • APOBEC apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like
  • TAPS TET-assisted pyridine borane sequencing
  • the Sample comprising the Target nucleic acid is mixed with at least four Stoppers, wherein each Stopper, except the 1st Stopper, comprises, from 5' to 3', an Identity Sequence with a length between 5nt and 200nt, an A Sequence with a length between 3nt and 50nt, a Loop Sequence with a length between 3nt and 70nt, a B Sequence with a length between 3nt and 50nt, wherein the B Sequence is the reverse complement of the A Sequence, and a C Sequence with a length between 6nt and 500nt.
  • each Stopper, except the 1st Stopper, comprises, from 5' to 3', an Identity Sequence with a length between 5nt and 200nt, an A Sequence with a length between 3nt and 50nt, a Loop Sequence with a length between 3nt and 70nt, a B Sequence with a length between 3nt and 50nt, wherein the B Sequence is the reverse complement of the A Sequence
  • each Identity Sequence comprises a UMI Sequence comprising a set of designed DNA sequences, wherein the UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
  • the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a C Sequence of between 6nt and 500nt.
  • the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, a 1st A Sequence with a length between 3nt and 50nt, a Loop Sequence with a length between 3nt and 70nt, a B Sequence with a length between 3nt and 50nt, wherein the B Sequence is the reverse complement of the A Sequence, and a C Sequence with a length between 6nt and 500nt.
  • the 1st Identity Sequence comprises a 1st UMI Sequence comprising a set of designed DNA sequences, wherein the 1st UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
  • the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Hamming distance of 2, 3, 4, or 5. In some aspects, the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Levenshtein distance of 2, 3, 4, or 5.
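Unlike Hamming distance, Levenshtein distance also counts insertions and deletions, so a UMI set designed against it tolerates indel errors as well as substitutions. The following is a minimal sketch of the standard dynamic-programming computation; the function name and the example sequences are illustrative assumptions, not material from the patent.

```python
def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions) between a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]

# A one-base shift differs at every position, yet is only two edits away.
print(levenshtein("ACGT", "CGTA"))
```

This is why a set with a large minimum Hamming distance can still have a small minimum Levenshtein distance, and why the two criteria are stated separately.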
  • the potential Target or the Target comprises a Binding Region that is reverse complementary to a corresponding C Sequence for each of the at least four Stoppers. In some aspects, the potential Target or the Target comprises a Match Region that is reverse complementary to a corresponding B sequence for each of the at least four Stoppers, wherein each Match Region is located to the 3' of the Binding Region corresponding to the same Stopper. In some aspects, the potential Target or the Target comprises a Match Region that is reverse complementary to a corresponding B Sequence.
  • FIG. 1 Reagent components.
  • the three dotted frames denote the 1st Stopper, 2nd Stopper and the 3rd Stopper, respectively.
  • the gray arrows on the right side of the Stoppers and the left side of the Target denote the 3' end of the oligonucleotides.
  • the 1st Stopper has, from 5' to 3', a 1st Identity Sequence and a 1st C Sequence.
  • the 2nd Stopper has, from 5' to 3', a 2nd Identity Sequence, a 2nd A Sequence, a 2nd Loop Sequence, a 2nd B Sequence and a 2nd C Sequence; the 2nd A Sequence is reverse complementary to the 2nd B Sequence.
  • the 2nd Identity Sequence also has a 2nd UMI Sequence, which is composed of a set of DNA sequences.
  • the 3rd Stopper has, from 5' to 3', a 3rd Identity Sequence, a 3rd A Sequence, a 3rd Loop Sequence, a 3rd B Sequence and a 3rd C Sequence; the 3rd A Sequence is reverse complementary to the 3rd B Sequence.
  • From 5' to 3', the Target has a 3rd Binding Region, a 2nd Binding Region, a 2nd Match Region, and a 1st Binding Region.
  • the 1st Binding Region, 2nd Binding Region and 3rd Binding Region are the reverse complements of the 1st C Sequence, the 2nd C Sequence, and the 3rd C Sequence, respectively.
  • the 2nd Match Region and 3rd Match Region are reverse complementary to the 2nd B Sequence and 3rd B Sequence, respectively.
  • the 2nd Loop Sequence and the 3rd Loop Sequence are illustrated as arcs on the right of the hairpins.
  • the system also includes a template-dependent polymerase.
  • FIG. 2 Exemplary embodiment of the disclosure. Besides the 2nd Stopper, the 1st and 3rd Identity Sequences could also have a 1st and 3rd UMI Sequence, each composed of a set of DNA sequences.
  • FIG. 3 Haplotype phasing.
  • short-read NGS sequencing, whose read length is usually less than 600nt, is unable to identify haplotype phasing efficiently and accurately.
  • the former Amplicon generated from the 1st Stopper and the 2nd Stopper would cover genetic locus 1, and the latter Amplicon from the 2nd Stopper and the 3rd Stopper carries sequence information of locus 2.
  • FIG. 4 Current limitation in mapping full-length sequence using Next-Generation Sequencing.
  • the Targets will be randomly fragmented and sequenced, then all the reads would be assembled in an attempt to map back to the original molecule. Because there is no linkage between fragments and reads, it is difficult to distinguish reads from the original wildtype Target from those from the original variant Target, and thus challenging to reconstruct the original molecules.
  • FIG. 5 Necessity of long-range chimeric amplicon sequencing.
  • the amplicons generated would contain two UMIs: the former UMI of an amplicon matches the latter UMI of the previous amplicon, and the latter UMI of the amplicon matches the former UMI of the next amplicon.
  • the reads covering amplicons from different regions could be concatenated via the UMI assembly. Then each molecule’s full sequence could be easily mapped back to the sequence of the original wildtype or variant Target.
  • FIG. 6 Mechanism of the long-range chimeric amplicon concatenation.
  • the chimeric 1st-2nd Amplicon generated between the 1st Stopper and 2nd Stopper has, from 5' to 3', the sequence of the 1st Stopper, the 1st-2nd Insert Sequence, the 2nd Match-Complement Sequence, and the 2nd Identity-Complement Sequence.
  • the 1st-2nd Amplicon comprises both the 1st UMI Sequence and the 2nd UMI Sequence, called one unique molecular pair on the 1st-2nd Amplicon.
  • the 2nd-3rd Amplicon from the 2nd Stopper and 3rd Stopper has a similar structure: from 5' to 3', it has the sequence of the 2nd Stopper, the 2nd-3rd Insert Sequence, the 3rd Match-Complement Sequence, and the 3rd Identity-Complement Sequence.
  • the 2nd-3rd Amplicon has the unique molecular pair composed of the 2nd UMI Sequence near its 5' end and the 3rd UMI Sequence close to its 3' end.
  • FIGS. 7A-C Experimental validation of the pairing efficiency.
  • mixture 1 and mixture 2 were pre-annealed.
  • mixture 1 was composed of 1st Stopper 1a, 2nd Stopper 2a and the Target 1 nucleic acid.
  • mixture 2 was composed of 1st Stopper 1b, 2nd Stopper 2b and the Target 1 nucleic acid. Then mixture 1 and mixture 2 were mixed together and proceeded to the next steps. The detailed sequence is shown in the FIG.
  • the 1st Stopper 1a and the 1st Stopper 1b have the same structure and sequence except for the sequence of the 1st UMI Sequence: the 1st Stopper 1a has the 1st UMI Sequence 1a, and the 1st Stopper 1b has the 1st UMI Sequence 1b.
  • the 2nd Stopper 2a and the 2nd Stopper 2b have the same structure and sequence except that the 2nd Stopper 2a has the 2nd UMI Sequence 2a, and the 2nd Stopper 2b has the 2nd UMI Sequence 2b.
  • FIGS. 8A-C Experimental validation of long-range chimeric amplicon concatenation. As shown in FIG. 8A, the reaction system is composed of the 1st Stopper, 2nd Stopper, and 3rd Stopper, which all have binding regions on the Target nucleic acid.
  • the ideal and preferred products are both the 1st-2nd Amplicon and 2nd-3rd Amplicon, which share the same 2nd UMI Sequence from one Target. There would be some unwanted or side products.
  • the missing attachment of Stopper 2 would lead to the generation of the 1st-3rd WT Amplicon; likewise, the absence of Stopper 1 or Stopper 3 would result in an unpaired 1st-2nd Amplicon or 2nd-3rd Amplicon. No reads indicating the existence of the 1st-3rd WT Amplicon were found. Moreover, the UMI family size of the 1st-2nd Amplicons and the UMI family size of the 2nd-3rd Amplicons are close (FIG. 8A).
  • FIGS. 9A-C Experimental validation of variant skipping detection.
  • the reaction system comprises the 1st Stopper, 2nd Stopper, 3rd Stopper and the wildtype Target; thus the 1st-2nd Amplicon and the 2nd-3rd Amplicon would be generated.
  • FIG. 9B when the 1st Stopper, 2nd Stopper, and 3rd Stopper are mixed with the variant Target, only the 1st-3rd Var Amplicon would be generated.
  • FIG. 9C when mixing the wildtype and the variant Target together, the reads and the UMI family sizes of the 1st-2nd Amplicon and the 2nd-3rd Amplicon were close to those in the wildtype-only group. Also, the reads of the 1st-3rd Var Amplicon were close to those in the variant-only group.
  • FIGS. 10A-C MATLAB simulation of the number of reads and molecules needed for N amplicons.
  • the results of the MATLAB simulation show that the number of reads needed grows non-linearly, closer to exponentially, as the number of Amplicons goes up.
  • when the number of Amplicons is 5, only 0.089M reads are necessary to sequence all the Amplicons with a sufficient number of continuously linked UMI families (100X) and good sequencing depth (15X) on each UMI family.
  • when the number of Amplicons becomes 25, 559M reads are necessary for optimal sequencing coverage, which is 6280-fold the reads needed for 5 Amplicons.
  • FIG. 10B The number of input molecules needed is shown in the right panel in FIG. 10A, with a growth curve similar to that in the left panel.
  • when the number of Amplicons is 5, only 0.6k input molecules are necessary to obtain enough continuously linked UMI families (100X), but when the number of Amplicons becomes 25, 745k input molecules are necessary for enough continuously linked UMI families, which is 1241-fold the molecules needed for 5 Amplicons.
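The fold changes quoted above follow directly from the stated read and molecule counts, as the short Python check below shows. The final line is an illustration only: it assumes purely exponential growth between 5 and 25 Amplicons, which the text suggests ("more like an exponential growth") but does not state as a model.

```python
# Values quoted in the text for 5 and 25 Amplicons.
reads_5, reads_25 = 0.089e6, 559e6
molecules_5, molecules_25 = 0.6e3, 745e3

read_fold = reads_25 / reads_5              # roughly 6280-fold
molecule_fold = molecules_25 / molecules_5  # roughly 1241-fold

# Assumption: if growth were purely exponential, this would be the
# multiplicative increase in reads per additional Amplicon (20 steps).
per_amplicon_read_factor = read_fold ** (1 / (25 - 5))

print(round(read_fold), round(molecule_fold, 1), round(per_amplicon_read_factor, 2))
```

Under that assumption, each additional Amplicon would raise the read requirement by roughly half again.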
  • FIG. 10C Details of the simulation are shown in FIG.
  • FIGS. 11A-B Bioinformatics pipeline of single-read sequencing. If the length of the Sequencing-Insert of the library, i.e., the total length of the library minus the length of the Sequencing adaptor, is less than the sequencing length (typically, it ranges from 75nt to 600nt), then single-read sequencing is enough.
  • FIG. 11A a cartoon illustrates that short Amplicons are able to obtain full-length coverage of sequence information in the Target region. As shown in FIG.
  • each read should be composed of the UMI left, the Insert Sequence, the UMI right and the Sequencing adaptor sequence when the length of the Sequencing-Insert of the library is less than the sequencing length. Then, the adaptor sequence of each read would be trimmed. The trimmed reads would then be grouped by the same Insert Sequence first, then be clustered into different molecules based on the same UMI set (UMI left + UMI right). Finally, molecules with different Insert Sequences are linked together based on the same UMI sequence. For example, as shown in the bottom panel, if the UMI right sequence of a molecule with Insert Sequence 1 is the same as the UMI left sequence of another molecule with Insert Sequence 2, then the two molecules originated from the same Target and could be linked together.
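The grouping, clustering, and linking steps described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline used by the inventors: it assumes reads have already been adaptor-trimmed and parsed into `(umi_left, insert_id, umi_right)` tuples, that UMIs are compared by exact match in a common orientation, and that no UMI error correction is applied. All names and the toy reads are hypothetical.

```python
from collections import defaultdict

def cluster_molecules(reads):
    """Group reads by Insert Sequence and UMI set; return family sizes."""
    families = defaultdict(int)
    for umi_left, insert_id, umi_right in reads:
        families[(insert_id, umi_left, umi_right)] += 1
    return dict(families)

def link_molecules(families):
    """Link molecules whose UMI right equals another molecule's UMI left."""
    by_left = defaultdict(list)
    for key in families:
        _, umi_left, _ = key
        by_left[umi_left].append(key)
    links = []
    for key in families:
        insert_id, _, umi_right = key
        for candidate in by_left.get(umi_right, []):
            if candidate[0] != insert_id:  # only link different Inserts
                links.append((key, candidate))
    return links

# Toy reads: two amplicons from one Target share UMI "U2"; one is unrelated.
reads = (
    [("U1", "insert1", "U2")] * 3
    + [("U2", "insert2", "U3")] * 2
    + [("U9", "insert1", "U8")]
)
families = cluster_molecules(reads)
links = link_molecules(families)
print(families)
print(links)
```

Here the family sizes play the role of per-molecule read depth, and each link joins two neighboring amplicons that are inferred to originate from the same Target molecule.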
  • FIGS. 12A-B Bioinformatics pipeline of paired-end sequencing.
  • the length of the Sequencing-Insert of the library, i.e., the total length of the library minus the length of the Sequencing adaptor, is more than the sequencing length (typically, it ranges from 75nt to 600nt).
  • paired-end sequencing is necessary to sequence UMIs at both sides of the library.
  • a cartoon shown in FIG. 12A illustrates that sequencing of long Amplicons would only achieve partial coverage of sequence information in the Target region. As shown in FIG.
  • each R1 should be composed of the UMI left and a partial sequence of the Insert Sequence
  • each R2 is composed of the UMI right and a partial sequence of the Insert Sequence.
  • Each UMI set is a UMI left and a UMI right from a pair of R1 and R2. Then each pair of reads would be aligned back to the Target sequence, then be grouped by the same alignment result. Next, each group with the same alignment result would be clustered into different molecules based on the same sequence of UMI set. Finally, molecules with different alignments are linked together based on the same UMI sequence. For example, as shown in the bottom panel, if the UMI right sequence of a molecule with Insert Sequence 1 is the same as the UMI left sequence of another molecule with Insert Sequence 2, then the two molecules originated from the same Target and could be linked together.
  • FIG. 13 Two embodiments of the invention.
  • the system may further include at least one Inhibitor oligonucleotide.
  • the vertical line on the left side of each Inhibitor denotes the 3' of Inhibitor is non-extensible because of either the rationally designed sequence or chemical modification.
  • the Inhibitor is fully reverse complementary to the corresponding C Sequence of the Stopper.
  • the Inhibitor is the reverse complement of partial corresponding C Sequence of the Stopper.
  • FIGS. 14A-B Two embodiments of the 1st Stopper and the Target
  • the 1st Stopper could further comprise a 1st A Sequence, a 1st Loop Sequence, and a 1st B Sequence.
  • the 1st A Sequence is reverse complementary to the 1st B Sequence.
  • the Target further comprises a 1st Match Region that is reverse complementary to the 1st B Sequence.
  • FIGS. 15A-B Embodiment of (N-1) concatenated chimeric Amplicons.
  • N-1 Amplicons generated: 1st-2nd Amplicon, 2nd-3rd Amplicon, ... (N-1)th-Nth Amplicon.
  • the 1st-2nd Amplicon has a hairpin structure and sequence.
  • reagents and methods for achieving long-range DNA sequencing through concatenating chimeric amplicon reads. Each chimeric amplicon read is formed by template switching on a pattern of sequence complementarities. Chimeric amplicon reads can be concatenated by sharing the same 5' or 3' UMI sequence with neighboring reads and being mapped back to the original molecule with an accurate molecule count.
  • NGS Next-Generation Sequencing
  • This technology solves the short-read sequencing problem of NGS by extending the total length of the sequence-able molecule. This technology also takes advantage of the high accuracy in NGS to compensate for the low sequencing quality of long-read sequencing, like Third-Generation Sequencing.
  • compositions and methods for achieving long-range DNA sequencing using short-read NGS (Illumina) sequencing allow for sequencing molecules with lengths up to 5000nt using typical short-read (75nt to 600nt) sequencing.
  • Unique Molecular Identifiers (UMI) can tag and quantify each original molecule. The comprehensive construction of the whole length of the original molecule is achieved by concatenating the UMI barcoded short reads. This provides high-quality sequence information for the full length of long molecules as well as quantifying the initial molecule number in the NGS short-read platform with high accuracy.
  • Targets are randomly fragmented and sequenced. Then, all the reads are assembled with an attempt to map back to the original molecule. Due to the lack of linkage between each fragment and read, it can be difficult to identify whether the reads were derived from the original wildtype Target or from a variant Target. For example, as illustrated in FIG. 4, if Target 1 has Regions 1, 3, 4, and 6, and Target 2 has Regions 1, 2, 3, 4, 5, and 6, it is challenging to reconstruct the original wildtype and variant Target molecules. As such, the lack of information denoting the origin of each sheared fragment leads to complex alignment of sequenced reads and uncertainty in the mapped results.
  • every amplicon generated contains two UMIs, where the upstream UMI of the amplicon is matched with the downstream UMI of the next upstream amplicon, and the downstream UMI of the amplicon is matched with the upstream UMI of the next downstream amplicon (FIG. 5).
  • the reads covering amplicons from different regions can be concatenated via UMI assembly.
  • each molecule’s full sequence can be easily mapped back to the sequence of the original wildtype Target or variant Target.
  • Exemplary detailed mechanisms of concatenating chimeric amplicon reads are shown in FIG. 6.
  • the reagent systems provided herein comprise a 1st Stopper, 2nd Stopper, and 3rd Stopper, which all have binding regions on the Target nucleic acid.
  • the Target is defined as the nucleic acid molecule to which the Stoppers bind, and which serves as the initial template for the polymerase extension.
  • the 1st Stopper, 2nd Stopper and the 3rd Stopper are rationally designed to bind to the Target sequence and have a pattern of sequence complementarities so that the polymerase can achieve template switching during the extension. Any additional 5' sequences on the Stopper are incorporated into the Amplicon, in addition to the sequences on the Target between the former Stopper and the binding region of Target and the latter Stopper.
  • the 1st Stopper is a nucleic acid species that comprises, from 5' to 3', a 1st Identity Sequence and a 1st C Sequence (FIG. 1).
  • the 1st Identity Sequence may comprise a 1st UMI Sequence, which comprises a set of designed DNA sequences having at least 100 members (FIG. 2).
  • the 1st Stopper may further comprise a 1st A Sequence, a 1st Loop Sequence, and a 1st B Sequence, where the 1st A Sequence is reverse complementary to the 1st B Sequence (FIG. 14).
  • the 2nd Stopper is a nucleic acid species that comprises, from 5' to 3', a 2nd Identity Sequence, a 2nd A Sequence, a 2nd Loop Sequence, a 2nd B Sequence, and a 2nd C Sequence (FIG. 1).
  • the 2nd A Sequence is reverse complementary to the 2nd B Sequence, such that the 2nd A Sequence and the 2nd B Sequence form a hairpin stem, with the 2nd Loop Sequence between the 2nd A Sequence and the 2nd B Sequence.
  • the 2nd Identity Sequence comprises a 2nd UMI Sequence, which is composed of a set of DNA sequences having at least 100 members.
  • the 3rd Stopper is a nucleic acid species that comprises, from 5' to 3', a 3rd Identity Sequence, a 3rd A Sequence, a 3rd Loop Sequence, a 3rd B Sequence, and a 3rd C Sequence.
  • the 3rd A Sequence is reverse complementary to the 3rd B Sequence, with the 3rd Loop Sequence between the 3rd A Sequence and the 3rd B Sequence.
  • the 3rd Identity Sequence may comprise a 3rd UMI Sequence, which comprises a set of designed DNA sequences having at least 100 members.
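The 5'-to-3' layout of a hairpin-forming Stopper described above can be sketched in Python. This is a minimal illustration, not the disclosed design procedure: the `Stopper` class, its field names, and the example sequences are hypothetical, and the stem check assumes 100% complementarity between the A Sequence and the B Sequence.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class Stopper:
    """Hypothetical container for the parts of a 2nd or 3rd Stopper."""
    identity: str  # Identity Sequence (may carry the UMI Sequence)
    a: str         # A Sequence
    loop: str      # Loop Sequence
    b: str         # B Sequence
    c: str         # C Sequence (binds the Binding Region of the Target)

    def sequence(self) -> str:
        """Full oligonucleotide, 5' to 3': Identity, A, Loop, B, C."""
        return self.identity + self.a + self.loop + self.b + self.c

    def has_hairpin_stem(self) -> bool:
        """True when A is the exact reverse complement of B."""
        return bool(self.a) and self.a == reverse_complement(self.b)


# Illustrative sequences only (assumptions, not taken from the disclosure).
second_stopper = Stopper(
    identity="ACGTACGT",
    a="GATTACA",
    loop="TTTT",
    b=reverse_complement("GATTACA"),
    c="CCGGAATT",
)
```

Here `second_stopper.has_hairpin_stem()` is true because the B Sequence was built as the reverse complement of the A Sequence; a design with partial complementarity would need a looser comparison than the exact match used here.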
  • the length of a C sequence is between 6nt and 500nt, between 6nt and 400nt, between 6nt and 300nt, between 6nt and 200nt, between 6nt and 100nt, between 6nt and 75nt, between 6nt and 50nt, between 6nt and 25nt, between 6nt and 15nt, between 15nt and 500nt, between 15nt and 400nt, between 15nt and 300nt, between 15nt and 200nt, between 15nt and 100nt, between 15nt and 75nt, between 15nt and 50nt, between 15nt and 25nt, between 30nt and 500nt, between 30nt and 400nt, between 30nt and 300nt, between 30nt and 200nt, between 30nt and 100nt, between 30nt and 75nt, between 30nt and 50nt, or any range derivable therein.
  • the length of a C sequence is at least 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 20nt, 25nt, 30nt, 40nt, 50nt, 60nt, 70nt, 80nt, 90nt, 100nt, 150nt, 200nt, 250nt, 300nt, 350nt, 400nt, or 450nt and at most 500nt, 450nt, 400nt, 350nt, 300nt, 250nt, 200nt, 150nt, 100nt, 90nt, 80nt, 70nt, 60nt, 50nt, 40nt, 30nt, 25nt, 20nt, or 15nt.
  • the length of the Binding Region is 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 20nt, 25nt, 30nt, 40nt, 50nt, 60nt, 70nt, 80nt, 90nt, 100nt, 150nt, 200nt, 250nt, 300nt, 350nt, 400nt, 450nt, or 500nt, or any value derivable therein.
  • the length of an A sequence or a B sequence is between 3nt and 50nt, between 3nt and 40nt, between 3nt and 30nt, between 3nt and 25nt, between 3nt and 20nt, between 3nt and 15nt, between 3nt and 10nt, between 3nt and 5nt, between 5nt and 50nt, between 5nt and 40nt, between 5nt and 30nt, between 5nt and 25nt, between 5nt and 20nt, between 5nt and 15nt, between 5nt and 10nt, between 10nt and 50nt, between 10nt and 40nt, between 10nt and 30nt, between 10nt and 25nt, between 10nt and 20nt, between 10nt and 15nt, between 15nt and 50nt, between 15nt and 40nt, between 15nt and 30nt, between 15nt and 30nt, between
  • the length of an A sequence or B sequence is 3nt, 4nt, 5nt, 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 16nt, 17nt, 18nt, 19nt, 20nt, 21nt, 22nt, 23nt, 24nt, 25nt, 26nt, 27nt, 28nt, 29nt, 30nt, 31nt, 32nt, 33nt, 34nt, 35nt, 36nt, 37nt, 38nt, 39nt, 40nt, 41nt, 42nt, 43nt, 44nt, 45nt, 46nt, 47nt, 48nt, 49nt, or 50nt.
  • the length of an Identity Sequence is between 5nt and 200nt, between 5nt and 150nt, between 5nt and 100nt, between 5nt and 75nt, between 5nt and 50nt, between 5nt and 25nt, between 5nt and 15nt, between 15nt and 200nt, between 15nt and 150nt, between 15nt and 100nt, between 15nt and 75nt, between 15nt and 50nt, between 15nt and 25nt, between 30nt and 200nt, between 30nt and 150nt, between 30nt and 100nt, between 30nt and 75nt, between 30nt and 50nt, or any range derivable therein.
  • the length of an Identity Sequence is at least 5nt, 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 20nt, 25nt, 30nt, 40nt, 50nt, 60nt, 70nt, 80nt, 90nt, 100nt, 150nt, or 175nt and at most 200nt, 150nt, 100nt, 90nt, 80nt, 70nt, 60nt, 50nt, 40nt, 30nt, 25nt, 20nt, or 15nt.
  • the length of the First Sequence is 5nt, 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 20nt, 25nt, 30nt, 40nt, 50nt, 60nt, 70nt, 80nt, 90nt, 100nt, 110nt, 120nt, 130nt, 140nt, 150nt, 160nt, 170nt, 180nt, 190nt, 200nt, or any value derivable therein.
  • the length of a Loop Sequence is between 3nt and 70nt, between 3nt and 60nt, between 3nt and 50nt, between 3nt and 40nt, between 3nt and 30nt, between 3nt and 25nt, between 3nt and 20nt, between 3nt and 15nt, between 3nt and 10nt, between 3nt and 5nt, between 5nt and 70nt, between 5nt and 60nt, between 5nt and 50nt, between 5nt and 40nt, between 5nt and 30nt, between 5nt and 25nt, between 5nt and 20nt, between 5nt and 15nt, between 5nt and 10nt, between 10nt and 70nt, between 10nt and 60nt, between 10nt and 50nt, between 10nt and 40nt, between 10nt and 30nt, between 10nt and 25nt,
  • the length of a Loop Sequence is 3nt, 4nt, 5nt, 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 16nt, 17nt, 18nt, 19nt, 20nt,
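As a rough aid for checking a candidate design against the length ranges listed above, the following sketch encodes only the broadest range recited for each Stopper part (Identity 5nt to 200nt, A and B 3nt to 50nt, Loop 3nt to 70nt, C 6nt to 500nt). The dictionary and function names are hypothetical, and the narrower sub-ranges are deliberately not modeled.

```python
# Broadest length ranges (in nt) recited above for each Stopper part.
BROADEST_RANGES_NT = {
    "identity": (5, 200),
    "a": (3, 50),
    "b": (3, 50),
    "loop": (3, 70),
    "c": (6, 500),
}


def out_of_range_parts(part_lengths):
    """Return the names of parts whose length falls outside the broadest range.

    part_lengths maps a part name ("identity", "a", "b", "loop", "c")
    to its length in nucleotides.
    """
    problems = []
    for part, length in part_lengths.items():
        low, high = BROADEST_RANGES_NT[part]
        if not low <= length <= high:
            problems.append(part)
    return problems
```

For example, `out_of_range_parts({"a": 2, "c": 600})` reports both parts, since a 2nt stem and a 600nt C Sequence fall outside the recited ranges.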
  • a portion of the Loop Sequence is complementary to a region of the Target nucleic acid positioned immediately 3' of the corresponding Match Region.
  • This portion of the Loop Sequence may have a length of 1nt, 2nt, 3nt, 4nt, 5nt, 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 16nt, 17nt, 18nt, 19nt, 20nt, or more.
  • the Target is a nucleic acid species that comprises, from 5' to 3', a 3rd Binding Region, a 3rd Match Region, a 2nd Binding Region, a 2nd Match Region, and a 1st Binding Region.
  • the 1st Binding Region, 2nd Binding Region, and the 3rd Binding Region are the reverse complements of the 1st C Sequence, the 2nd C Sequence, and the 3rd C Sequence, respectively.
  • the 2nd Match Region and 3rd Match Region are the reverse complements of the 2nd B Sequence and 3rd B Sequence, respectively.
  • the 3rd A Sequence is rationally designed to have the same sequence as the 3rd Match Region, and the 3rd Match Region is reverse complementary to the 3rd B Sequence of the 3rd Stopper.
  • the 2nd A Sequence is rationally designed to have the same sequence as the 2nd Match Region, and the 2nd Match Region is reverse complementary to the 2nd B Sequence of the 2nd Stopper.
  • the 2nd Loop Sequence and the 3rd Loop Sequence are illustrated as arcs on the right of the hairpins. If the 1st Stopper has a 1st B sequence, then the Target may further comprise a 1st Match Region that is reverse complementary to the 1st B Sequence.
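The complementarity relationships between a Stopper and the Target can be checked with a short sketch. Because the Binding Region is the reverse complement of the C Sequence and the Match Region is the reverse complement of the B Sequence, the footprint on the Target strand reads, 5' to 3', Binding Region followed by Match Region. The function names and example sequences below are hypothetical, the Match Region is assumed to sit directly 3' of the Binding Region as in the Target layout listed above, and exact (100%) complementarity is assumed.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate_stopper(target, a_seq, b_seq, c_seq):
    """Locate a Stopper footprint on the Target strand.

    Returns (binding_start, match_start, match_end) as 0-based indices into
    the Target, or None when the footprint is absent or the A Sequence does
    not equal the Match Region.
    """
    binding_region = reverse_complement(c_seq)
    match_region = reverse_complement(b_seq)
    start = target.find(binding_region + match_region)
    if start < 0 or a_seq != match_region:
        return None
    match_start = start + len(binding_region)
    return start, match_start, match_start + len(match_region)


# Illustrative sequences only (assumptions, not taken from the disclosure).
a_seq = "GATTACA"
b_seq = reverse_complement(a_seq)
c_seq = "CCGGAATT"
target = "TT" + reverse_complement(c_seq) + reverse_complement(b_seq) + "CC"
```

With these inputs, `locate_stopper(target, a_seq, b_seq, c_seq)` finds the Binding Region starting two bases into the Target, and returns `None` if the A Sequence is changed so that it no longer matches the Match Region.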
  • the system also includes a template-dependent polymerase.
  • the Target which serves as the initial template for the polymerase extension, is the nucleic acid molecule with which the Stopper hybridizes.
  • the downstream Stopper oligonucleotides are rationally designed to hybridize to the Target sequence and have a pattern of sequence complementarities such that the extending polymerase switches to recognizing the Stopper as the template at the loci where the Target and the Stopper are bound. Any additional 5' sequences on the Stopper are incorporated into the Amplicon, in addition to the sequences on the Target between, and including, the upstream Binding Region and the Match Region.
  • the polymerase extends the 1st Stopper along the Target and then switches templates with the help of the 2nd Stopper, generating a 1st-2nd Amplicon.
  • the polymerase generates the 2nd-3rd Amplicon based on the 2nd and the 3rd Stopper.
  • the downstream UMI of the 1st-2nd Amplicon will be the reverse complement of the 2nd UMI Sequence of the 2nd Stopper, which can be used to define the 1st-2nd-3rd UMI linked molecules (FIG. 6).
  • the mechanism of induced template switching is as follows. First, the upstream Stopper is bound to the Target. The base pairs formed between the Target and the downstream Stopper at the Match Region will form base stacks with the base pairs formed by the Stopper’s hairpin comprising the A Sequence and the B Sequence. When the polymerase recognizes the Target as the template and extends the upstream Stopper to the Target-downstream Stopper binding junction, the polymerase finishes extending on the Match Region but is unable to further extend, due to the crossover geometry present.
  • the multi-stranded molecule can spontaneously rearrange via branch migration such that the 3’ end of the polymerase extension product bridges over the crossover junction and binds to the A Sequence of the downstream Stopper.
  • the polymerase is then able to continue extending, now recognizing the downstream Stopper as the template.
  • the chimeric Amplicon finishes extension, and has its 5' sequence dependent on the upstream Stopper and the Target and its 3' sequence dependent on the downstream Stopper. If the B Sequence of the Stopper is not complementary to the Match Region, then the spontaneous rearrangement is not possible, and polymerase extension stalls at the locus where the downstream Stopper binds the Target. This is described in further detail in U.S. Provisional Application No. 63/182,154, filed April 30, 2021, which is incorporated by reference herein in its entirety.
  • the chimeric 1st-2nd Amplicon generated between 1st Stopper and 2nd Stopper, from 5' to 3', has the sequence of the 1st Stopper, the 1st-2nd Insert Sequence, the 2nd Match-Complement Sequence, and the 2nd Identity-Complement Sequence (FIG. 6).
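The 5'-to-3' composition of a chimeric Amplicon can be illustrated by assembling it from its parts. This is a simplified sketch under stated assumptions: the function and argument names are hypothetical, the Target-derived inputs are written 5' to 3' on the Target strand, extension is treated as error-free copying, and template switching is assumed to occur exactly at the end of the Match Region.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def chimeric_amplicon(upstream_stopper, target_insert, match_region,
                      downstream_identity):
    """Assemble a chimeric Amplicon, 5' to 3'.

    upstream_stopper    -- full sequence of the upstream Stopper (the primer)
    target_insert       -- Target sequence between the downstream Match Region
                           and the upstream Binding Region
    match_region        -- downstream Match Region on the Target
    downstream_identity -- Identity Sequence of the downstream Stopper
    """
    insert_sequence = reverse_complement(target_insert)       # copied from Target
    match_complement = reverse_complement(match_region)       # copied from Target
    identity_complement = reverse_complement(downstream_identity)  # copied from Stopper
    return (upstream_stopper + insert_sequence
            + match_complement + identity_complement)
```

Because the polymerase copies the downstream Stopper only after the switch, the downstream Identity Sequence (and any UMI it carries) appears in the Amplicon as its reverse complement, which is the property the UMI linkage relies on.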
  • the 1st-2nd Amplicon comprises both the 1st UMI Sequence and the reverse complement of the 2nd UMI Sequence, called one unique molecular pair on 1st-2nd Amplicon.
  • the 2nd-3rd Amplicon from the 2nd Stopper and 3rd Stopper has a similar structure: from 5' to 3’, it has the sequence of the 2nd Stopper, the 2nd-3rd Insert Sequence, the 3rd Match-Complement Sequence, and the 3rd Identity-Complement Sequence.
  • the 2nd-3rd Amplicon has the unique molecular pair composed of the 2nd UMI Sequence near its 5' end and the reverse complement of the 3rd UMI Sequence close to its 3' end.
  • the 1st-2nd Amplicon comprises the 1st UMI Sequence at its 5' end, and the reverse complement of 2nd UMI Sequence at its 3' end, which are defined as a unique molecular pair.
  • the 2nd-3rd Amplicon also comprises the unique molecular pair: the 2nd UMI Sequence at its 5' end, the reverse complement of the 3rd UMI Sequence at its 3' end.
  • if the unique molecular pair of the 1st-2nd Amplicon has the reverse complement of the 2nd UMI Sequence that is present on the 2nd-3rd Amplicon, then these 2 Amplicons would be identified as neighboring amplicons from the same original Target molecule (FIG. 6).
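The neighbor rule above (the 3' UMI of one Amplicon is the reverse complement of the 5' UMI of the next) lends itself to a simple chaining procedure. The sketch below is one possible way to group reads, not the disclosed analysis pipeline: the `Amplicon` tuple and `chain_amplicons` function are hypothetical, UMIs are assumed to have been extracted without sequencing errors, and each UMI is assumed to be unique across the data set.

```python
from typing import NamedTuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


class Amplicon(NamedTuple):
    name: str
    umi_5p: str  # UMI Sequence of the upstream Stopper, read near the 5' end
    umi_3p: str  # reverse complement of the downstream Stopper's UMI, near the 3' end


def chain_amplicons(amplicons):
    """Group Amplicons into ordered chains, one chain per original molecule."""
    by_upstream_umi = {amp.umi_5p: amp for amp in amplicons}
    has_upstream_neighbor = {reverse_complement(amp.umi_3p) for amp in amplicons}
    chains = []
    for start in amplicons:
        if start.umi_5p in has_upstream_neighbor:
            continue  # not the first Amplicon of its chain
        chain, seen = [start], {start.umi_5p}
        nxt = by_upstream_umi.get(reverse_complement(start.umi_3p))
        while nxt is not None and nxt.umi_5p not in seen:
            chain.append(nxt)
            seen.add(nxt.umi_5p)
            nxt = by_upstream_umi.get(reverse_complement(nxt.umi_3p))
        chains.append([amp.name for amp in chain])
    return chains


# Two molecules, each yielding a 1st-2nd and a 2nd-3rd Amplicon (made-up UMIs).
reads = [
    Amplicon("A_2nd-3rd", "CCGT", reverse_complement("GGTA")),
    Amplicon("B_1st-2nd", "TTTG", reverse_complement("GACT")),
    Amplicon("A_1st-2nd", "AAAC", reverse_complement("CCGT")),
    Amplicon("B_2nd-3rd", "GACT", reverse_complement("CAGG")),
]
```

Calling `chain_amplicons(reads)` groups the four reads into two chains, each ordered from the 1st-2nd Amplicon to the 2nd-3rd Amplicon, regardless of the order in which the reads arrive. Real data would additionally need tolerance for UMI sequencing errors, which this sketch omits.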
  • the system further comprises at least one Inhibitor oligonucleotide (FIG. 13).
  • the vertical line on the left side of each Inhibitor indicates that the 3' of the Inhibitor is non-extensible because of either a rationally designed sequence or chemical modification.
  • the Inhibitor may be fully reverse complementary to the corresponding C Sequence of the Stopper.
  • the Inhibitor may be the reverse complement of part of the corresponding C Sequence of the Stopper.
  • At least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least 12, at least 14, at least 16, at least 18, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, or at least 50 different Stoppers can be rationally designed along a Target.
  • with N Stoppers, there would be N-1 Amplicons generated: 1st-2nd Amplicon, 2nd-3rd Amplicon, 3rd-4th Amplicon, ... (N-1)th-Nth Amplicon (FIG. 15).
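The relationship between the number of Stoppers and the number of Amplicons can be made concrete with a small helper that enumerates the N-1 neighboring pairs. The function names are hypothetical and the sketch only reproduces the naming pattern used above.

```python
def ordinal(n):
    """Return an English ordinal label such as '1st', '2nd', '3rd', '4th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def amplicon_names(num_stoppers):
    """List the Amplicons expected from num_stoppers consecutive Stoppers."""
    return [f"{ordinal(i)}-{ordinal(i + 1)}" for i in range(1, num_stoppers)]
```

For instance, four Stoppers give three Amplicons (1st-2nd, 2nd-3rd, 3rd-4th), and a single Stopper gives none.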
  • this long-range DNA sequencing method could be used to profile RNA isoforms from alternative splicing.
  • Each Stopper would be rationally designed to target specific exons. If an exon of a transcript is skipped or a subregion of a transcript is alternatively spliced, there would be one or more Stoppers failing to bind to the Target.
  • MET exon 14 skipping is a known marker of NSCLC (non-small cell lung cancer).
  • the technology provided herein could be used to detect the MET exon 14 skipping by rationally designing Stoppers to target exon 13, 14, and 15 of MET. If exon 14 of MET is skipped, then the middle Stopper would fail to bind to the Target, and variant Amplicon generated from the first and last Stoppers would be detected and sequenced.
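The exon-skipping readout described above can be expressed as a small inference step: if an Amplicon joins two Stoppers that are not adjacent in the design, the exons targeted by the Stoppers in between are inferred to be absent from that transcript. The function name and input representation are hypothetical; each observed Amplicon is assumed to have already been assigned to the pair of exons whose Stoppers produced it.

```python
def skipped_exons(designed_exons, observed_junctions):
    """Infer skipped exons from the Stopper pairs seen in chimeric Amplicons.

    designed_exons     -- exons targeted by Stoppers, in transcript order
    observed_junctions -- (exon_i, exon_j) pairs, one per observed Amplicon
    """
    position = {exon: index for index, exon in enumerate(designed_exons)}
    skipped = set()
    for first, second in observed_junctions:
        low, high = sorted((position[first], position[second]))
        # Exons strictly between the two joined Stoppers were bypassed.
        skipped.update(designed_exons[low + 1:high])
    return sorted(skipped, key=position.get)
```

With Stoppers on MET exons 13, 14, and 15, an Amplicon joining exons 13 and 15 yields `[14]`, while the wildtype pair of Amplicons (13 to 14 and 14 to 15) yields an empty list.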
  • the long-range DNA sequencing methods provided herein can be used to determine the haplotype phase in which a combination of alleles or a set of single nucleotide polymorphisms (SNPs) is found on the same chromosome.
  • FIG. 3 shows an application of the methods disclosed herein for haplotype phasing. When the distance of two loci is over 600nt, traditional short-read NGS sequencing method is unable to estimate the haplotype phasing.
  • 3 Stoppers (1st Stopper, 2nd Stopper with 2nd UMI Sequence, and 3rd Stopper) are rationally designed to generate two chimeric Amplicons, where the former Amplicon generated from the 1st Stopper and the 2nd Stopper contains information of genetic locus 1, and the latter Amplicon generated from the 2nd Stopper and the 3rd Stopper covers genetic locus 2.
  • these methods are able to confidently pair the former and latter Amplicons by the complementary 2nd UMI Sequences and to accurately quantitate the number of molecules by the number of detected 2nd UMI Sequence.
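The pairing and counting step for haplotype phasing can be sketched as follows. The former Amplicon carries the reverse complement of the 2nd UMI Sequence at its 3' end and the latter carries the 2nd UMI Sequence at its 5' end, so reads are joined on that shared UMI and each distinct UMI is counted once. The function name and read representation are hypothetical, and the sketch assumes error-free UMIs and consistent allele calls among reads sharing a UMI.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def phase_haplotypes(former_reads, latter_reads):
    """Count molecules per (locus 1 allele, locus 2 allele) haplotype.

    former_reads -- (umi_3p, allele) pairs from 1st-2nd Amplicons, where
                    umi_3p is the reverse complement of the 2nd UMI Sequence
    latter_reads -- (umi_5p, allele) pairs from 2nd-3rd Amplicons, where
                    umi_5p is the 2nd UMI Sequence
    """
    locus1 = {reverse_complement(umi): allele for umi, allele in former_reads}
    locus2 = {umi: allele for umi, allele in latter_reads}
    shared_umis = locus1.keys() & locus2.keys()
    # One count per distinct 2nd UMI, i.e. per original molecule.
    return Counter((locus1[umi], locus2[umi]) for umi in shared_umis)
```

Duplicate reads that carry the same 2nd UMI collapse into a single molecule, and reads whose UMI has no partner on the other Amplicon are left unphased.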
  • “essentially free,” in terms of a specified component, is used herein to mean that none of the specified component has been purposefully formulated into a composition and/or is present only as a contaminant or in trace amounts.
  • the total amount of the specified component resulting from any unintended contamination of a composition is therefore well below 0.05%, preferably below 0.01%.
  • Most preferred is a composition in which no amount of the specified component can be detected with standard analytical methods.
  • the complementarity relationships described herein are not necessarily 100% complementarity. In some embodiments, the complementarity relationships described herein are >95%, >90%, >85%, or >80% complementarity. In other words, if a first sequence is defined as being complementary to a second sequence, then the reverse complement of the first sequence may be 100%, >95%, >90%, >85%, or >80% identical to the second sequence. By way of example, if one sequence is defined as being at least 80% complementary to another sequence, then that sequence is at least 80% identical to the reverse complement of the other sequence. “Identity” or “homology” refers to sequence similarity between two nucleic acid molecules.
  • Identity can be determined by comparing a corresponding position in each sequence or by comparing an alignment of the sequences being compared. When a position in the compared sequences is occupied by the same base, then the molecules are identical at that position.
  • a degree of identity between sequences can be a function of the number of matching or homologous positions shared by the sequences. “Unrelated” or “non-complementary” sequences share less than 40% identity, or alternatively less than 25% identity. Sequence identity can refer to a % identity of one sequence to another sequence.
  • sequences when sequences are defined as being complementary, then the reverse complement of one of the sequences will be at least 50%, at least 60%, at least 70%, at least 80%, at least 85%, at least 90%, at least 92%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or 100% identical to the other sequence.
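The definition above (percent complementarity equals the percent identity between the reverse complement of one sequence and the other sequence) can be written out directly. This sketch uses a plain position-by-position comparison of equal-length sequences rather than an alignment; the function names and the 80% default threshold are illustrative assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def percent_complementarity(first, second):
    """Percent identity between reverse_complement(first) and second.

    Ungapped comparison; both sequences must have the same non-zero length.
    """
    if not first or len(first) != len(second):
        raise ValueError("sequences must be non-empty and of equal length")
    candidate = reverse_complement(first)
    matches = sum(x == y for x, y in zip(candidate, second))
    return 100.0 * matches / len(second)


def is_complementary(first, second, threshold=80.0):
    """True when the two sequences are at least `threshold` percent complementary."""
    return percent_complementarity(first, second) >= threshold
```

A 10nt sequence compared with its exact reverse complement scores 100%, and a single mismatch lowers the score to 90%, which still satisfies the 80% example given above. Sequences of unequal length or with insertions and deletions would need an alignment-based tool instead.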
  • Particular examples of algorithms that are suitable for determining percent sequence identity are the BLAST and BLAST 2.0 algorithms, which are described in Altschul et al. (1990) J. Mol. Biol. 215:403-410 and Altschul et al. (1997) Nucl. Acids Res. 25:3389-3402, respectively.
  • BLAST and BLAST 2.0 can be used, for example, to determine percent sequence identity for two or more polynucleotide sequences.
  • Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information.
  • two nucleic acid molecules are complementary if they can hybridize with each other under stringent conditions.
  • stringent conditions are those conditions that allow hybridization between or within one or more nucleic acid strand(s) containing complementary sequence(s), but preclude hybridization of random sequences. Stringent conditions tolerate little, if any, mismatch between a nucleic acid and a target strand. Such conditions are well known to those of ordinary skill in the art, and are preferred for applications requiring high selectivity.
  • stringent conditions may comprise low salt and/or high temperature conditions, such as provided by about 0.02 M to about 0.15 M NaCl at temperatures of about 50°C to about 70°C.
  • the temperature and ionic strength of a desired stringency are determined in part by the length of the particular nucleic acid(s), the length and nucleobase content of the sequence(s), the charge composition of the nucleic acid(s), and the presence or concentration of solvent(s) in a hybridization mixture. It is also understood that these ranges, compositions and conditions for hybridization are mentioned by way of non-limiting examples only, and that the desired stringency for a particular hybridization reaction is often determined empirically by comparison to one or more positive or negative controls. In some embodiments, two nucleic acid molecules are complementary if they can hybridize with each other under low stringency conditions.
  • Non-limiting examples of low stringency include hybridization performed at about 0.15 M to about 0.9 M NaCl at a temperature range of about 20°C to about 50°C.
  • two nucleic acid molecules are non-complementary if they are unable to hybridize with each other under low stringency conditions.
  • the operational temperature may be about 20°C, about 25°C, about 30°C, about 35°C, about 40°C, about 45°C, about 50°C, about 55°C, about 60°C, about 65°C, or about 70°C.
  • the operational buffer conditions may be buffer conditions suitable for PCR, such as, for example, a salinity of 0.2 M sodium.
  • Amplification refers to any in vitro process for increasing the number of copies of a nucleotide sequence or sequences. Nucleic acid amplification results in the incorporation of nucleotides into DNA or RNA. As used herein, one amplification reaction may consist of many rounds of DNA replication. For example, one PCR reaction may consist of 2-100 “cycles” of denaturation and replication.
  • PCR Polymerase chain reaction
  • PCR is a reaction for making multiple copies or replicates of a target nucleic acid flanked by primer binding sites, such reaction comprising one or more repetitions of the following steps: (i) denaturing the target nucleic acid, (ii) annealing primers to the primer binding sites, and (iii) extending the primers by a nucleic acid polymerase in the presence of nucleoside triphosphates.
  • the reaction is cycled through different temperatures optimized for each step in a thermal cycler instrument.
  • the annealing and extension steps may be combined into a single step.
  • Particular temperatures, durations at each step, and rates of change between steps depend on many factors well-known to those of ordinary skill in the art, e.g., exemplified by the references: McPherson et al., editors, PCR: A Practical Approach and PCR2: A Practical Approach (IRL Press, Oxford, 1991 and 1995, respectively).
  • Primer means an oligonucleotide, either natural or synthetic that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3' end along the template so that an extended duplex is formed.
  • the sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide.
  • primers are extended by a DNA polymerase.
  • Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of 6 to 100 nucleotides in length, such as 6 to 70, 10 to 50, 10 to 75, 15 to 60, 15 to 40, 15 to 45, 18 to 30, 18 to 40, 20 to 30, 20 to 40, 21 to 25, 21 to 50, 22 to 45, 25 to 40, and any length between the stated ranges.
  • the primers are usually not more than about 6, 7, 8, 9, 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length.
  • “Incorporating,” as used herein, means becoming part of a nucleic acid polymer.
  • the term “in the absence of exogenous manipulation” as used herein refers to there being modification of a nucleic acid molecule without changing the solution in which the nucleic acid molecule is being modified. In specific embodiments, it occurs in the absence of the hand of man or in the absence of a machine that changes solution conditions, which may also be referred to as buffer conditions. However, changes in temperature may occur during the modification.
  • a “nucleoside” is a base-sugar combination, i.e., a nucleotide lacking a phosphate. It is recognized in the art that there is a certain inter-changeability in usage of the terms nucleoside and nucleotide.
  • the nucleotide deoxyuridine triphosphate, dUTP is a deoxyribonucleoside triphosphate.
  • dUTP deoxyuridine triphosphate
  • dUMP deoxyuridine monophosphate.
  • one may say that one incorporates deoxyuridine into DNA even though that is only a part of the substrate molecule.
  • Nucleotide is a term of art that refers to a base-sugar-phosphate combination. Nucleotides are the monomeric units of nucleic acid polymers, i.e., of DNA and RNA. The term includes ribonucleotide triphosphates, such as rATP, rCTP, rGTP, or rUTP, and deoxyribonucleotide triphosphates, such as dATP, dCTP, dUTP, dGTP, or dTTP.
  • nucleic acid or “polynucleotide” will generally refer to at least one molecule or strand of DNA, RNA, DNA-RNA chimera or a derivative or analog thereof, comprising at least one nucleobase, such as, for example, a naturally occurring purine or pyrimidine base found in DNA (e.g., adenine “A,” guanine “G,” thymine “T” and cytosine “C”) or RNA (e.g. A, G, uracil “U” and C).
  • nucleic acid encompasses the terms “oligonucleotide” and “polynucleotide.” “Oligonucleotide,” as used herein, refers collectively and interchangeably to two terms of art, “oligonucleotide” and “polynucleotide.” Note that although oligonucleotide and polynucleotide are distinct terms of art, there is no exact dividing line between them and they are used interchangeably herein.
  • adapter may also be used interchangeably with the terms “oligonucleotide” and “polynucleotide.”
  • the term “adapter” can indicate a linear adapter (either single stranded or double stranded) or a stem-loop adapter. These definitions generally refer to at least one single-stranded molecule, but in specific embodiments will also encompass at least one additional strand that is partially, substantially, or fully complementary to at least one single-stranded molecule.
  • a nucleic acid may encompass at least one double-stranded molecule or at least one triple-stranded molecule that comprises one or more complementary strand(s) or “complement(s)” of a particular sequence comprising a strand of the molecule.
  • a single stranded nucleic acid may be denoted by the prefix “ss,” a double-stranded nucleic acid by the prefix “ds,” and a triple stranded nucleic acid by the prefix “ts.”
  • a “nucleic acid molecule” refers to any single-stranded or double-stranded nucleic acid molecule including standard canonical bases, hypermodified bases, non-natural bases, or any combination of the bases thereof.
  • the nucleic acid molecule contains the four canonical DNA bases - adenine, cytosine, guanine, and thymine, and/or the four canonical RNA bases - adenine, cytosine, guanine, and uracil. Uracil can be substituted for thymine when the nucleoside contains a 2'-deoxyribose group.
  • the nucleic acid molecule can be transformed from RNA into DNA and from DNA into RNA.
  • mRNA can be converted into complementary DNA (cDNA) using reverse transcriptase, and DNA can be transcribed into RNA using RNA polymerase.
  • a nucleic acid molecule can be of biological or synthetic origin. Examples of nucleic acid molecules include genomic DNA, cDNA, RNA, a DNA/RNA hybrid, amplified DNA, a pre-existing nucleic acid library, etc.
  • a nucleic acid may be obtained from a human sample, such as blood, serum, plasma, cerebrospinal fluid, cheek scrapings, biopsy, semen, urine, feces, saliva, sweat, etc.
  • a nucleic acid molecule may be subjected to various treatments, such as repair treatments and fragmenting treatments. Fragmenting treatments include mechanical, sonic, and hydrodynamic shearing. Repair treatments include nick repair via extension and/or ligation, polishing to create blunt ends, removal of damaged bases, such as deaminated, derivatized, abasic, or crosslinked nucleotides, etc.
  • a nucleic acid molecule of interest may also be subjected to chemical modification (e.g., bisulfite conversion, methylation / demethylation), extension, amplification (e.g., PCR, isothermal, etc.), etc.
  • Nucleic acid(s) that are “complementary” or “complement(s)” are those that are capable of base-pairing according to the standard Watson-Crick, Hoogsteen or reverse Hoogsteen binding complementarity rules.
  • the term “complementary” or “complement(s)” may refer to nucleic acid(s) that are substantially complementary, as may be assessed by the same nucleotide comparison set forth above.
  • substantially complementary may refer to a nucleic acid comprising at least one sequence of consecutive nucleobases, or semiconsecutive nucleobases if one or more nucleobase moieties are not present in the molecule, that is capable of hybridizing to at least one nucleic acid strand or duplex even if not all nucleobases base pair with a counterpart nucleobase.
  • a “substantially complementary” nucleic acid contains at least one sequence in which about 70%, about 71%, about 72%, about 73%, about 74%, about 75%, about 76%, about 77%, about 78%, about 79%, about 80%, about 81%, about
  • nucleobase sequence 97%, about 98%, about 99%, to about 100%, and any range therein, of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization.
  • substantially complementary refers to at least one nucleic acid that may hybridize to at least one nucleic acid strand or duplex in stringent conditions.
  • a “partially complementary” nucleic acid comprises at least one sequence that may hybridize in low stringency conditions to at least one single or double-stranded nucleic acid, or contains at least one sequence in which less than about 70% of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization.
  • non-complementary refers to nucleic acid sequence that lacks the ability to form at least one Watson-Crick base pair through specific hydrogen bonds.
  • degenerate refers to a nucleotide or series of nucleotides wherein the identity can be selected from a variety of choices of nucleotides, as opposed to a defined sequence. In specific embodiments, there can be a choice from two or more different nucleotides. In further specific embodiments, the selection of a nucleotide at one particular position comprises selection from only purines, only pyrimidines, or from nonpairing purines and pyrimidines.
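To illustrate how degenerate positions translate into the number of distinct sequences in a pool, the sketch below expands a pattern written in standard IUPAC nucleotide codes. Whether the UMI sets in this disclosure are built from degenerate positions is not stated here, so this is only an illustration of the counting; the function names are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pool_size(pattern):
    """Number of distinct sequences encoded by a degenerate pattern."""
    size = 1
    for code in pattern:
        size *= len(IUPAC[code])
    return size


def expand(pattern):
    """List every defined sequence encoded by a degenerate pattern."""
    return ["".join(bases) for bases in product(*(IUPAC[c] for c in pattern))]
```

Under this counting, three fully degenerate positions (`NNN`) give 64 sequences and four (`NNNN`) give 256, so at least four such positions would be needed for a pool with at least 100 members; purine-only (`R`) or pyrimidine-only (`Y`) positions contribute a factor of two each.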
  • “subsequence” refers to a sequence of at least 5 contiguous base pairs.
  • mutant DNA Template or “variant DNA Template” refer to the nucleotide sequence of a nucleic acid that harbors a desired allele, such as a single nucleotide polymorphism, to be amplified, identified, or otherwise isolated.
  • wildtype sequence or “background sequence” refers to the nucleotide sequence of a nucleic acid that does not harbor the desired allele. For example, in some instances, the background sequence harbors the wild-type allele whereas the variant sequence harbors the mutant allele.
  • the background sequence and the variant sequence are derived from a common locus in a genome such that the sequences of each may be substantially homologous except for a region harboring the desired allele, nucleotide, or group of nucleotides that varies between the two.
  • Sample means a material obtained or isolated from a fresh or preserved biological sample or synthetically created source that contains nucleic acids of interest.
  • Samples can include at least one cell, fetal cell, cell culture, tissue specimen, blood, serum, plasma, saliva, urine, tear, vaginal secretion, sweat, lymph fluid, cerebrospinal fluid, mucosa secretion, peritoneal fluid, ascites fluid, fecal matter, body exudates, umbilical cord blood, chorionic villi, amniotic fluid, embryonic tissue, multicellular embryo, lysate, extract, solution, or reaction mixture suspected of containing immune nucleic acids of interest.
  • Samples can also include non-human sources, such as non-human primates, rodents and other mammals, other animals, plants, fungi, bacteria, and viruses.
  • secondary structure refers to the set of interactions between base pairs. For example, in a DNA double helix, the two strands of DNA are held together by hydrogen bonds. The secondary structure is responsible for the shape that the nucleic acid assumes. For a single-stranded nucleic acid, the simplest secondary structure is linear. For a linear secondary structure, no two subsequences of a nucleic acid molecule form an intramolecular structure stronger than -2 kcal/mol. As another example for a single-stranded nucleic acid, one portion of the nucleic acid molecule may hybridize with a second portion of the same nucleic acid molecule, thereby forming a hairpin or stem-loop secondary structure. For a non-linear secondary structure, at least two subsequences of a nucleic acid molecule form an intramolecular structure stronger than -2 kcal/mol.
  • a “Target” for a chimeric amplification system described herein can be any single-stranded nucleic acid, such as single-stranded DNA and single-stranded RNA, including double-stranded DNA and RNA rendered single-stranded through heat shock, asymmetric amplification, competitive binding, and other methods standard to the art.
  • a DNA Target may be the product of RNA subjected to reverse transcription.
  • a Target may be a mixture (chimera) of DNA and RNA.
  • a Target comprises artificial nucleic acid analogs.
  • a Target may be naturally occurring (e.g., genomic DNA) or it may be synthetic (e.g., from a genomic library).
  • a “naturally occurring” nucleic acid sequence is a sequence that is present in nucleic acid molecules of organisms or viruses that exist in nature in the absence of human intervention or that is present in any biological sample.
  • a Target is genomic DNA, messenger RNA, ribosomal RNA, cell-free DNA, micro-RNA, pre-micro-RNA, pro-micro-RNA, long non-coding RNA, small RNA, epigenetically modified DNA, epigenetically modified RNA, viral DNA, viral RNA or piwi-RNA.
  • a Target nucleic acid is a nucleic acid that naturally occurs in an organism or virus.
  • a Target nucleic acid is the nucleic acid of a pathogenic organism or virus.
  • a Target of interest is linear, while in other instances, a Target is circular (e.g., plasmid DNA, mitochondrial DNA, or plastid DNA).
  • a Target nucleic acid molecule of interest is about 19 to about 1,000,000 nucleotides (nt) in length. In some instances, the Target is about 19 to about 100, about 100 to about 1000, about 1000 to about 10,000, about 10,000 to about 100,000, or about 100,000 to about 1,000,000 nucleotides in length.
  • the Target is about 20, about 100, about 200, about 300, about 400, about 500, about 600, about 700, about 800, about 900, about 1,000, about 2,000, about 3,000, about 4,000, about 5,000, about 6,000, about 7,000, about 8,000, about 9,000, about 10,000, about 20,000, about 30,000, about 40,000, about 50,000, about 60,000, about 70,000, about 80,000, about 90,000, about 100,000, about 200,000, about 300,000, about 400,000, about 500,000, about 600,000, about 700,000, about 800,000, about 900,000, or about 1,000,000 nucleotides in length.
  • the Target nucleic acid may be provided in the context of a longer nucleic acid (e.g., such as a coding sequence or gene within a chromosome or a chromosome fragment).
  • Biological sample means a material obtained or isolated from a fresh or preserved biological sample or synthetically created source that contains nucleic acids of interest.
  • substantially known refers to having sufficient sequence information in order to permit preparation of a nucleic acid molecule, including its amplification. This will typically be about 100%, although in some embodiments some portion of an adapter sequence is random or degenerate. Thus, in specific embodiments, substantially known refers to about 50% to about 100%, about 60% to about 100%, about 70% to about 100%, about 80% to about 100%, about 90% to about 100%, about 95% to about 100%, about 97% to about 100%, about 98% to about 100%, or about 99% to about 100%.
  • kits comprising Stopper oligonucleotides as disclosed herein.
  • Exemplary kits include qPCR kits, Sanger kits, NGS panels, and nanopore sequencing panels.
  • a “kit” refers to a combination of physical elements.
  • a kit may include, for example, one or more components such as nucleic acid Stoppers, nucleic acid Inhibitors, enzymes, reaction buffers, an instruction sheet, and other elements useful to practice the technology described herein. These physical elements can be arranged in any way suitable for carrying out the invention.
  • The components of the kits may be packaged either in aqueous media or in lyophilized form.
  • the container means of the kits will generally include at least one vial, test tube, flask, bottle, syringe or other container means, into which a component may be placed, and preferably, suitably aliquoted (e.g., aliquoted into the wells of a microtiter plate). Where there is more than one component in the kit, the kit also will generally contain a second, third or other additional container into which the additional components may be separately placed. However, various combinations of components may be comprised in a single vial.
  • the kits of the present invention also will typically include a means for containing the nucleic acids, and any other reagent containers in close confinement for commercial sale. Such containers may include injection or blow molded plastic containers into which the desired vials are retained.
  • a kit will also include instructions for employing the kit components as well as the use of any other reagent not included in the kit. Instructions may include variations that can be implemented.
  • the reaction system was composed of a 1st Stopper, 2nd Stopper, and 3rd Stopper which all have binding regions on the M13 genome (Target nucleic acid) (FIG. 8A).
  • the system was designed such that the T4 Polymerase would extend from the 1st Stopper and switch templates with the help of the 2nd Stopper, generating the 1st-2nd Amplicon. Similarly, the T4 Polymerase would generate the 2nd-3rd Amplicon based on the 2nd and the 3rd Stopper.
  • the 1st-2nd Amplicon and the 2nd-3rd Amplicon, which share the same 2nd UMI Sequence, are defined as 1st-2nd-3rd UMI linked molecules.
  • Summary of NGS data is shown in FIG. 8A.
  • the number of unique molecular pairs of 1st-2nd Amplicons is 3545, close to that of 2nd-3rd Amplicons, which is 3983.
  • the amount and percentage of Amplicon pairs where the 1st-2nd Amplicon shares the same 2nd Identity Sequence with the 2nd-3rd Amplicon were calculated. About 2119 molecules, that is 59.77% of the 1st-2nd Amplicons, could be linked together by the same 2nd UMI Sequences.
  • the UMI family size distribution plots of the 1st-2nd Amplicon and the 2nd-3rd Amplicon are shown in FIG. 8C.
  • The missing attachment of Stopper 2 would lead to the generation of the 1st-3rd WT Amplicon: if, by any chance, the 2nd Stopper falls off after annealing and before polymerase extension, there could be some unwanted side products. No reads were found indicating the existence of the 1st-3rd WT Amplicon (see FIG. 8B), which matches the results shown in FIG. 7C.
  • Experimental demonstration of the feasibility of detecting both the wildtype and variant Target nucleic acid simultaneously is shown in FIGS. 9A-C.
  • the reaction system comprised the 1st Stopper, 2nd Stopper, and 3rd Stopper and the wildtype Target (FIG. 9A).
  • the 1st-2nd Amplicon and the 2nd-3rd Amplicon should be generated.
  • when the 1st Stopper, 2nd Stopper, and 3rd Stopper were mixed with the variant Target, only the 1st-3rd var Amplicon should be generated (FIG. 9B).
  • MATLAB simulations of the number of reads and molecules needed for N Amplicons are shown in FIGS. 10A-C.
  • in FIG. 10A, when the number of Amplicons is 5, only 0.089M reads are necessary to sequence all the Amplicons with a sufficient number of continuous linked UMI families (100X) and good sequencing depth (15X) on each UMI family.
  • 559M reads are required for optimal sequencing coverage as well as a sufficient number of UMI families, which is 6280-fold the reads needed for 5 Amplicons.
  • the growth curve of the need of input molecules is similar to that of the sequencing reads.
  • when the number of Amplicons is 5, only 0.6k input molecules are necessary to obtain sufficient continuous linked UMI families (100X), but when the number of Amplicons increases to 25, 745k molecules are necessary for sufficient continuous linked UMI families, which is 1241-fold the number of molecules needed for 5 Amplicons. Even when comparing the molecules needed for 25 Amplicons with those needed for 20 Amplicons, 5X more input molecules are required to expand from 20 Amplicons to 25 Amplicons.
  • the bioinformatics pipeline for single-end sequencing is shown in FIGS. 11A-B. If the length of the Sequencing-Insert of the library, i.e., the total length of the library minus the length of the Sequencing adaptor, is less than the sequencing length (typically 75nt to 600nt), then it is possible to collect both the UMI left (5' of the read) and the UMI right (3' of the read) within every single-end read. N reads are sequenced, generated by the sequencer, and written to the FASTQ file. Each read should be composed of the UMI left, the Insert Sequence, the UMI right, and the complete or partial sequence of the Sequencing adaptor.
  • FIGS. 12A-B Workflow for paired-end sequencing is shown in FIGS. 12A-B.
  • the length of the Sequencing-Insert of the library, i.e., the total length of the library minus the length of the Sequencing adaptor, is more than the sequencing length (typically 75nt to 600nt).
  • paired-end sequencing is necessary to sequence UMIs at both sides of the library.
  • each R1 should be composed of the UMI left and partial Insert Sequence
  • each R2 is composed of the UMI right and another part of the Insert Sequence.
  • Each UMI set is a UMI left and a UMI right from a pair of R1 and R2.
  • each pair of reads would be clustered based on the result of alignment, that is, two pairs of reads would be considered to have the same Insert Sequence if they could be aligned to the identical sub-region of the Target; each pair of reads would then be further collapsed based on the same UMI set.
  • molecules with different Insert Sequences are linked together based on a shared UMI sequence. For example, as shown in FIG. 12B, if the sequence of the UMI right of a molecule with Insert Sequence 1 is the same as the sequence of the UMI left of another molecule with Insert Sequence 2, then the two molecules originate from the same Target and can be linked together.
  • PCR (37 °C for 35 minutes, 75 °C for 10 minutes, lid is set at 90 °C) by adding 1 µl T4 DNA Polymerase (Thermo Fisher), 10 µl 5X T4 DNA reaction buffer (Thermo Fisher), 1 µl 10 mM dNTP (New England Biolabs) mixture and 33 µl nuclease-free water into the 5 µl annealed mixtures for a final 50 µl reaction volume.
  • 1 µl T4 DNA Polymerase Thermo Fisher
  • 10 µl 5X T4 DNA reaction buffer Thermo Fisher
  • 1 µl 10 mM dNTP New England Biolabs
  • UMI Sequence 2a and the number of UMI Set B (1st UMI Sequence 1b and 2nd UMI Sequence 2b).
  • 3rd Stopper 3 µl M13mp18 ssDNA (2,500 copies/µl) or 2 µl variant Target (3,750 copies/µl), 1.4 µl water or 2.4 µl water with 2.6 µl 5X T4 DNA Polymerase Reaction buffer (Thermo Fisher), and perform the annealing step in an Eppendorf Mastercycler.
  • the thermal cycling protocol is as follows: 1. 95 °C 2 minutes. 2. Cool from 95 °C to 20 °C at the ramp speed of 0.01 °C per 6 seconds. Lid Temperature is set at 105 °C.
  • 1 µl T4 DNA Polymerase Thermo Fisher
  • 10 µl 5X T4 DNA reaction buffer Thermo Fisher
  • 1 µl 10 mM dNTP New England Biolabs
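
The volumes and ramp rate quoted in the protocol excerpts above can be sanity-checked with a few lines of arithmetic. The following is only a sketch: the variable names are illustrative and not part of the disclosure, and the dNTP figure counts only the dNTP added at the extension step.

```python
# Reagent volumes (microlitres) for the extension step, as quoted above.
extension_mix_ul = {
    "T4 DNA Polymerase": 1.0,
    "5X T4 DNA reaction buffer": 10.0,
    "10 mM dNTP": 1.0,
    "nuclease-free water": 33.0,
    "annealed mixture": 5.0,
}
total_ul = sum(extension_mix_ul.values())

# Final dNTP concentration (mM) after dilution into the full reaction volume.
dntp_final_mM = 10.0 * extension_mix_ul["10 mM dNTP"] / total_ul

# Annealing ramp: cool from 95 C to 20 C at 0.01 C per 6 seconds.
ramp_steps = round((95.0 - 20.0) / 0.01)   # number of 0.01 C decrements
ramp_hours = ramp_steps * 6 / 3600         # 6 seconds per decrement

print(f"total volume: {total_ul} ul")
print(f"final dNTP: {dntp_final_mM} mM")
print(f"annealing ramp duration: {ramp_hours} h")
```

Under these numbers the reagent volumes sum to the stated 50 µl, and the cooling ramp alone (excluding the initial 2-minute hold at 95 °C) takes about half a day.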

Abstract

Provided herein are compositions and methods for achieving long-range DNA sequencing through concatenating chimeric amplicon reads. Each chimeric amplicon read is formed by template switching on a pattern of sequence complementarities. Chimeric amplicon reads can be concatenated by sharing the same 5' or 3' UMI sequence with neighboring reads and being mapped back to the original molecule with an accurate molecule count.

Description

DESCRIPTION
LONG-RANGE DNA SEQUENCING THROUGH CONCATENATING CHIMERIC AMPLICON READS
REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims the priority benefit of United States provisional application number 63/340,250, filed May 10, 2022, the entire contents of which are incorporated herein by reference.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH
[0002] This invention was made with government support under Grant No. R01HG008752 awarded by the National Institutes of Health. The government has certain rights in the invention.
REFERENCE TO A SEQUENCE LISTING
[0003] This application contains a Sequence Listing XML, which has been submitted electronically and is hereby incorporated by reference in its entirety. Said Sequence Listing XML, created on April 28, 2023, is named RICEP01()2WO__ST26.xml and is 44,332 bytes in size.
BACKGROUND
1. Field
[0004] The present invention relates generally to the field of molecular biology. More particularly, it concerns compositions and methods for performing long-range, high throughput DNA sequencing.
2. Description of Related Art
[0005] Recent advances in sequencing technology, namely increasing throughput and decreasing cost, have revolutionized genomics research, and the demand for massively parallel sequencing is still increasing. However, current sequencing technologies, such as Next-Generation Sequencing (NGS) and third-generation sequencing (Nanopore Sequencing, PacBio Sequencing), have their limitations.
[0006] Next-Generation Sequencing is an accurate, reliable, and fast sequencing-by-synthesis method with typical read lengths of 75 nucleotides (nt) to 600nt. However, NGS platforms are barely capable of sequencing fragments whose insert size is longer than 800nt. When studying long molecules (>800nt), it is necessary to randomly fragment the molecules before sequencing, after which the reads are assembled in an attempt to map them back to the original molecule (FIG. 4). The sheared fragments from the original molecules form a pool of molecules and are sequenced by the next-generation sequencer. Each sequenced fragment originates from a part of one original molecule. However, because fragmentation leaves no information on a fragment that could indicate its origin, it is hard to know exactly which original molecule each sequenced fragment came from. Thus, NGS is poorly suited to studying long molecules in biological systems.
[0007] Long-read sequencing, such as Nanopore Sequencing or PacBio Sequencing, overcomes the limitation in read length or molecule size. However, its high error rate and low sequencing quality make it challenging to map the sequenced reads back to the genome or to integrate molecule counting tools or methods such as the Unique Molecular Identifier (UMI) into these technologies.
SUMMARY
[0008] Provided herein are compositions comprising: a 1st Stopper oligonucleotide comprising, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a 1st C Sequence with a length between 6nt and 500nt, a 2nd Stopper oligonucleotide comprising, from 5' to 3', a 2nd Identity Sequence with a length between 5nt and 200nt, wherein the 2nd Identity Sequence comprises a 2nd UMI Sequence, a 2nd A Sequence with a length between 3nt and 50nt, a 2nd Loop Sequence with a length between 3nt and 70nt, a 2nd B Sequence with a length between 3nt and 50nt, wherein the 2nd B Sequence is the reverse complement of the 2nd A Sequence, and a 2nd C Sequence with a length between 6nt and 500nt, a 3rd Stopper oligonucleotide comprising, from 5' to 3', a 3rd Identity Sequence with a length between 5nt and 200nt, a 3rd A Sequence with a length between 3nt and 50nt, a 3rd Loop Sequence with a length between 3nt and 70nt, and a 3rd B Sequence with a length between 3nt and 50nt, wherein the 3rd B Sequence is the reverse complement of the 3rd A Sequence, and a 3rd C Sequence with a length between 6nt and 500nt, and a template-dependent polymerase enzyme, reagents, and buffers needed for polymerase function, wherein the 1st C Sequence is complementary to a 1st Binding Region sequence on a potential Target nucleic acid, wherein the 2nd C Sequence is complementary to a 2nd Binding Region sequence on the potential Target nucleic acid, wherein the 2nd B Sequence is complementary to a 2nd Match Region sequence to the 3' of the 2nd Binding Region of the potential Target nucleic acid, wherein the 3rd C Sequence is complementary to a 3rd Binding Region sequence on a potential Target nucleic acid, wherein the 3rd B Sequence is complementary to a 3rd Match Region sequence to the 3' of the 3rd Binding Region of the potential Target nucleic acid, wherein the 2nd UMI Sequence comprises a set of designed DNA sequences, and wherein the 2nd UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
[0009] In some aspects, the compositions further comprise said Target nucleic acid.
[0010] Also provided herein are methods for high throughput DNA sequencing, the methods comprising:
(1) mixing a Sample comprising a Target molecule comprising, from 5' to 3', a 3rd Binding Region, a 3rd Match Region, a 2nd Binding Region, a 2nd Match Region and a 1st Binding Region with a template-dependent polymerase, reagents and buffers needed for enzymatic function, a 1st Stopper oligonucleotide, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a 1st C Sequence with a length between 6nt and 500nt, wherein the 1st C Sequence is reverse complementary to the 1st Binding Region, a 2nd Stopper oligonucleotide, wherein the 2nd Stopper comprises, from 5' to 3', a 2nd Identity Sequence with a length between 5nt and 200nt, wherein the 2nd Identity Sequence comprises a 2nd UMI Sequence, a 2nd A Sequence with a length between 3nt and 50nt, a 2nd Loop Sequence with a length between 3nt and 70nt, and a 2nd B Sequence with a length between 3nt and 50nt, wherein the 2nd B Sequence is the reverse complement of the 2nd A Sequence, the 2nd B Sequence is reverse complementary to the 2nd Match Region, and a 2nd C Sequence with a length between 6nt and 500nt, wherein the 2nd C Sequence is the reverse complement of the 2nd Binding Region, and a 3rd Stopper oligonucleotide, wherein the 3rd Stopper comprises, from 5' to 3', a 3rd Identity Sequence with a length between 5nt and 200nt, a 3rd A Sequence with a length between 3nt and 50nt, a 3rd Loop Sequence with a length between 3nt and 70nt, and a 3rd B Sequence with a length between 3nt and 50nt, wherein the 3rd B Sequence is the reverse complement of the 3rd A Sequence, and the 3rd B Sequence is reverse complementary to the 3rd Match Region, and a 3rd C Sequence with a length between 6nt and 500nt, wherein the 3rd C Sequence is the reverse complement of the 3rd Binding Region, wherein the 2nd UMI Sequence comprises a set of designed DNA sequences, wherein the 2nd UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C);
(2) incubating the mixture at a temperature conducive to polymerase activity, thereby generating Amplicons, wherein each Amplicon comprises, from 5' to 3', a sequence derived from an upstream Stopper, an Insert Sequence that is the reverse complement of a subsequence between the Match Region corresponding to the downstream Stopper and the Binding Region corresponding to an upstream Stopper, a Match-Complement Sequence that is reverse complementary to the Match Region corresponding to the upstream Stopper, and an Identity-Complement Sequence that is the reverse complement of the Identity Sequence of the downstream Stopper;
(3) purifying the Amplicons;
(4) appending a sequencing adaptor or sequencing index to one side or both sides of the purified Amplicons, thereby generating a library; and
(5) subjecting the library to high-throughput sequencing.
[0011] In another embodiment, methods for high throughput DNA sequencing comprise
(1) mixing a Sample comprising a Target molecule comprising, from 5' to 3', a 3rd Binding Region, a 3rd Match Region, a 2nd Binding Region, a 2nd Match Region and a 1st Binding Region with an annealing buffer, a 1st Stopper oligonucleotide, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a 1st C Sequence with a length between 6nt and 500nt, wherein the 1st C Sequence is reverse complementary to the 1st Binding Region, a 2nd Stopper oligonucleotide, wherein the 2nd Stopper comprises, from 5' to 3', a 2nd Identity Sequence with a length between 5nt and 200nt, wherein the 2nd Identity Sequence comprises a 2nd UMI Sequence, a 2nd A Sequence with a length between 3nt and 50nt, a 2nd Loop Sequence with a length between 3nt and 70nt, and a 2nd B Sequence with a length between 3nt and 50nt, wherein the 2nd B Sequence is the reverse complement of the 2nd A Sequence, the 2nd B Sequence is reverse complementary to the 2nd Match Region, and a 2nd C Sequence with a length between 6nt and 500nt, wherein the 2nd C Sequence is the reverse complement of the 2nd Binding Region, and a 3rd Stopper oligonucleotide, wherein the 3rd Stopper comprises, from 5' to 3', a 3rd Identity Sequence with a length between 5nt and 200nt, a 3rd A Sequence with a length between 3nt and 50nt, a 3rd Loop Sequence with a length between 3nt and 70nt, and a 3rd B Sequence with a length between 3nt and 50nt, wherein the 3rd B Sequence is the reverse complement of the 3rd A Sequence, and the 3rd B Sequence is reverse complementary to the 3rd Match Region, and a 3rd C Sequence with a length between 6nt and 500nt, wherein the 3rd C Sequence is the reverse complement of the 3rd Binding Region, wherein the 2nd UMI Sequence comprises a set of designed DNA sequences, wherein the 2nd UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C);
(2) thermal annealing the mixture;
(3) adding a template-dependent polymerase, reagents, and buffers needed for enzymatic function;
(4) incubating the mixture at a temperature conducive to polymerase activity, thereby generating Amplicons, wherein each Amplicon comprises, from 5' to 3', a sequence derived from an upstream Stopper, an Insert Sequence that is the reverse complement of a subsequence between the Match Region corresponding to the downstream Stopper and the Binding Region corresponding to an upstream Stopper, a Match-Complement Sequence that is reverse complementary to the Match Region corresponding to the upstream Stopper, and an Identity-Complement Sequence that is the reverse complement of the Identity Sequence of the downstream Stopper;
(5) purifying the Amplicons;
(6) appending a sequencing adaptor or sequencing index to one side or both sides of the purified Amplicons, thereby generating a library; and
(7) subjecting the library to high-throughput sequencing.
[0012] In some aspects, the methods further comprise:
(7/8) clustering reads with the same Insert Sequence and collapsing reads with both the same 5' and 3' Identity Sequences; and
(8/9) concatenating neighboring reads through the same Identity Sequence, wherein: if the 3' Identity Sequence of read a is the same as the 5' Identity Sequence of read b, then the reads are concatenated by putting read b downstream of read a; and if the 5' Identity Sequence of read a is the same as the 3' Identity Sequence of read b, then the reads are concatenated by putting read b upstream of read a.
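
The clustering, collapsing, and concatenating steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the applicants' pipeline: the `Read` type, the function names, and the toy UMI labels are hypothetical. It assumes that each read has already been assigned to an Insert Sequence cluster (for example by alignment), that the 3' Identity Sequence has been converted to the same strand as the 5' Identity Sequence so the two can be compared directly, and that each Identity Sequence marks a single junction.

```python
from collections import Counter
from typing import NamedTuple

class Read(NamedTuple):
    umi5: str    # 5' Identity Sequence (hypothetical label)
    insert: str  # identifier of the Insert Sequence cluster
    umi3: str    # 3' Identity Sequence, same strand as umi5 (assumption)

def collapse(reads):
    """Collapse reads with the same insert and both Identity Sequences.

    Returns a mapping from unique molecule to its UMI family size."""
    return dict(Counter(reads))

def concatenate(molecules):
    """Chain molecules so that b follows a whenever a.umi3 == b.umi5."""
    by_umi5 = {m.umi5: m for m in molecules}
    downstream_umis = {m.umi3 for m in molecules}
    chains = []
    for m in molecules:
        if m.umi5 in downstream_umis:
            continue  # m has an upstream neighbor, so it does not start a chain
        chain = [m]
        while True:
            nxt = by_umi5.get(chain[-1].umi3)
            if nxt is None or nxt in chain:  # end of chain, or guard against loops
                break
            chain.append(nxt)
        chains.append(chain)
    return chains

reads = [
    Read("U1", "insert_1_2", "U2"),
    Read("U1", "insert_1_2", "U2"),  # duplicate read of the same molecule
    Read("U2", "insert_2_3", "U3"),  # shares U2 with the molecule above
    Read("X2", "insert_2_3", "X3"),  # no linked partner in this toy data
]
families = collapse(reads)
chains = concatenate(list(families))
for chain in chains:
    print(" -> ".join(m.insert for m in chain))
```

Indexing molecules by their 5' Identity Sequence makes each downstream lookup constant time; a production pipeline would additionally need to tolerate sequencing errors within the Identity Sequences.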
[0013] In some aspects, the 1st Identity Sequence comprises a 1st UMI Sequence. In some aspects, the 3rd Identity Sequence comprises a 3rd UMI Sequence. In some aspects, each UMI Sequence comprises a set of designed DNA sequences. In some aspects, the UMI Sequences comprise degenerate nucleotides, selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
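
The degenerate nucleotide codes listed above determine how many distinct sequences a degenerate UMI design can encode. The sketch below tabulates the codes exactly as listed and computes that diversity; the function names and the example patterns are illustrative assumptions, not designs taken from the disclosure.

```python
from itertools import product
from math import prod

# Degenerate codes as listed above, plus the four standard bases.
DEGENERATE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "S": "CG", "W": "AT", "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
}

def diversity(pattern: str) -> int:
    """Number of distinct sequences encoded by a degenerate pattern."""
    return prod(len(DEGENERATE[code]) for code in pattern)

def expand(pattern: str) -> list[str]:
    """Enumerate every concrete sequence encoded by a (short) pattern."""
    return ["".join(bases) for bases in product(*(DEGENERATE[c] for c in pattern))]

print(diversity("NNNNNNNN"))  # eight fully degenerate positions
print(expand("SWA"))          # a short pattern, fully enumerated
```

Enumeration grows exponentially with pattern length, so `expand` is only practical for short patterns; `diversity` is cheap at any length.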
[0014] In some aspects, the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Hamming distance of 2, 3, 4, or 5. In some aspects, the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Levenshtein distance of 2, 3, 4, or 5.
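
The minimum pairwise distance criteria above can be checked for a candidate UMI set as follows. This is a minimal sketch with hypothetical function names and a toy UMI set: Hamming distance counts substitutions between equal-length sequences, while Levenshtein distance also counts insertions and deletions.

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def min_pairwise(seqs, dist) -> int:
    """Smallest distance over all pairs in the set."""
    return min(dist(a, b) for a, b in combinations(seqs, 2))

umis = ["ACGT", "CGTA", "TTTT"]  # toy set, for illustration only
print(min_pairwise(umis, hamming))
print(min_pairwise(umis, levenshtein))
```

The two metrics can disagree: a sequence and its one-base shift differ at every position under Hamming distance, yet are only one deletion and one insertion apart under Levenshtein distance, which matters when sequencing errors include indels.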
[0015] In some aspects, the 1st Stopper further comprises, from 5' to 3', a 1st A Sequence of between 3nt and 50nt, a 1st Loop Sequence of between 3nt and 70nt, and a 1st B Sequence of between 3nt and 50nt, wherein the 1st B Sequence is the reverse complement of the 1st A Sequence. In some aspects, the potential Target or the Target comprises a 1st Match Region to the 3' of the 1st Binding Region and is complementary to the 1st B Sequence.
[0016] In some aspects, the composition or the mixture further comprises at least one Inhibitor DNA oligonucleotide having a subsequence that is reverse complementary to a subsequence of the C Sequence of a corresponding Stopper. In some aspects, the Inhibitor DNA oligonucleotide has a subsequence at the 3' end at least 3nt long that is not reverse complementary to the corresponding Stopper. In some aspects, the subsequence at the 3' end of the Inhibitor DNA oligonucleotide forms at least one hairpin structure. In some aspects, the Inhibitor DNA oligonucleotide comprises non-natural nucleotides. In some aspects, the Inhibitor DNA oligonucleotide has a chemical functionalization at the 3' end that prevents polymerase extension, including but not limited to a 3-carbon spacer, an inverted nucleotide, or a minor groove binder.
[0017] In some aspects, the annealing step comprises a thermocycling program cooling from a temperature not lower than 78 °C to a temperature not higher than 25 °C. In some aspects, the thermocycling program comprises steps that cool from 78 °C to 25 °C, wherein the solution is held at each 5 °C temperature window for at least 5 minutes. In other words, for each 5 minutes of the thermocycling program, the program cools no faster than 5 °C per 5 minutes, i.e., spending >5 minutes between 73 °C and 78 °C, spending >5 minutes between 68 °C and 73 °C, etc.
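
The hold rule above can be checked against a candidate cooling program. The sketch below is one possible reading, under stated assumptions: the program is modeled as a piecewise-linear list of (minute, temperature) points with non-increasing temperature, only complete 5 °C windows below 78 °C are checked (the final 28 °C to 25 °C stretch is ignored), and the function names are hypothetical.

```python
def time_in_window(profile, lo, hi):
    """Minutes a piecewise-linear cooling profile spends between lo and hi (C).

    profile: list of (minute, temp_C) points, temperature non-increasing."""
    total = 0.0
    for (t0, temp0), (t1, temp1) in zip(profile, profile[1:]):
        if temp0 == temp1:                # isothermal hold
            if lo <= temp0 <= hi:
                total += t1 - t0
            continue
        # Part of this segment's temperature span that lies inside the window.
        overlap = min(temp0, hi) - max(temp1, lo)
        if overlap > 0:
            total += (t1 - t0) * overlap / (temp0 - temp1)
    return total

def meets_hold_rule(profile, start=78.0, end=25.0, window=5.0, min_minutes=5.0):
    """True if every complete window between start and end gets enough time."""
    hi = start
    while hi - window >= end:
        if time_in_window(profile, hi - window, hi) < min_minutes:
            return False
        hi -= window
    return True

slow = [(0.0, 95.0), (750.0, 20.0)]  # 0.1 C per minute
fast = [(0.0, 95.0), (15.0, 20.0)]   # 5 C per minute
print(meets_hold_rule(slow), meets_hold_rule(fast))
```

For a purely linear ramp the rule reduces to a cooling rate of at most 1 °C per minute; the `slow` profile corresponds to a ramp of 0.01 °C per 6 seconds.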
[0018] In some aspects, the annealing step comprises subjecting the mixture to a temperature for between 10 minutes and 24 hours. In some aspects, the annealing step comprises subjecting the mixture to room temperature for between 10 minutes and 24 hours.
[0019] In some aspects, the incubation occurs at a temperature between 10 °C and 74 °C for between 1 second and 20 hours. In some aspects, the incubation comprises thermal cycling alternating between a temperature higher than 78 °C for between 1 second and 30 minutes and a temperature not higher than 75 °C for between 1 second and 20 hours. In some aspects, the method comprises at least 6 thermal cycles.
[0020] In some aspects, the incubation occurs at a temperature between about 10 °C and about 74 °C, between about 15 °C and about 74 °C, between about 20 °C and about 74 °C, between about 25 °C and about 74 °C, between about 30 °C and about 74 °C, between about 35 °C and about 74 °C, between about 40 °C and about 74 °C, between about 45 °C and about °C, between about 50 °C and about 74 °C, between about 55 °C and about 74 °C, between about 60 °C and about 74 °C, between about 25 °C and about 65 °C, between about 30 °C and about 65 °C, between about 35 °C and about 65 °C, or any range derivable therein. In some aspects, the incubation occurs at a temperature of about 10 °C, 15 °C, 20 °C, 25 °C, 30 °C, 35 °C, 40 °C, 45 °C, 50 °C, 55 °C, 60 °C, 65 °C, 70 °C, or 74 °C, or any value derivable therein. In some aspects, the incubation occurs for between 1 second and 20 hours, between 30 seconds and 20 hours, between 1 minute and 20 hours, between 2 minutes and 20 hours, between 5 minutes and 20 hours, between 10 minutes and 20 hours, between 30 minutes and 20 hours, between 60 minutes and 20 hours, between 2 hours and 20 hours, between 30 seconds and 2 hours, between 60 seconds and 2 hours, between 2 minutes and 2 hours, between 5 minutes and 2 hours, between 10 minutes and 2 hours, between 30 minutes and 2 hours, or any range derivable therein. In some aspects, the incubation occurs for at least 1 second, 10 seconds, 20 seconds, 30 seconds, 45 seconds, 60 seconds, 2 minutes, 5 minutes, 10 minutes, 20 minutes, 30 minutes, 40 minutes, 50 minutes, or 60 minutes and at most 20 hours, 15 hours, 10 hours, 5 hours, 2 hours, 1 hour, 50 minutes, 40 minutes, 30 minutes, 20 minutes, or 10 minutes. 
In some aspects, the incubation occurs for 1 second, 5 seconds, 10 seconds, 20 seconds, 30 seconds, 40 seconds, 50 seconds, 60 seconds, 2 minutes, 5 minutes, 10 minutes, 20 minutes, 30 minutes, 40 minutes, 50 minutes, 60 minutes, 2 hours, 3 hours, 4 hours, 5 hours, 6 hours, 7 hours, 8 hours, 9 hours, 10 hours, 12 hours, 14 hours, 16 hours, 18 hours, 20 hours, or any value derivable therein.
[0021] In some aspects, the incubation comprises thermal cycling alternating between a temperature higher than 78 °C (e.g., 78 °C, 79 °C, 80 °C, 81 °C, 82 °C, 83 °C, 84 °C, 85 °C, 86 °C, 87 °C, 88 °C, 89 °C, 90 °C, 91 °C, 92 °C, 93 °C, 94 °C, or 95 °C) for between 1 second and 30 minutes (e.g., 1 second, 5 seconds, 10 seconds, 20 seconds, 30 seconds, 40 seconds, 50 seconds, 60 seconds, 2 minutes, 5 minutes, 10 minutes, 20 minutes, 30 minutes, or any value derivable therein) and a temperature not higher than 75 °C (e.g., 75 °C, 74 °C, 73 °C, 72 °C, 71 °C, 70 °C, 69 °C, 68 °C, 67 °C, 66 °C, 65 °C, 64 °C, 63 °C, 62 °C, 61 °C, 60 °C, 59 °C, 58 °C, 57 °C, 56 °C, 55 °C, 54 °C, 53 °C, 52 °C, 51 °C, 50 °C, 49 °C, 48 °C, 47 °C, 46 °C, or 45 °C) for between 1 second and 20 hours (e.g., 1 second, 5 seconds, 10 seconds, 20 seconds, 30 seconds, 40 seconds, 50 seconds, 60 seconds, 2 minutes, 5 minutes, 10 minutes, 20 minutes, 30 minutes, 40 minutes, 50 minutes, 60 minutes, 2 hours, 3 hours, 4 hours, 5 hours, 6 hours, 7 hours, 8 hours, 9 hours, 10 hours, 12 hours, 14 hours, 16 hours, 18 hours, 20 hours, or any value derivable therein). In some aspects, the methods further comprise at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, or at least 20 additional thermal cycles.
[0022] In some aspects, the sequencing adapters and/or sequencing indexes are appended via ligation. In some aspects, the sequencing adapters and/or sequencing indexes are appended via PCR.
[0023] In some aspects, the high-throughput sequencing is performed via sequencing-by-synthesis. In some aspects, the high-throughput sequencing is performed via electrical current measurements in conjunction with a nanopore.
[0024] In some aspects, at a salinity of 0.2 M sodium and a temperature of 60 °C, the C Sequence of the Stopper and the Target has a standard free energy of hybridization ΔG°1 between -7 kcal/mol and -50 kcal/mol, the A Sequence and the B Sequence of each Stopper has a standard free energy of hybridization ΔG°2 between -2 kcal/mol and -50 kcal/mol, and/or the Inhibitor and the C Sequence of the Stopper has a standard free energy of hybridization ΔG°3 between -7 kcal/mol and -50 kcal/mol.
[0025] In some aspects, the polymerase is a thermostable polymerase. In some aspects, the polymerase is not a thermostable polymerase. In some aspects, the polymerase is selected from the group consisting of Taq DNA polymerase, Bst DNA Polymerase, DNA Polymerase I, Hemo Klen Taq, Phusion, Q5, T7 DNA polymerase, and KAPA HiFi.
[0026] In some aspects, the Target is a biological DNA or RNA molecule. In some aspects, the Target is obtained from a sample of cells, a biofluid, or a tissue. In some aspects, the biofluid is selected from the group consisting of blood, urine, saliva, cerebrospinal fluid, interstitial fluid, and synovial fluid. In some aspects, the tissue is a biopsy tissue or a surgically resected tissue. In some aspects, the Target is a complementary DNA molecule generated through the reverse transcription of an RNA sample. In some aspects, the RNA sample is a biological RNA sample. In some aspects, the biological RNA sample is obtained from a human, animal, plant, or environmental specimen. In some aspects, the Target is an amplicon DNA molecule generated through a DNA polymerase acting on a single-stranded DNA template. In some aspects, the amplicon DNA molecule is generated through multiple displacement amplification of a single cell DNA molecule. In some aspects, the Target is a physically, chemically, or enzymatically generated product of a biological DNA molecule. In some aspects, the Target is the product of a fragmentation process. In some aspects, the fragmentation process is ultrasonication or enzymatic fragmentation. In some aspects, the Target is the product of a bisulfite conversion reaction, an APOBEC (“apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like”) reaction, a TAPS (TET-assisted pyridine borane sequencing) reaction, or other chemical or enzymatic reaction in which cytosine nucleotides are selectively converted to uracils based on methylation status.
[0027] In some aspects, the Sample comprising the Target nucleic acid is mixed with at least four Stoppers, wherein each Stopper, except the 1st Stopper, comprises, from 5' to 3', an Identity Sequence with a length between 5nt and 200nt, an A Sequence with a length between 3nt and 50nt, a Loop Sequence with a length between 3nt and 70nt, a B Sequence with a length between 3nt and 50nt, wherein the B Sequence is the reverse complement of the A Sequence, and a C Sequence with a length between 6nt and 500nt. In some aspects, each Identity Sequence comprises a UMI Sequence comprising a set of designed DNA sequences, wherein the UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C). In some aspects, the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a C Sequence of between 6nt and 500nt. In some aspects, the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, a 1st A Sequence with a length between 3nt and 50nt, a Loop Sequence with a length between 3nt and 70nt, a B Sequence with a length between 3nt and 50nt, wherein the B Sequence is the reverse complement of the A Sequence, and a C Sequence with a length between 6nt and 500nt. In some aspects, the 1st Identity Sequence comprises a 1st UMI Sequence comprising a set of designed DNA sequences, wherein the 1st UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C). In some aspects, the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Hamming distance of 2, 3, 4, or 5. In some aspects, the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Levenshtein distance of 2, 3, 4, or 5. In some aspects, the potential Target or the Target comprises a Binding Region that is reverse complementary to a corresponding C Sequence for each of the at least four Stoppers. In some aspects, the potential Target or the Target comprises a Match Region that is reverse complementary to a corresponding B Sequence for each of the at least four Stoppers, wherein each Match Region is located to the 3' of the Binding Region corresponding to the same Stopper. In some aspects, the potential Target or the Target comprises a Match Region that is reverse complementary to a corresponding B Sequence.
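As an illustration of the UMI design constraints described above, the following is a minimal Python sketch, not part of the disclosed method. It expands a degenerate pattern using the nucleotide codes listed in this paragraph and checks the minimum pairwise Hamming distance of a set of defined sequences. The function names and the greedy selection strategy are assumptions introduced here for illustration; the disclosure does not specify how the defined UMI set is chosen.

```python
from itertools import combinations, product

# Degenerate nucleotide codes exactly as listed in the text above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "S": "CG", "W": "AT", "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
}


def expand_degenerate(pattern):
    """Enumerate every concrete sequence encoded by a degenerate pattern."""
    return ["".join(p) for p in product(*(IUPAC[base] for base in pattern))]


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_hamming(seqs):
    """Smallest Hamming distance over all pairs in a set of sequences."""
    return min(hamming(a, b) for a, b in combinations(seqs, 2))


def greedy_umi_set(pattern, min_dist, max_size):
    """Hypothetical helper: greedily keep sequences at least min_dist apart."""
    chosen = []
    for candidate in expand_degenerate(pattern):
        if all(hamming(candidate, c) >= min_dist for c in chosen):
            chosen.append(candidate)
            if len(chosen) == max_size:
                break
    return chosen
```

For example, `greedy_umi_set("NNNNNN", 3, 10)` returns ten 6nt sequences whose minimum pairwise Hamming distance is at least 3. A Levenshtein-distance variant would need a different distance function.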
[0028] Other objects, features and advantages of the present invention will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0029] The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.
[0030] FIG. 1: Reagent components. The three dotted frames denote the 1st Stopper, 2nd Stopper, and 3rd Stopper, respectively. The gray arrows on the right side of the Stoppers and the left side of the Target denote the 3' end of the oligonucleotides. The 1st Stopper has, from 5' to 3', a 1st Identity Sequence and a 1st C Sequence. The 2nd Stopper has, from 5' to 3', a 2nd Identity Sequence, a 2nd A Sequence, a 2nd Loop Sequence, a 2nd B Sequence, and a 2nd C Sequence; the 2nd A Sequence is reverse complementary to the 2nd B Sequence. The 2nd Identity Sequence also has a 2nd UMI Sequence, which is composed of a set of DNA sequences. The 3rd Stopper has, from 5' to 3', a 3rd Identity Sequence, a 3rd A Sequence, a 3rd Loop Sequence, a 3rd B Sequence, and a 3rd C Sequence; the 3rd A Sequence is reverse complementary to the 3rd B Sequence. From 5' to 3', the Target has a 3rd Binding Region, a 2nd Binding Region, a 2nd Match Region, and a 1st Binding Region. The 1st Binding Region, 2nd Binding Region, and 3rd Binding Region are the reverse complements of the 1st C Sequence, the 2nd C Sequence, and the 3rd C Sequence, respectively. The 2nd Match Region and 3rd Match Region are reverse complementary to the 2nd B Sequence and 3rd B Sequence, respectively. The 2nd Loop Sequence and the 3rd Loop Sequence are illustrated as arcs on the right of the hairpins. The system also includes a template-dependent polymerase.
[0031] FIG. 2: Exemplary embodiment of the disclosure. In addition to the 2nd Identity Sequence, the 1st and 3rd Identity Sequences could also have a 1st and 3rd UMI Sequence, respectively, each of which is composed of a set of DNA sequences.
[0032] FIG. 3: Haplotype phasing. When the loci to be phased are separated by more than 600nt, typical short-read NGS, whose sequencing length is usually less than 600nt, is unable to identify the haplotype phase efficiently and accurately. In this long-range DNA sequencing invention, the 1st Stopper, 2nd Stopper, and 3rd Stopper in the system generate two types of chimeric Amplicons. The former Amplicon, generated from the 1st Stopper and the 2nd Stopper, covers genetic locus 1, and the latter Amplicon, generated from the 2nd Stopper and the 3rd Stopper, carries the sequence information of locus 2. The 2nd UMI Sequence on the 2nd Stopper is present in both the former and latter Amplicons. In this way, each former Amplicon and latter Amplicon that share the same 2nd UMI Sequence can be regarded as a pair of Amplicons that come from the same original molecule. In addition, the number of 2nd UMI Sequences can quantitate the number of original molecules. Thus, the haplotype phasing can be identified more efficiently and accurately.
[0033] FIG. 4: Current limitation in mapping full-length sequence using Next-Generation Sequencing. Consider one Target 1 with Regions 1, 3, 4, and 6, and another Target 2 with all of Regions 1 to 6. In standard short-read Next-Generation Sequencing, the Targets are randomly fragmented and sequenced, and all the reads are then assembled in an attempt to map back to the original molecule. Because there is no linkage between the fragments and reads, it is difficult to distinguish the reads from the original wildtype Target from those from the original variant Target, and thus it is challenging to reconstruct the original molecules.
[0034] FIG. 5: Necessity of long-range chimeric amplicon sequencing. Consider one Target 1 with Regions 1, 3, 4, and 6, and another Target 2 with all of Regions 1 to 6. In long-range chimeric amplicon sequencing, each amplicon generated contains two UMIs: the former UMI of the amplicon is matched with the latter UMI of the previous amplicon, and the latter UMI of the amplicon is matched with the former UMI of the next amplicon. In this way, the reads covering amplicons from different regions can be concatenated via UMI assembly. Each molecule’s full sequence can then be easily mapped back to the sequence of the original wildtype or variant Target.
[0035] FIG. 6: Mechanism of the long-range chimeric amplicon concatenation. As shown, the chimeric 1st-2nd Amplicon generated between the 1st Stopper and 2nd Stopper has, from 5' to 3', the sequence of the 1st Stopper, the 1st-2nd Insert Sequence, the 2nd Match-Complement Sequence, and the 2nd Identity-Complement Sequence. The 1st-2nd Amplicon comprises both the 1st UMI Sequence and the 2nd UMI Sequence, called one unique molecular pair on the 1st-2nd Amplicon. The 2nd-3rd Amplicon from the 2nd Stopper and 3rd Stopper has a similar structure: from 5' to 3', it has the sequence of the 2nd Stopper, the 2nd-3rd Insert Sequence, the 3rd Match-Complement Sequence, and the 3rd Identity-Complement Sequence. The 2nd-3rd Amplicon has the unique molecular pair composed of the 2nd UMI Sequence near its 5' end and the 3rd UMI Sequence close to its 3' end. When assembling the unique molecular pairs, if the unique molecular pair of the 1st-2nd Amplicon shares the same 2nd UMI Sequence with the 2nd-3rd Amplicon, then these 2 Amplicons can be identified as amplicons from the same original Target molecule.
[0036] FIGS. 7A-C: Experimental validation of the pairing efficiency. As shown in FIG. 7A, mixture 1 and mixture 2 were pre-annealed. Mixture 1 was composed of 1st Stopper 1a, 2nd Stopper 2a, and the Target 1 nucleic acid. Mixture 2 was composed of 1st Stopper 1b, 2nd Stopper 2b, and the Target 1 nucleic acid. Mixture 1 and mixture 2 were then mixed together and carried through the next steps. The detailed sequences are shown in FIG. 7B. The 1st Stopper 1a and the 1st Stopper 1b have the same structure and sequence except for the 1st UMI Sequence: the 1st Stopper 1a has the 1st UMI Sequence 1a, and the 1st Stopper 1b has the 1st UMI Sequence 1b. Similarly, the 2nd Stopper 2a and the 2nd Stopper 2b have the same structure and sequence except that the 2nd Stopper 2a has the 2nd UMI Sequence 2a, and the 2nd Stopper 2b has the 2nd UMI Sequence 2b. The results are summarized in FIG. 7C. When the pre-annealed mixture 1 and mixture 2 were mixed together, the number of reads from UMI Sequence 1a&2a (UMI Set A) was similar to the number of reads from Identity Sequence 1b&2b (UMI Set B), and no read showed mispairing between UMI Set A and UMI Set B, demonstrating that the pre-annealed product does not dissociate and re-anneal in the downstream process.
[0037] FIGS. 8A-C: Experimental validation of long-range chimeric amplicon concatenation. As shown in FIG. 8A, the reaction system is composed of the 1st Stopper, 2nd Stopper, and 3rd Stopper, which all have binding regions on the Target nucleic acid. The ideal and preferred products are the 1st-2nd Amplicon and the 2nd-3rd Amplicon that share the same 2nd UMI Sequence from one Target. There would be some unwanted side products. As shown in FIG. 8B, the missing attachment of Stopper 2 would lead to the generation of the 1st-3rd WT Amplicon, and the absence of Stopper 1 or Stopper 3 would result in a lone 1st-2nd Amplicon or 2nd-3rd Amplicon. No reads indicating the existence of the 1st-3rd WT Amplicon were found. In addition, the UMI family size of the 1st-2nd Amplicons and the UMI family size of the 2nd-3rd Amplicons are close (FIG. 8A). The percentage of Amplicon pairs in which the 1st-2nd Amplicon shares the same 2nd Identity Sequence with the 2nd-3rd Amplicon was also calculated, and about 59.77% of Amplicons could be linked together. The UMI family size distribution plots of the 1st-2nd Amplicon and the 2nd-3rd Amplicon are shown in FIG. 8C.
[0038] FIGS. 9A-C: Experimental validation of variant skipping detection. As shown in FIG. 9A, the reaction system comprises the 1st Stopper, 2nd Stopper, 3rd Stopper, and the wildtype Target, so the 1st-2nd Amplicon and the 2nd-3rd Amplicon are generated. In FIG. 9B, the 1st Stopper, 2nd Stopper, and 3rd Stopper are mixed with the variant Target, so only the 1st-3rd Var Amplicon is generated. The results are summarized in FIG. 9C: when the wildtype and the variant Target were mixed together, the reads and the UMI family sizes of the 1st-2nd Amplicon and the 2nd-3rd Amplicon were close to those in the wildtype-only group. Likewise, the reads of the 1st-3rd Var Amplicon were close to those in the variant-only group.
[0039] FIGS. 10A-C: MATLAB simulation of the number of reads and molecules needed for N amplicons. As shown in FIG. 10A, the MATLAB simulation shows that the number of reads needed does not grow linearly but rather approximately exponentially as the number of Amplicons goes up. When the number of Amplicons is 5, only 0.089M reads are necessary to sequence all the Amplicons with a sufficient number of continuous linked UMI families (100X) and good sequencing depth (15X) on each UMI family. However, when the number of Amplicons becomes 25, 559M reads are necessary for optimal sequencing coverage, which is 6280-fold the reads needed for 5 Amplicons. Even comparing the reads needed for 25 Amplicons with those for 20 Amplicons, 6.5X more reads are required to expand from 20 Amplicons to 25 Amplicons. Details of the simulation are shown in FIG. 10B. The number of input molecules needed is shown in the right panel of FIG. 10A, with a growth curve similar to that in the left panel. When the number of Amplicons is 5, only 0.6k input molecules are necessary to obtain enough continuous linked UMI families (100X), but when the number of Amplicons becomes 25, 745k input molecules are necessary for enough continuous linked UMI families, which is 1241-fold the molecules needed for 5 Amplicons. Even comparing the molecules needed for 25 Amplicons with those needed for 20 Amplicons, 5X more input molecules are required to expand from 20 Amplicons to 25 Amplicons. Details of the simulation are shown in FIG. 10C.
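The fold changes quoted for FIG. 10A follow directly from the stated read and molecule counts. The short Python check below only reproduces that arithmetic; it is not the MATLAB simulation itself, and the dictionary and function names are illustrative assumptions.

```python
# Values quoted in the text for 5 and 25 Amplicons.
reads_millions = {5: 0.089, 25: 559.0}       # reads needed, in millions
molecules_thousands = {5: 0.6, 25: 745.0}    # input molecules needed, in thousands


def fold_increase(table, low, high):
    """Ratio of the requirement at `high` Amplicons to that at `low` Amplicons."""
    return table[high] / table[low]


read_fold = fold_increase(reads_millions, 5, 25)
molecule_fold = fold_increase(molecules_thousands, 5, 25)

# Truncating to whole numbers gives the 6280-fold and 1241-fold figures in the text.
print(int(read_fold), int(molecule_fold))
```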
[0040] FIGS. 11A-B: Bioinformatics pipeline of single-read sequencing. If the length of the Sequencing-Insert of the library, i.e., the total length of the library minus the length of the Sequencing adaptor, is less than the sequencing length (typically, it ranges from 75nt to 600nt), then single-read sequencing is enough. In FIG. 11A, a cartoon illustrates that short Amplicons are able to obtain full-length coverage of sequence information in the Target region. As shown in FIG. 11B, N reads are sequenced and generated from the sequencer, and each read should be composed of the UMI left, the Insert Sequence, the UMI right, and the Sequencing adaptor sequence when the length of the Sequencing-Insert of the library is less than the sequencing length. Then, the adaptor sequence of each read is trimmed. The trimmed reads are first grouped by the same Insert Sequence, then clustered into different molecules based on the same UMI set (UMI left + UMI right). Finally, molecules with different Insert Sequences are linked together based on the same UMI sequence. For example, as shown in the bottom panel, if the UMI right sequence of a molecule with Insert Sequence 1 is the same as the UMI left sequence of another molecule with Insert Sequence 2, then the two molecules originated from the same Target and can be linked together.
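The single-read steps described for FIG. 11B (trim adaptor, group by Insert Sequence, cluster by UMI set, link by shared UMI) could be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the pipeline actually used: it assumes a fixed UMI length, exact matching of insert and UMI sequences with no sequencing-error correction, and UMIs already expressed in a common orientation so that UMI right and UMI left can be compared directly. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def trim_adaptor(read, adaptor):
    """Remove the adaptor and everything after it; None if the adaptor is absent."""
    idx = read.find(adaptor)
    return read[:idx] if idx != -1 else None


def split_read(trimmed, umi_len):
    """Split a trimmed read into (UMI left, Insert Sequence, UMI right)."""
    return trimmed[:umi_len], trimmed[umi_len:-umi_len], trimmed[-umi_len:]


def umi_families(reads, adaptor, umi_len):
    """Group reads by Insert Sequence, then count reads per UMI set."""
    families = defaultdict(Counter)
    for read in reads:
        trimmed = trim_adaptor(read, adaptor)
        if trimmed is None or len(trimmed) <= 2 * umi_len:
            continue  # unusable read: no adaptor found or no insert left
        left, insert, right = split_read(trimmed, umi_len)
        families[insert][(left, right)] += 1
    return families


def link_molecules(families, upstream_insert, downstream_insert):
    """Pair molecules whose UMI right equals a downstream molecule's UMI left."""
    by_left = defaultdict(list)
    for umi_set in families[downstream_insert]:
        by_left[umi_set[0]].append(umi_set)
    return [(up, down)
            for up in families[upstream_insert]
            for down in by_left.get(up[1], [])]
```

The read count stored for each UMI set corresponds to the UMI family size, and the number of distinct UMI sets per insert approximates the number of original molecules.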
[0041] FIGS. 12A-B: Bioinformatics pipeline of paired-end sequencing. Suppose the length of the Sequencing-Insert of the library, i.e., the total length of the library minus the length of the Sequencing adaptor, is more than the sequencing length (typically, it ranges from 75nt to 600nt). In that case, paired-end sequencing is necessary to sequence the UMIs at both sides of the library. A cartoon shown in FIG. 12A illustrates that sequencing of long Amplicons would only achieve partial coverage of sequence information in the Target region. As shown in FIG. 12B, N pairs of reads are sequenced and generated from the sequencer; each R1 should be composed of the UMI left and a partial sequence of the Insert Sequence, and each R2 is composed of the UMI right and a partial sequence of the Insert Sequence. Each UMI set is a UMI left and a UMI right from a pair of R1 and R2. Each pair of reads is then aligned back to the Target sequence and grouped by the same alignment result. Next, each group with the same alignment result is clustered into different molecules based on the same sequence of UMI set. Finally, molecules with different alignments are linked together based on the same UMI sequence. For example, as shown in the bottom panel, if the UMI right sequence of a molecule with Insert Sequence 1 is the same as the UMI left sequence of another molecule with Insert Sequence 2, then the two molecules originated from the same Target and can be linked together.
[0042] FIG. 13: Two embodiments of the invention. The system may further include at least one Inhibitor oligonucleotide. The vertical line on the left side of each Inhibitor denotes the 3' of Inhibitor is non-extensible because of either the rationally designed sequence or chemical modification. On the left embodiment, the Inhibitor is fully reverse complementary to the corresponding C Sequence of the Stopper. On the right embodiment, the Inhibitor is the reverse complement of partial corresponding C Sequence of the Stopper.
[0043] FIGS. 14A-B: Two embodiments of the 1st Stopper and the Target. In the top embodiment (A), the 1st Stopper could further comprise a 1st A Sequence, a 1st Loop Sequence, and a 1st B Sequence. The 1st A Sequence is reverse complementary to the 1st B Sequence. In the bottom embodiment (B), the Target further comprises a 1st Match Region that is reverse complementary to the 1st B Sequence.
[0044] FIGS. 15A-B: Embodiment of (N-1) concatenated chimeric Amplicons. When the system has N Stoppers, there would be N-1 Amplicons generated: 1st-2nd Amplicon, 2nd-3rd Amplicon, ... (N-1)th-Nth Amplicon. In FIG. 15B, the 1st-2nd Amplicon has a hairpin structure and sequence.
DETAILED DESCRIPTION
[0045] Provided herein are reagents and methods for achieving long-range DNA Sequencing through concatenating chimeric amplicon reads. Each chimeric amplicon read is formed by template switching on a pattern of sequence complementarities. Chimeric amplicon reads can be concatenated by sharing the same 5' or 3' UMI sequence with neighboring reads and being mapped back to the original molecule with an accurate molecule count. These methods address the limitation of sequencing length in Next-Generation Sequencing (NGS) and of sequencing quality in Third-Generation Sequencing. This technology solves the short-read sequencing problem of NGS by extending the total length of the sequence-able molecule. This technology also takes advantage of the high accuracy in NGS to compensate for the low sequencing quality of long-read sequencing, like Third-Generation Sequencing.
I. Concatenating Chimeric Amplicon Reads for Long-range DNA Sequencing
[0046] Provided herein are compositions and methods for achieving long-range DNA Sequencing using short-read NGS (Illumina) sequencing. These methods allow for sequencing molecules with lengths up to 5000nt using typical short-read (75nt to 600nt) sequencing. Unique Molecular Identifiers (UMI) can tag and quantify each original molecule. The comprehensive construction of the whole length of the original molecule is achieved by concatenating the UMI barcoded short reads. This provides high-quality sequence information for the full length of long molecules as well as quantifying the initial molecule number in the NGS short-read platform with high accuracy.
[0047] In standard, typical short-read Next-Generation Sequencing, the Targets are randomly fragmented and sequenced. Then, all the reads are assembled in an attempt to map back to the original molecule. Due to the lack of linkage between each fragment and read, it can be difficult to identify whether the reads were derived from the original wildtype Target or from a variant Target. For example, as illustrated in FIG. 4, if Target 1 has Regions 1, 3, 4, and 6, and Target 2 has Regions 1, 2, 3, 4, 5, and 6, it is challenging to reconstruct the original wildtype and variant Target molecules. As such, the lack of information denoting the origin of each sheared fragment leads to complex alignment of sequenced reads and uncertainty in the mapped results.
[0048] However, in long-range DNA sequencing through concatenating chimeric amplicon reads (FIG. 5), every amplicon generated contains two UMIs, where the upstream UMI of the amplicon is matched with the downstream UMI of the next upstream amplicon, and the downstream UMI of the amplicon is matched with the upstream UMI of the next downstream amplicon (FIG. 5). In this way, the reads covering amplicons from different regions can be concatenated via UMI assembly. Then each molecule’s full sequence can be easily mapped back to the sequence of the original wildtype Target or variant Target. Exemplary detailed mechanisms of concatenating chimeric amplicon reads are shown in FIG. 6.
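The UMI assembly described above can be pictured as following shared UMIs from one amplicon to the next. The Python sketch below is an illustrative assumption, not the disclosed implementation: each amplicon is represented as a hypothetical tuple (region index, upstream UMI, downstream UMI), UMIs are assumed to be already normalized to a common orientation, and each UMI is assumed to tag only one original molecule (no UMI collisions).

```python
from collections import defaultdict


def concatenate_amplicons(amplicons):
    """Chain amplicons whose downstream UMI equals the next region's upstream UMI.

    amplicons: list of (region_index, upstream_umi, downstream_umi) tuples.
    Returns a list of chains, each a list of amplicons in region order.
    """
    by_key = defaultdict(list)
    for amp in amplicons:
        region, upstream_umi, _ = amp
        by_key[(region, upstream_umi)].append(amp)

    # (region, UMI) keys that some amplicon from the previous region points to.
    has_predecessor = {(region + 1, down) for region, _, down in amplicons}

    chains = []
    for amp in amplicons:
        region, upstream_umi, _ = amp
        if (region, upstream_umi) in has_predecessor:
            continue  # this amplicon continues a chain started elsewhere
        chain = [amp]
        while True:
            last_region, _, last_down = chain[-1]
            candidates = by_key.get((last_region + 1, last_down))
            if not candidates:
                break
            chain.append(candidates[0])
        chains.append(chain)
    return chains
```

A chain that spans every region corresponds to one fully reconstructed original molecule, while shorter chains correspond to molecules for which one or more amplicons were not recovered.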
[0049] The reagent systems provided herein comprise a 1st Stopper, 2nd Stopper, and 3rd Stopper, which all have binding regions on the Target nucleic acid. The Target is defined as the nucleic acid molecule to which the Stoppers bind, and which serves as the initial template for the polymerase extension. The 1st Stopper, 2nd Stopper, and 3rd Stopper are rationally designed to bind to the Target sequence and have a pattern of sequence complementarities so that the polymerase can achieve template switching during the extension. Any additional 5' sequences on the Stopper are incorporated into the Amplicon, in addition to the sequences on the Target between the binding region of the former Stopper and the binding region of the latter Stopper.
[0050] The 1st Stopper is a nucleic acid species that comprises, from 5' to 3', a 1st Identity Sequence and a 1st C Sequence (FIG. 1). The 1st Identity Sequence may comprise a 1st UMI Sequence, which comprises a set of designed DNA sequences having at least 100 members (FIG. 2). The 1st Stopper may further comprise a 1st A Sequence, a 1st Loop Sequence, and a 1st B Sequence, where the 1st A Sequence is reverse complementary to the 1st B Sequence (FIG. 14).
[0051] The 2nd Stopper is a nucleic acid species that comprises, from 5' to 3', a 2nd Identity Sequence, a 2nd A Sequence, a 2nd Loop Sequence, a 2nd B Sequence, and a 2nd C Sequence (FIG. 1). The 2nd A Sequence is reverse complementary to the 2nd B Sequence, such that the 2nd A Sequence and the 2nd B Sequence form a hairpin stem, with the 2nd Loop Sequence between the 2nd A Sequence and the 2nd B Sequence. The 2nd Identity Sequence comprises a 2nd UMI Sequence, which is composed of a set of DNA sequences having at least 100 members.
[0052] The 3rd Stopper is a nucleic acid species that comprises, from 5' to 3', a 3rd Identity Sequence, a 3rd A Sequence, a 3rd Loop Sequence, a 3rd B Sequence, and a 3rd C Sequence. The 3rd A Sequence is reverse complementary to the 3rd B Sequence, with the 3rd Loop Sequence between the 3rd A Sequence and the 3rd B Sequence. The 3rd Identity Sequence may comprise a 3rd UMI Sequence, which comprises a set of designed DNA sequences having at least 100 members.
[0053] In some embodiments, the length of a C sequence is between 6nt and 500nt, between 6nt and 400nt, between 6nt and 300nt, between 6nt and 200nt, between 6nt and 100nt, between 6nt and 75nt, between 6nt and 50nt, between 6nt and 25nt, between 6nt and 15nt, between 15nt and 500nt, between 15nt and 400nt, between 15nt and 300nt, between 15nt and 200nt, between 15nt and 100nt, between 15nt and 75nt, between 15nt and 50nt, between 15nt and 25nt, between 30nt and 500nt, between 30nt and 400nt, between 30nt and 300nt, between 30nt and 200nt, between 30nt and 100nt, between 30nt and 75nt, between 30nt and 50nt, or any range derivable therein. In some embodiments, the length of a C sequence is at least 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 20nt, 25nt, 30nt, 40nt, 50nt, 60nt, 70nt, 80nt, 90nt, 100nt, 150nt, 200nt, 250nt, 300nt, 350nt, 400nt, or 450nt and at most 500nt, 450nt, 400nt, 350nt, 300nt, 250nt, 200nt, 150nt, 100nt, 90nt, 80nt, 70nt, 60nt, 50nt, 40nt, 30nt, 25nt, 20nt, or 15nt. In some embodiments, the length of the Binding Region is 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 20nt, 25nt, 30nt, 40nt, 50nt, 60nt, 70nt, 80nt, 90nt, 100nt, 150nt, 200nt, 250nt, 300nt, 350nt, 400nt, 450nt, or 500nt, or any value derivable therein.
[0054] In some embodiments, the length of an A sequence or a B sequence is between 3nt and 50nt, between 3nt and 40nt, between 3nt and 30nt, between 3nt and 25nt, between 3nt and 20nt, between 3nt and 15nt, between 3nt and 10nt, between 3nt and 5nt, between 5nt and 50nt, between 5nt and 40nt, between 5nt and 30nt, between 5nt and 25nt, between 5nt and 20nt, between 5nt and 15nt, between 5nt and 10nt, between 10nt and 50nt, between 10nt and 40nt, between 10nt and 30nt, between 10nt and 25nt, between 10nt and 20nt, between 10nt and 15nt, between 15nt and 50nt, between 15nt and 40nt, between 15nt and 30nt, between 15nt and 25nt, between 15nt and 20nt, between 20nt and 50nt, between 20nt and 40nt, between 20nt and 30nt, between 20nt and 25nt, or any range derivable therein. In some embodiments, the length of an A sequence or B sequence is 3nt, 4nt, 5nt, 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 16nt, 17nt, 18nt, 19nt, 20nt, 21nt, 22nt, 23nt, 24nt, 25nt, 26nt, 27nt, 28nt, 29nt, 30nt, 31nt, 32nt, 33nt, 34nt, 35nt, 36nt, 37nt, 38nt, 39nt, 40nt, 41nt, 42nt, 43nt, 44nt, 45nt, 46nt, 47nt, 48nt, 49nt, or 50nt.
[0055] In some embodiments, the length of an Identity Sequence is between 5nt and 200nt, between 5nt and 150nt, between 5nt and 100nt, between 5nt and 75nt, between 5nt and 50nt, between 5nt and 25nt, between 5nt and 15nt, between 15nt and 200nt, between 15nt and 150nt, between 15nt and 100nt, between 15nt and 75nt, between 15nt and 50nt, between 15nt and 25nt, between 30nt and 200nt, between 30nt and 150nt, between 30nt and 100nt, between 30nt and 75nt, between 30nt and 50nt, or any range derivable therein. In some embodiments, the length of an Identity Sequence is at least 5nt, 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 20nt, 25nt, 30nt, 40nt, 50nt, 60nt, 70nt, 80nt, 90nt, 100nt, 150nt, or 175nt and at most 200nt, 150nt, 100nt, 90nt, 80nt, 70nt, 60nt, 50nt, 40nt, 30nt, 25nt, 20nt, or 15nt. In some embodiments, the length of the First Sequence is 5nt, 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 20nt, 25nt, 30nt, 40nt, 50nt, 60nt, 70nt, 80nt, 90nt, 100nt, 110nt, 120nt, 130nt, 140nt, 150nt, 160nt, 170nt, 180nt, 190nt, 200nt, or any value derivable therein.
[0056] In some embodiments, the length of a Loop Sequence is between 3nt and 70nt, between 3nt and 60nt, between 3nt and 50nt, between 3nt and 40nt, between 3nt and 30nt, between 3nt and 25nt, between 3nt and 20nt, between 3nt and 15nt, between 3nt and 10nt, between 3nt and 5nt, between 5nt and 70nt, between 5nt and 60nt, between 5nt and 50nt, between 5nt and 40nt, between 5nt and 30nt, between 5nt and 25nt, between 5nt and 20nt, between 5nt and 15nt, between 5nt and 10nt, between 10nt and 70nt, between 10nt and 60nt, between 10nt and 50nt, between 10nt and 40nt, between 10nt and 30nt, between 10nt and 25nt, between 10nt and 20nt, between 10nt and 15nt, between 15nt and 70nt, between 15nt and 60nt, between 15nt and 50nt, between 15nt and 40nt, between 15nt and 30nt, between 15nt and 25nt, between 15nt and 20nt, between 20nt and 70nt, between 20nt and 60nt, between 20nt and 50nt, between 20nt and 40nt, between 20nt and 30nt, between 20nt and 25nt, or any range derivable therein. In some embodiments, the length of a Loop Sequence is 3nt, 4nt, 5nt, 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 16nt, 17nt, 18nt, 19nt, 20nt, 21nt, 22nt, 23nt, 24nt, 25nt, 26nt, 27nt, 28nt, 29nt, 30nt, 31nt, 32nt, 33nt, 34nt, 35nt, 36nt, 37nt, 38nt, 39nt, 40nt, 41nt, 42nt, 43nt, 44nt, 45nt, 46nt, 47nt, 48nt, 49nt, 50nt, 51nt, 52nt, 53nt, 54nt, 55nt, 56nt, 57nt, 58nt, 59nt, 60nt, 61nt, 62nt, 63nt, 64nt, 65nt, 66nt, 67nt, 68nt, 69nt, or 70nt.
[0057] In some aspects, a portion of the Loop Sequence is complementary to a region of the Target nucleic acid positioned immediately 3' of the corresponding Match Region. This portion of the Loop Sequence may have a length of 1nt, 2nt, 3nt, 4nt, 5nt, 6nt, 7nt, 8nt, 9nt, 10nt, 11nt, 12nt, 13nt, 14nt, 15nt, 16nt, 17nt, 18nt, 19nt, 20nt, or more.
[0058] The Target is a nucleic acid species that comprises, from 5' to 3', a 3rd Binding Region, a 3rd Match Region, a 2nd Binding Region, a 2nd Match Region, and a 1st Binding Region. The 1st Binding Region, 2nd Binding Region, and the 3rd Binding Region are the reverse complements of the 1st C Sequence, the 2nd C Sequence, and the 3rd C Sequence, respectively. The 2nd Match Region and 3rd Match Region are the reverse complements of the 2nd B Sequence and 3rd B Sequence, respectively. The 3rd A Sequence is rationally designed to have the same sequence as the 3rd Match Region, and the 3rd Match Region is reverse complementary to the 3rd B Sequence of the 3rd Stopper. The 2nd A Sequence is rationally designed to have the same sequence as the 2nd Match Region, and the 2nd Match Region is reverse complementary to the 2nd B Sequence of the 2nd Stopper. The 2nd Loop Sequence and the 3rd Loop Sequence are illustrated as arcs on the right of the hairpins. If the 1st Stopper has a 1st B Sequence, then the Target may further comprise a 1st Match Region that is reverse complementary to the 1st B Sequence. The system also includes a template-dependent polymerase.
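The complementarity relationships between a hairpin Stopper and its Binding Region and Match Region on the Target can be written out as a small Python sketch. This is only an illustration of the stated relationships (A equals the Match Region, B is its reverse complement, C is the reverse complement of the Binding Region) together with the length ranges given in the summary. The function names, the dictionary layout, and the example sequences are assumptions made here and do not come from the disclosure.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Length ranges (in nt) as stated in the summary for each Stopper segment.
LENGTH_RANGES = {"identity": (5, 200), "A": (3, 50), "loop": (3, 70),
                 "B": (3, 50), "C": (6, 500)}


def revcomp(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def design_stopper(identity, binding_region, match_region, loop):
    """Assemble a hairpin Stopper, 5' to 3': Identity, A, Loop, B, C."""
    stopper = {
        "identity": identity,
        "A": match_region,            # A has the same sequence as the Match Region
        "loop": loop,
        "B": revcomp(match_region),   # B is the reverse complement of the Match Region
        "C": revcomp(binding_region), # C is the reverse complement of the Binding Region
    }
    stopper["sequence"] = "".join(
        stopper[part] for part in ("identity", "A", "loop", "B", "C"))
    return stopper


def lengths_in_range(stopper):
    """True if every segment length falls inside the stated range."""
    return all(low <= len(stopper[name]) <= high
               for name, (low, high) in LENGTH_RANGES.items())
```

Because the Match Region lies immediately 3' of the Binding Region on the Target, the B and C segments together equal the reverse complement of the Binding Region followed by the Match Region.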
[0059] The Target, which serves as the initial template for the polymerase extension, is the nucleic acid molecule with which the Stopper hybridizes. The downstream Stopper oligonucleotides are rationally designed to hybridize to the Target sequence and have a pattern of sequence complementarities such that the extending polymerase switches to recognizing the Stopper as the template at the loci where the Target and the Stopper are bound. Any additional 5' sequences on the Stopper are incorporated into the Amplicon, in addition to the sequences on the Target between, and including, the upstream Binding Region and the Match Region. The polymerase extends the 1st Stopper along the Target and then switches templates with the help of the 2nd Stopper, generating a 1st-2nd Amplicon. Similarly, the polymerase generates the 2nd-3rd Amplicon based on the 2nd and the 3rd Stopper. The downstream UMI of the 1st-2nd Amplicon will be the reverse complement of the 2nd UMI Sequence of the 2nd Stopper, which can be used to define the 1st-2nd-3rd UMI linked molecules (FIG. 6).
[0060] The mechanism of induced template switching is as follows. First, the upstream Stopper is bound to the Target. The base pairs formed between the Target and the downstream Stopper at the Match Region will form base stacks with the base pairs formed by the Stopper's hairpin comprising the A Sequence and the B Sequence. When the polymerase recognizes the Target as the template and extends the upstream Stopper to the Target-downstream Stopper binding junction, the polymerase finishes extending on the Match Region but is unable to further extend, due to the crossover geometry present. Because of the complementarity relationship between the Match Region and the B Sequence, the multi-stranded molecule can spontaneously rearrange via branch migration such that the 3' end of the polymerase extension product bridges over the crossover junction and binds to the A Sequence of the downstream Stopper. The polymerase is then able to continue extending, now recognizing the downstream Stopper as the template. The chimeric Amplicon finishes extension, and has its 5' sequence dependent on the upstream Stopper and the Target and its 3' sequence dependent on the downstream Stopper. If the B Sequence of the Stopper is not complementary to the Match Region, then the spontaneous rearrangement is not possible, and polymerase extension stalls at the locus where the downstream Stopper binds the Target. This is described in further detail in U.S. Provisional Application No. 63/182,154, filed April 30, 2021, which is incorporated by reference herein in its entirety.
[0061] In one embodiment, the chimeric 1st-2nd Amplicon generated between the 1st Stopper and the 2nd Stopper, from 5' to 3', has the sequence of the 1st Stopper, the 1st-2nd Insert Sequence, the 2nd Match-Complement Sequence, and the 2nd Identity-Complement Sequence (FIG. 6). The 1st-2nd Amplicon comprises both the 1st UMI Sequence and the reverse complement of the 2nd UMI Sequence, called one unique molecular pair on the 1st-2nd Amplicon. The 2nd-3rd Amplicon from the 2nd Stopper and 3rd Stopper has a similar structure: from 5' to 3', it has the sequence of the 2nd Stopper, the 2nd-3rd Insert Sequence, the 3rd Match-Complement Sequence, and the 3rd Identity-Complement Sequence. The 2nd-3rd Amplicon has the unique molecular pair composed of the 2nd UMI Sequence near its 5' end and the reverse complement of the 3rd UMI Sequence close to its 3' end. In this way, the 1st-2nd Amplicon comprises the 1st UMI Sequence at its 5' end, and the reverse complement of the 2nd UMI Sequence at its 3' end, which are defined as a unique molecular pair. The 2nd-3rd Amplicon also comprises the unique molecular pair: the 2nd UMI Sequence at its 5' end, and the reverse complement of the 3rd UMI Sequence at its 3' end. When assembling the unique molecular pair, if the unique molecular pair of the 1st-2nd Amplicon has the reverse complement of the 2nd UMI Sequence, which is present on the 2nd-3rd Amplicon, then these 2 Amplicons would be identified as neighboring amplicons from the same original Target molecule (FIG. 6).
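The assembly of neighboring Amplicons by their unique molecular pairs can be sketched in a few lines of Python. This is a simplified illustration, not the disclosed analysis pipeline: it assumes the UMI pairs have already been extracted from error-free reads, that no two molecules share a 2nd UMI Sequence, and the function name `link_neighbors` and all UMI strings are hypothetical.

```python
# Minimal sketch under stated assumptions; UMI strings are made-up examples.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def link_neighbors(first_second, second_third):
    """Link 1st-2nd and 2nd-3rd Amplicons that share the same 2nd UMI Sequence.

    first_second: (1st UMI, reverse complement of 2nd UMI) pairs as read 5' to 3'.
    second_third: (2nd UMI, reverse complement of 3rd UMI) pairs as read 5' to 3'.
    Returns a list of (1st UMI, 2nd UMI, 3rd UMI) tuples.
    """
    downstream_by_umi2 = {umi2: rc_umi3 for umi2, rc_umi3 in second_third}
    linked = []
    for umi1, rc_umi2 in first_second:
        # Recover the 2nd UMI as it appears on the 2nd Stopper.
        umi2 = reverse_complement(rc_umi2)
        if umi2 in downstream_by_umi2:
            umi3 = reverse_complement(downstream_by_umi2[umi2])
            linked.append((umi1, umi2, umi3))
    return linked


first_second_pairs = [("AAAC", "GTCA")]
second_third_pairs = [("TGAC", "CCAT"), ("GGGG", "ACGT")]

print(link_neighbors(first_second_pairs, second_third_pairs))
# [('AAAC', 'TGAC', 'ATGG')]
```

Here the 1st-2nd Amplicon carries `GTCA` at its 3' end, whose reverse complement `TGAC` matches the 5' UMI of one 2nd-3rd Amplicon, so the two reads are assigned to the same original Target molecule.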
[0062] In some embodiments, the system further comprises at least one Inhibitor oligonucleotide (FIG. 13). The vertical line on the left side of each Inhibitor indicates that the 3' of the Inhibitor is non-extensible because of either a rationally designed sequence or chemical modification. The Inhibitor may be fully reverse complementary to the corresponding C Sequence of the Stopper. The Inhibitor may be the reverse complement of part of the corresponding C Sequence of the Stopper.
[0063] In some embodiments, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least 12, at least 14, at least 16, at least 18, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, or at least 50 different Stoppers can be rationally designed along a Target. When the system has N Stoppers, there would be N-1 Amplicons generated: 1st-2nd Amplicon, 2nd-3rd Amplicon, 3rd-4th Amplicon, ... (N-1)th-Nth Amplicon (FIG. 15).
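As a small illustration of the N Stoppers to N-1 Amplicons relationship, the following Python sketch enumerates the expected Amplicon labels for a given number of Stoppers. The function names `ordinal` and `amplicon_names` are assumptions introduced here for illustration.

```python
def ordinal(n: int) -> str:
    """Format an integer as an English ordinal, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def amplicon_names(n_stoppers: int) -> list:
    """List the N-1 neighboring-pair Amplicons expected from N Stoppers."""
    if n_stoppers < 2:
        raise ValueError("at least two Stoppers are needed to form an Amplicon")
    return [
        f"{ordinal(i)}-{ordinal(i + 1)} Amplicon"
        for i in range(1, n_stoppers)
    ]


print(amplicon_names(4))
# ['1st-2nd Amplicon', '2nd-3rd Amplicon', '3rd-4th Amplicon']
```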
II. Alternative RNA Splicing
[0064] In some embodiments, this long-range DNA sequencing method could be used to profile RNA isoforms from alternative splicing. Each Stopper would be rationally designed to target specific exons. If an exon of a transcript is skipped or a subregion of a transcript is alternatively spliced, there would be one or more Stoppers failing to bind to the Target. For example, MET exon 14 skipping is a known marker of NSCLC (non-small cell lung cancer). The technology provided herein could be used to detect MET exon 14 skipping by rationally designing Stoppers to target exons 13, 14, and 15 of MET. If exon 14 of MET is skipped, then the middle Stopper would fail to bind to the Target, and a variant Amplicon generated from the first and last Stoppers would be detected and sequenced.
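The MET exon 14 example can be expressed as a simple classification of sequenced Amplicons by the pair of Stoppers that produced them. The Python sketch below is hypothetical: it assumes each read has already been assigned an (upstream Stopper, downstream Stopper) pair, and the labels `exon13`, `exon14`, and `exon15` as well as the function names are placeholders chosen for this example.

```python
from collections import Counter

# Stopper pairs expected when exon 14 is present in the transcript.
INCLUDED_PAIRS = {("exon13", "exon14"), ("exon14", "exon15")}
# Stopper pair expected when the middle Stopper fails to bind.
SKIPPED_PAIR = ("exon13", "exon15")


def classify_amplicon(stopper_pair):
    """Classify one Amplicon by its (upstream, downstream) Stopper labels."""
    if stopper_pair in INCLUDED_PAIRS:
        return "exon 14 included"
    if stopper_pair == SKIPPED_PAIR:
        return "exon 14 skipped"
    return "unexpected"


def summarize(stopper_pairs):
    """Count Amplicons in each class for a collection of reads."""
    return Counter(classify_amplicon(pair) for pair in stopper_pairs)


reads = [
    ("exon13", "exon14"),
    ("exon14", "exon15"),
    ("exon13", "exon15"),
    ("exon13", "exon15"),
]
print(summarize(reads))
# Counter({'exon 14 included': 2, 'exon 14 skipped': 2})
```

How reads are assigned to Stopper pairs (for example, by matching Stopper-derived sequence at each end) is outside the scope of this sketch.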
III. Haplotype Phasing
[0065] In some embodiments, the long-range DNA sequencing methods provided herein can be used to determine the haplotype phase in which a combination of alleles or a set of single nucleotide polymorphisms (SNPs) is found on the same chromosome. FIG. 3 shows an application of the methods disclosed herein for haplotype phasing. When the distance between two loci is over 600nt, traditional short-read NGS methods are unable to estimate the haplotype phasing. Three Stoppers (1st Stopper, 2nd Stopper with 2nd UMI Sequence, and 3rd Stopper) are rationally designed to generate two chimeric Amplicons, where the former Amplicon generated from the 1st Stopper and the 2nd Stopper contains information of genetic locus 1, and the latter Amplicon generated from the 2nd Stopper and the 3rd Stopper covers genetic locus 2. Compared with standard NGS methods, these methods are able to confidently pair the former and latter Amplicons by the complementary 2nd UMI Sequences and to accurately quantitate the number of molecules by the number of detected 2nd UMI Sequences.
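A minimal Python sketch of this pairing and counting step follows. It is an illustration under stated assumptions, not the disclosed pipeline: allele calls and 2nd UMI Sequences are assumed to have been extracted already, the UMI from the 1st-2nd Amplicon is assumed to have been reverse complemented into the same orientation as on the 2nd-3rd Amplicon, reads are assumed error-free, and the function name `phase_by_umi` and all data are hypothetical.

```python
from collections import Counter


def phase_by_umi(locus1_reads, locus2_reads):
    """Join allele calls at two loci through the shared 2nd UMI Sequence.

    locus1_reads: (2nd UMI, allele at locus 1) from 1st-2nd Amplicons.
    locus2_reads: (2nd UMI, allele at locus 2) from 2nd-3rd Amplicons.
    Returns a Counter of (allele1, allele2) haplotypes, counted once per UMI,
    so that PCR duplicates of one molecule are not counted repeatedly.
    """
    allele1_by_umi = dict(locus1_reads)
    allele2_by_umi = dict(locus2_reads)
    shared_umis = allele1_by_umi.keys() & allele2_by_umi.keys()
    return Counter((allele1_by_umi[u], allele2_by_umi[u]) for u in shared_umis)


locus1 = [("ACGT", "A"), ("ACGT", "A"), ("TTGA", "G"), ("CCCC", "A")]
locus2 = [("ACGT", "C"), ("TTGA", "T"), ("TTGA", "T")]

haplotypes = phase_by_umi(locus1, locus2)
print(haplotypes[("A", "C")], haplotypes[("G", "T")])  # 1 1
print(sum(haplotypes.values()))  # 2 phased molecules
```

The UMI `CCCC` appears only at locus 1, so that molecule cannot be phased and is left out of the haplotype counts.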
IV. Definitions
[0066] As used herein, “essentially free,” in terms of a specified component, means that none of the specified component has been purposefully formulated into a composition and/or is present only as a contaminant or in trace amounts. The total amount of the specified component resulting from any unintended contamination of a composition is therefore well below 0.05%, preferably below 0.01%. Most preferred is a composition in which no amount of the specified component can be detected with standard analytical methods.
[0067] As used herein in the specification, “a” or “an” may mean one or more. As used herein in the claim(s), when used in conjunction with the word “comprising,” the words “a” or “an” may mean one or more than one.
[0068] The use of the term “or” in the claims is used to mean “and/or” unless explicitly indicated to refer to alternatives only or the alternatives are mutually exclusive, although the disclosure supports a definition that refers to only alternatives and “and/or.” As used herein “another” may mean at least a second or more.
[0069] Throughout this application, the term “about” is used to indicate that a value includes the inherent variation of error for the device, the inherent variation in the method being employed to determine the value, the variation that exists among the study subjects, or a value that is within 10% of a stated value.
[0070] The complementarity relationships described herein are not necessarily 100% complementarity. In some embodiments, the complementarity relationships described herein are >95%, >90%, >85%, or >80% complementarity. In other words, if a first sequence is defined as being complementary to a second sequence, then the reverse complement of the first sequence may be 100%, >95%, >90%, >85%, or >80% identical to the second sequence. By way of example, if one sequence is defined as being at least 80% complementary to another sequence, then that sequence is at least 80% identical to the reverse complement of the other sequence.
[0071] “Identity” or “homology” refers to sequence similarity between two nucleic acid molecules. Identity can be determined by comparing a corresponding position in each sequence or by comparing an alignment of the sequences being compared. When a position in the compared sequences is occupied by the same base, then the molecules are identical at that position. A degree of identity between sequences can be a function of the number of matching or homologous positions shared by the sequences. “Unrelated” or “non-complementary” sequences share less than 40% identity, or alternatively less than 25% identity. Sequence identity can refer to a % identity of one sequence to another sequence. As a practical matter, when sequences are defined as being complementary, then the reverse complement of one of the sequences will be at least 50%, at least 60%, at least 70%, at least 80%, at least 85%, at least 90%, at least 92%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or 100% identical to the other sequence. One particular example of algorithms that are suitable for determining percent sequence identity is the BLAST and BLAST 2.0 algorithms, which are described in Altschul et al. (1997) Nucl. Acids Res. 25:3389-3402 and Altschul et al. (1990) J. Mol. Biol. 215:403-410, respectively. BLAST and BLAST 2.0 can be used, for example, to determine percent sequence identity for two or more polynucleotide sequences. Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information.
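The relationship between percent complementarity and percent identity can be made concrete with a short Python sketch. This is a simplified, ungapped, position-by-position comparison of equal-length sequences, not the BLAST algorithm; the function names and example sequences are assumptions introduced for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def percent_identity(a: str, b: str) -> float:
    """Percent of positions with the same base (equal-length, ungapped)."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def percent_complementarity(first: str, second: str) -> float:
    """Percent identity between the reverse complement of `first` and `second`."""
    return percent_identity(reverse_complement(first), second)


first = "ACGTACGTAC"
perfect = "GTACGTACGT"       # exact reverse complement of `first`
one_mismatch = "GTACGTACGA"  # differs from `perfect` at the last position

print(percent_complementarity(first, perfect))       # 100.0
print(percent_complementarity(first, one_mismatch))  # 90.0
```

Under this simplified measure, the second pair would satisfy a threshold of at least 80% or at least 90% complementarity, but not a threshold of greater than 90%.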
[0072] In some embodiments, two nucleic acid molecules are complementary if they can hybridize with each other under stringent conditions. As used herein “stringent conditions” are those conditions that allow hybridization between or within one or more nucleic acid strand(s) containing complementary sequence(s), but preclude hybridization of random sequences. Stringent conditions tolerate little, if any, mismatch between a nucleic acid and a target strand. Such conditions are well known to those of ordinary skill in the art, and are preferred for applications requiring high selectivity. By way of example, stringent conditions may comprise low salt and/or high temperature conditions, such as provided by about 0.02 M to about 0.15 M NaCl at temperatures of about 50°C to about 70°C. It is understood that the temperature and ionic strength of a desired stringency are determined in part by the length of the particular nucleic acid(s), the length and nucleobase content of the sequence(s), the charge composition of the nucleic acid(s), and the presence or concentration of solvent(s) in a hybridization mixture. It is also understood that these ranges, compositions and conditions for hybridization are mentioned by way of non-limiting examples only, and that the desired stringency for a particular hybridization reaction is often determined empirically by comparison to one or more positive or negative controls. In some embodiments, two nucleic acid molecules are complementary if they can hybridize with each other under low stringency conditions. Non-limiting examples of low stringency include hybridization performed at about 0.15 M to about 0.9 M NaCl at a temperature range of about 20°C to about 50°C. Of course, it is within the skill of one in the art to further modify the low or high stringency conditions to suit a particular application. In some embodiments, two nucleic acid molecules are non-complementary if they are unable to hybridize with each other under low stringency conditions.
[0073] Several researchers have used sets that satisfy the conditions imposed by a Hamming Code [Hamming, R. W., Bell System Technical Journal v. 29, no. 2, pp. 147-160, April 1950; Hamady et al. (2008), Nature Methods v. 5 no. 3, pp. 235-237; Lefrancois et al. (2009), BMC Genomics v. 10 no. 37 pp. 1-18]. Others have used sets that satisfy more complex conditions than a Hamming Code but share the similar guarantee of a certain minimal pairwise Hamming distance [Fierer et al. (2008), PNAS v. 105 no. 46 pp. 17994-17999; Krishnan et al. (2011), Electronics Letters v. 47 no. 4 pp. 236-237]. One previously described approach for barcode design relies on generating oligonucleotides that are a certain Hamming distance (number of mismatches) apart and that follow an “error correcting” scheme such as a Hamming Code (see, e.g., Hamady et al. Nat. Methods 5: 235-237, and en.wikipedia.org/wiki/Hamming_code). As an alternative to Hamming-distance based barcodes, others have selected sets of barcodes which satisfy a minimum pairwise edit distance. Sets of such barcodes can work with insertion, deletion or substitution errors in the read of a barcode sequence.
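The difference between Hamming distance and edit distance for barcode sets can be illustrated with a short Python sketch. The barcode strings are arbitrary examples, and the function names are assumptions introduced here; the edit distance shown is the standard Levenshtein distance computed by dynamic programming.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion from a
                current[j - 1] + 1,                    # insertion into a
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def min_pairwise_distance(barcodes, distance) -> int:
    """Smallest distance over all pairs in a barcode set."""
    return min(distance(a, b) for a, b in combinations(barcodes, 2))


print(hamming_distance("ACGT", "CGTA"))  # 4
print(edit_distance("ACGT", "CGTA"))     # 2
print(min_pairwise_distance(["AAAA", "CCCC", "ACGT"], hamming_distance))  # 3
```

The pair `ACGT` and `CGTA` differs at every position, yet one deletion and one insertion convert one into the other. A set chosen only for minimum Hamming distance may therefore be less robust to insertion and deletion errors than the Hamming distance alone suggests.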
[0074] Methods for the calculation of standard free energy of hybridization (ΔG°) values from sequences are known in the art. There exist different conventions for calculating the ΔG° of different region interactions. WO2015/179339, which is incorporated herein by reference in its entirety, provides exemplary energy calculations based on the nearest neighbor model. The calculations of ΔG°₁, ΔG°₂, ΔG°₃, and ΔG°₄ from the Stopper sequence, Target oligonucleotide sequence, and/or Inhibitor oligonucleotide sequence, at operational temperature and operational buffer conditions, are known to those skilled in the art. The operational temperature may be about 20°C, about 25°C, about 30°C, about 35°C, about 40°C, about 45°C, about 50°C, about 55°C, about 60°C, about 65°C, or about 70°C. The operational buffer conditions may be buffer conditions suitable for PCR, such as, for example, a salinity of 0.2 M sodium.
[0075] “Amplification,” as used herein, refers to any in vitro process for increasing the number of copies of a nucleotide sequence or sequences. Nucleic acid amplification results in the incorporation of nucleotides into DNA or RNA. As used herein, one amplification reaction may consist of many rounds of DNA replication. For example, one PCR reaction may consist of 2-100 “cycles” of denaturation and replication.
[0076] “Polymerase chain reaction,” or “PCR,” means a reaction for the in vitro amplification of specific DNA sequences by the simultaneous primer extension of complementary strands of DNA. In other words, PCR is a reaction for making multiple copies or replicates of a target nucleic acid flanked by primer binding sites, such reaction comprising one or more repetitions of the following steps: (i) denaturing the target nucleic acid, (ii) annealing primers to the primer binding sites, and (iii) extending the primers by a nucleic acid polymerase in the presence of nucleoside triphosphates. Usually, the reaction is cycled through different temperatures optimized for each step in a thermal cycler instrument. In some cases, the annealing and extension steps may be combined into a single step. Particular temperatures, durations at each step, and rates of change between steps depend on many factors well-known to those of ordinary skill in the art, e.g., exemplified by the references: McPherson et al., editors, PCR: A Practical Approach and PCR2: A Practical Approach (IRL Press, Oxford, 1991 and 1995, respectively).
[0077] “Primer” means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3' end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of 6 to 100 nucleotides in length, such as 6 to 70, 10 to 50, 10 to 75, 15 to 60, 15 to 40, 15 to 45, 18 to 30, 18 to 40, 20 to 30, 20 to 40, 21 to 25, 21 to 50, 22 to 45, 25 to 40, and any length between the stated ranges. In some embodiments, the primers are usually not more than about 6, 7, 8, 9, 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length.
[0078] “Incorporating,” as used herein, means becoming part of a nucleic acid polymer.
[0079] The term “in the absence of exogenous manipulation” as used herein refers to there being modification of a nucleic acid molecule without changing the solution in which the nucleic acid molecule is being modified. In specific embodiments, it occurs in the absence of the hand of man or in the absence of a machine that changes solution conditions, which may also be referred to as buffer conditions. However, changes in temperature may occur during the modification.
[0080] A “nucleoside” is a base-sugar combination, i.e., a nucleotide lacking a phosphate. It is recognized in the art that there is a certain inter-changeability in usage of the terms nucleoside and nucleotide. For example, the nucleotide deoxyuridine triphosphate, dUTP, is a deoxyribonucleoside triphosphate. After incorporation into DNA, it serves as a DNA monomer, formally being deoxyuridylate, i.e., dUMP or deoxyuridine monophosphate. One may say that one incorporates dUTP into DNA even though there is no dUTP moiety in the resultant DNA. Similarly, one may say that one incorporates deoxyuridine into DNA even though that is only a part of the substrate molecule.
[0081] “Nucleotide,” as used herein, is a term of art that refers to a base-sugar-phosphate combination. Nucleotides are the monomeric units of nucleic acid polymers, i.e., of DNA and RNA. The term includes ribonucleotide triphosphates, such as rATP, rCTP, rGTP, or rUTP, and deoxyribonucleotide triphosphates, such as dATP, dCTP, dUTP, dGTP, or dTTP.
[0082] The term “nucleic acid” or “polynucleotide” will generally refer to at least one molecule or strand of DNA, RNA, DNA-RNA chimera or a derivative or analog thereof, comprising at least one nucleobase, such as, for example, a naturally occurring purine or pyrimidine base found in DNA (e.g., adenine “A,” guanine “G,” thymine “T” and cytosine “C”) or RNA (e.g., A, G, uracil “U” and C). The term “nucleic acid” encompasses the terms “oligonucleotide” and “polynucleotide.” “Oligonucleotide,” as used herein, refers collectively and interchangeably to two terms of art, “oligonucleotide” and “polynucleotide.” Note that although oligonucleotide and polynucleotide are distinct terms of art, there is no exact dividing line between them and they are used interchangeably herein. The term “adapter” may also be used interchangeably with the terms “oligonucleotide” and “polynucleotide.” In addition, the term “adapter” can indicate a linear adapter (either single stranded or double stranded) or a stem-loop adapter. These definitions generally refer to at least one single-stranded molecule, but in specific embodiments will also encompass at least one additional strand that is partially, substantially, or fully complementary to at least one single-stranded molecule. Thus, a nucleic acid may encompass at least one double-stranded molecule or at least one triple-stranded molecule that comprises one or more complementary strand(s) or “complement(s)” of a particular sequence comprising a strand of the molecule. As used herein, a single stranded nucleic acid may be denoted by the prefix “ss,” a double-stranded nucleic acid by the prefix “ds,” and a triple stranded nucleic acid by the prefix “ts.”
[0083] A “nucleic acid molecule” refers to any single-stranded or double-stranded nucleic acid molecule including standard canonical bases, hypermodified bases, non-natural bases, or any combination of the bases thereof. For example and without limitation, the nucleic acid molecule contains the four canonical DNA bases - adenine, cytosine, guanine, and thymine, and/or the four canonical RNA bases - adenine, cytosine, guanine, and uracil. Uracil can be substituted for thymine when the nucleoside contains a 2'-deoxyribose group. The nucleic acid molecule can be transformed from RNA into DNA and from DNA into RNA. For example, and without limitation, mRNA can be converted into complementary DNA (cDNA) using reverse transcriptase and DNA can be transcribed into RNA using RNA polymerase. A nucleic acid molecule can be of biological or synthetic origin. Examples of nucleic acid molecules include genomic DNA, cDNA, RNA, a DNA/RNA hybrid, amplified DNA, a pre-existing nucleic acid library, etc. A nucleic acid may be obtained from a human sample, such as blood, serum, plasma, cerebrospinal fluid, cheek scrapings, biopsy, semen, urine, feces, saliva, sweat, etc. A nucleic acid molecule may be subjected to various treatments, such as repair treatments and fragmenting treatments. Fragmenting treatments include mechanical, sonic, and hydrodynamic shearing. Repair treatments include nick repair via extension and/or ligation, polishing to create blunt ends, removal of damaged bases, such as deaminated, derivatized, abasic, or crosslinked nucleotides, etc. A nucleic acid molecule of interest may also be subjected to chemical modification (e.g., bisulfite conversion, methylation / demethylation), extension, amplification (e.g., PCR, isothermal, etc.), etc.
[0084] Nucleic acid(s) that are “complementary” or “complement(s)” are those that are capable of base-pairing according to the standard Watson-Crick, Hoogsteen or reverse Hoogsteen binding complementarity rules. As used herein, the term “complementary” or “complement(s)” may refer to nucleic acid(s) that are substantially complementary, as may be assessed by the same nucleotide comparison set forth above. The term “substantially complementary” may refer to a nucleic acid comprising at least one sequence of consecutive nucleobases, or semiconsecutive nucleobases if one or more nucleobase moieties are not present in the molecule, that is capable of hybridizing to at least one nucleic acid strand or duplex even if fewer than all nucleobases base pair with a counterpart nucleobase. In certain embodiments, a “substantially complementary” nucleic acid contains at least one sequence in which about 70%, about 71%, about 72%, about 73%, about 74%, about 75%, about 76%, about 77%, about 78%, about 79%, about 80%, about 81%, about 82%, about 83%, about 84%, about 85%, about 86%, about 87%, about 88%, about 89%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, to about 100%, and any range therein, of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization. In certain embodiments, the term “substantially complementary” refers to at least one nucleic acid that may hybridize to at least one nucleic acid strand or duplex in stringent conditions. In certain embodiments, a “partially complementary” nucleic acid comprises at least one sequence that may hybridize in low stringency conditions to at least one single or double-stranded nucleic acid, or contains at least one sequence in which less than about 70% of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization.
[0085] The term “non-complementary” refers to nucleic acid sequence that lacks the ability to form at least one Watson-Crick base pair through specific hydrogen bonds.
[0086] The term “degenerate” as used herein refers to a nucleotide or series of nucleotides wherein the identity can be selected from a variety of choices of nucleotides, as opposed to a defined sequence. In specific embodiments, there can be a choice from two or more different nucleotides. In further specific embodiments, the selection of a nucleotide at one particular position comprises selection from only purines, only pyrimidines, or from nonpairing purines and pyrimidines.
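Degenerate positions are commonly written with the IUPAC nucleotide ambiguity codes (for example, R for a purine, Y for a pyrimidine, N for any base). The disclosure does not specify a notation, so the use of IUPAC codes in the following Python sketch is an assumption, as are the function names. The sketch expands a degenerate sequence into its defined sequences and counts how many distinct sequences a degenerate design can encode.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes for DNA.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(seq: str) -> list:
    """Enumerate every defined sequence encoded by a degenerate sequence."""
    choices = (IUPAC[base] for base in seq.upper())
    return ["".join(bases) for bases in product(*choices)]


def degenerate_diversity(seq: str) -> int:
    """Count the defined sequences without enumerating them."""
    count = 1
    for base in seq.upper():
        count *= len(IUPAC[base])
    return count


print(expand_degenerate("ARY"))          # ['AAC', 'AAT', 'AGC', 'AGT']
print(degenerate_diversity("NNNNNNNN"))  # 65536
```

The second call shows why a short fully degenerate stretch can serve as a UMI: eight N positions encode 4^8 = 65,536 possible sequences.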
[0087] The term “secondary structure” as used herein refers to the set of interactions between base pairs. For example, in a DNA double helix, the two strands of DNA are held together by hydrogen bonds. The secondary structure is responsible for the shape that the nucleic acid assumes. For a single stranded nucleic acid, the simplest secondary structure is linear. For a linear secondary structure, no two subsequences of a nucleic acid molecule form an intramolecular structure stronger than -2 kcal/mol. As another example for a single stranded nucleic acid, one portion of the nucleic acid molecule may hybridize with a second portion of the same nucleic acid molecule, thereby forming a hairpin or stem-loop secondary structure. For a non-linear secondary structure, at least two subsequences of a nucleic acid molecule form an intramolecular structure stronger than -2 kcal/mol.
[0088] As used herein, the term “subsequence” refers to a sequence of at least 5 contiguous base pairs.
[0089] As used herein, the term “mutant DNA Template” or “variant DNA Template” refers to the nucleotide sequence of a nucleic acid that harbors a desired allele, such as a single nucleotide polymorphism, to be amplified, identified, or otherwise isolated. As used herein, the term “wildtype sequence” or “background sequence” refers to the nucleotide sequence of a nucleic acid that does not harbor the desired allele. For example, in some instances, the background sequence harbors the wild-type allele whereas the variant sequence harbors the mutant allele. Thus, in some instances, the background sequence and the variant sequence are derived from a common locus in a genome such that the sequences of each may be substantially homologous except for a region harboring the desired allele, nucleotide or group of nucleotides that varies between the two.
[0090] “Sample” means a material obtained or isolated from a fresh or preserved biological sample or synthetically created source that contains nucleic acids of interest. Samples can include at least one cell, fetal cell, cell culture, tissue specimen, blood, serum, plasma, saliva, urine, tear, vaginal secretion, sweat, lymph fluid, cerebrospinal fluid, mucosa secretion, peritoneal fluid, ascites fluid, fecal matter, body exudates, umbilical cord blood, chorionic villi, amniotic fluid, embryonic tissue, multicellular embryo, lysate, extract, solution, or reaction mixture suspected of containing immune nucleic acids of interest. Samples can also include non-human sources, such as non-human primates, rodents and other mammals, other animals, plants, fungi, bacteria, and viruses.
[0091] The term “secondary structure” as used herein refers to the set of interactions between base pairs. For example, in a DNA double helix, the two strands of DNA are held together by hydrogen bonds. The secondary structure is responsible for the shape that the nucleic acid assumes. For a single stranded nucleic acid, the simplest secondary structure is linear. For a linear secondary structure, no two subsequences of a nucleic acid molecule form an intramolecular structure stronger than -2 kcal/mol. As another example for a single stranded nucleic acid, one portion of the nucleic acid molecule may hybridize with a second portion of the same nucleic acid molecule, thereby forming a hairpin or stem-loop secondary structure. For a non-linear secondary structure, at least two subsequences of a nucleic acid molecule form an intramolecular structure stronger than -2 kcal/mol.
[0092] A “Target” for a chimeric amplification system described herein can be any single-stranded nucleic acid, such as single-stranded DNA and single-stranded RNA, including double-stranded DNA and RNA rendered single-stranded through heat shock, asymmetric amplification, competitive binding, and other methods standard to the art. A DNA Target may be the product of RNA subjected to reverse transcription. In some instances, a Target may be a mixture (chimera) of DNA and RNA. In other instances, a Target comprises artificial nucleic acid analogs. In some instances, a Target may be naturally occurring (e.g., genomic DNA) or it may be synthetic (e.g., from a genomic library). As used herein, a “naturally occurring” nucleic acid sequence is a sequence that is present in nucleic acid molecules of organisms or viruses that exist in nature in the absence of human intervention or that is present in any biological sample. In some instances, a Target is genomic DNA, messenger RNA, ribosomal RNA, cell-free DNA, micro-RNA, pre-micro-RNA, pro-micro-RNA, long non-coding RNA, small RNA, epigenetically modified DNA, epigenetically modified RNA, viral DNA, viral RNA or piwi-RNA. In certain instances, a Target nucleic acid is a nucleic acid that naturally occurs in an organism or virus. In some instances, a Target nucleic acid is the nucleic acid of a pathogenic organism or virus. In certain instances, a Target of interest is linear, while in other instances, a Target is circular (e.g., plasmid DNA, mitochondrial DNA, or plastid DNA).
[0093] In certain instances, a Target nucleic acid molecule of interest is about 19 to about 1,000,000 nucleotides (nt) in length. In some instances, the Target is about 19 to about 100, about 100 to about 1000, about 1000 to about 10,000, about 10,000 to about 100,000, or about 100,000 to about 1,000,000 nucleotides in length. In some instances, the Target is about 20, about 100, about 200, about 300, about 400, about 500, about 600, about 700, about 800, about 900, about 1,000, about 2,000, about 3,000, about 4,000, about 5,000, about 6,000, about 7,000, about 8,000, about 9,000, about 10,000, about 20,000, about 30,000, about 40,000, about 50,000, about 60,000, about 70,000, about 80,000, about 90,000, about 100,000, about 200,000, about 300,000, about 400,000, about 500,000, about 600,000, about 700,000, about 800,000, about 900,000, or about 1,000,000 nucleotides in length. It is to be understood that the Target nucleic acid may be provided in the context of a longer nucleic acid (e.g., such as a coding sequence or gene within a chromosome or a chromosome fragment).
[0094] “Biological sample” means a material obtained or isolated from a fresh or preserved biological sample or synthetically created source that contains nucleic acids of interest. Samples can include at least one cell, fetal cell, cell culture, tissue specimen, blood, serum, plasma, saliva, urine, tear, vaginal secretion, sweat, lymph fluid, cerebrospinal fluid, mucosa secretion, peritoneal fluid, ascites fluid, fecal matter, body exudates, umbilical cord blood, chorionic villi, amniotic fluid, embryonic tissue, multicellular embryo, lysate, extract, solution, or reaction mixture suspected of containing immune nucleic acids of interest. Samples can also include non-human sources, such as non-human primates, rodents and other mammals, other animals, plants, fungi, bacteria, and viruses.
[0095] As used herein in relation to a nucleotide sequence, “substantially known” refers to having sufficient sequence information in order to permit preparation of a nucleic acid molecule, including its amplification. This will typically be about 100%, although in some embodiments some portion of an adapter sequence is random or degenerate. Thus, in specific embodiments, substantially known refers to about 50% to about 100%, about 60% to about 100%, about 70% to about 100%, about 80% to about 100%, about 90% to about 100%, about 95% to about 100%, about 97% to about 100%, about 98% to about 100%, or about 99% to about 100%.
V. Kits
[0096] The technology described herein includes kits comprising Stopper oligonucleotides as disclosed herein. Exemplary kits include qPCR kits, Sanger kits, NGS panels, and nanopore sequencing panels. A “kit” refers to a combination of physical elements. For example, a kit may include one or more components such as nucleic acid Stoppers, nucleic acid Inhibitors, enzymes, reaction buffers, an instruction sheet, and other elements useful to practice the technology described herein. These physical elements can be arranged in any way suitable for carrying out the invention.
[0097] The components of the kits may be packaged either in aqueous media or in lyophilized form. The container means of the kits will generally include at least one vial, test tube, flask, bottle, syringe or other container means, into which a component may be placed, and preferably, suitably aliquoted (e.g., aliquoted into the wells of a microtiter plate). Where there is more than one component in the kit, the kit also will generally contain a second, third or other additional container into which the additional components may be separately placed. However, various combinations of components may be comprised in a single vial. The kits of the present invention also will typically include a means for containing the nucleic acids, and any other reagent containers in close confinement for commercial sale. Such containers may include injection or blow molded plastic containers into which the desired vials are retained. A kit will also include instructions for employing the kit components as well as the use of any other reagent not included in the kit. Instructions may include variations that can be implemented.
VI. Examples
[0098] The following examples are included to demonstrate preferred embodiments of the invention. It should be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function well in the practice of the invention, and thus can be considered to constitute preferred modes for its practice. However, those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit and scope of the invention.
Example 1 - Long-read DNA Sequencing through concatenating Chimeric Amplicon Reads in NGS
[0099] To demonstrate the concatenating efficiency of the system, the pairing efficiency was first experimentally validated (FIGS. 7A-C). Mixture 1 and mixture 2 were pre-annealed using the same Target but different pairs of 1st Stopper and 2nd Stopper, and put into the same reaction. When the pre-annealed mixture 1 and mixture 2 were mixed at a ratio of 1 to 1, the number of reads from UMI Sequences 1a&2a (UMI Set A) was similar to the number of reads from Identity Sequences 1b&2b (UMI Set B) (FIG. 7C). In addition, no read showed mispairing between UMI Set A and UMI Set B, demonstrating that the pre-annealed product did not dissociate and re-anneal in the downstream process. This confirmed that Stoppers are very unlikely to dissociate from the Target, thus supporting the concatenation of the full length of each Target molecule.
[00100] Next, the system was applied to the genome of M13 bacteriophage. The reaction system was composed of a 1st Stopper, 2nd Stopper, and 3rd Stopper, all of which have binding regions on the M13 genome (Target nucleic acid) (FIG. 8A). The system was designed such that the T4 Polymerase would extend from the 1st Stopper and switch templates with the help of the 2nd Stopper, generating the 1st-2nd Amplicon. Similarly, the T4 Polymerase would generate the 2nd-3rd Amplicon based on the 2nd and the 3rd Stopper. The 1st-2nd Amplicon and the 2nd-3rd Amplicon, which share the same 2nd UMI Sequence, are defined as 1st-2nd-3rd UMI linked molecules. A summary of the NGS data is shown in FIG. 8A. The number of unique molecular pairs of 1st-2nd Amplicons is 3545, close to that of 2nd-3rd Amplicons, which is 3983. In addition, the amount and percentage of Amplicon pairs where the 1st-2nd Amplicon shares the same 2nd Identity Sequence with the 2nd-3rd Amplicon was calculated. About 2119 molecules, that is, 59.77% of the 3545 1st-2nd Amplicon pairs, could be linked together by the same 2nd UMI Sequences. The UMI family size distribution plots of the 1st-2nd Amplicon and the 2nd-3rd Amplicon are shown in FIG. 8C.
[00101] If the 2nd Stopper failed to attach, or fell off after annealing and before polymerase extension, a 1st-3rd WT Amplicon would be generated as an unwanted side product. No reads were found indicating the existence of the 1st-3rd WT Amplicon (see FIG. 8B), which matches the results shown in FIG. 7C.
[00102] An experimental demonstration of the feasibility of detecting both the wildtype and variant Target nucleic acid simultaneously is shown in FIGS. 9A-C. The reaction system comprised the 1st Stopper, 2nd Stopper, and 3rd Stopper and the wildtype Target (FIG. 9A). Thus, the 1st-2nd Amplicon and the 2nd-3rd Amplicon should be generated. When the 1st Stopper, 2nd Stopper, and 3rd Stopper were mixed with the variant Target, only the 1st-3rd Var Amplicon should be generated (FIG. 9B). When the wildtype and the variant Target were mixed together, the reads and the UMI family sizes of the 1st-2nd Amplicon and the 2nd-3rd Amplicon were close to those in the wildtype-only group (FIG. 9C). The reads of the 1st-3rd Var Amplicon were also close to those in the variant-only group.
Example 2 - MATLAB simulation
[00103] A MATLAB simulation of the number of reads and molecules needed for N Amplicons is shown in FIGS. 10A-C. As shown in FIG. 10A, when the number of Amplicons is 5, only 0.089M reads are necessary to sequence all the Amplicons with a sufficient number of continuous linked UMI families (100X) and good sequencing depth (15X) on each UMI family. However, when there are 25 Amplicons, 559M reads are required for optimal sequencing coverage as well as a sufficient number of UMI families, which is 6280-fold the reads needed for 5 Amplicons. Even when comparing the reads needed for 25 Amplicons with those for 20 Amplicons, 6.5X more reads are required to expand from 20 Amplicons to 25 Amplicons.
[00104] The growth curve of the required input molecules is similar to that of the sequencing reads. When the number of Amplicons is 5, only 0.6k input molecules are necessary to obtain sufficient continuous linked UMI families (100X), but when the number of Amplicons increases to 25, 745k molecules are necessary, which is 1241-fold the number of molecules needed for 5 Amplicons. Even when comparing the molecules needed for 25 Amplicons with those needed for 20 Amplicons, 5X more input molecules are required to expand from 20 Amplicons to 25 Amplicons.
[00105] This simulation demonstrates that the increase of Amplicon number would significantly increase the demand for sequence reads and input molecules. Therefore, when the total length of Target is long, it is more economical to increase the length of each Amplicon rather than to increase the total number of Amplicons.
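The fold increases quoted above follow directly from the simulated requirements. The short sketch below only reproduces that arithmetic from the values stated in the text; it does not re-implement the MATLAB simulation, and the variable names are illustrative.

```python
# Requirements quoted in the text for 5 and 25 Amplicons.
reads_needed_millions = {5: 0.089, 25: 559.0}
molecules_needed_thousands = {5: 0.6, 25: 745.0}


def fold_increase(requirement, small=5, large=25):
    """Ratio of the requirement at `large` Amplicons to that at `small` Amplicons."""
    return requirement[large] / requirement[small]


read_fold = fold_increase(reads_needed_millions)
molecule_fold = fold_increase(molecules_needed_thousands)

# The text reports the whole-number part of each ratio.
print(f"reads: {int(read_fold)}-fold")
print(f"input molecules: {int(molecule_fold)}-fold")
```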
Example 3 - Bioinformatics pipeline
[00106] The bioinformatics pipeline for single-end sequencing is shown in FIGS. 11A-B. If the length of the Sequencing-Insert of the library, i.e., the total length of the library minus the length of the Sequencing adaptor, is less than the sequencing length (typically 75nt to 600nt), then it is possible to collect both the UMI left (5' of the read) and the UMI right (3' of the read) within every single-end read. N reads are sequenced, generated by the sequencer, and written to the FASTQ file. Each read should be composed of the UMI left, the Insert Sequence, the UMI right, and the complete or partial sequence of the Sequencing adaptor. The adaptor sequence of each read is then trimmed. The trimmed reads are grouped by identical Insert Sequence first and then further collapsed based on the UMI set (UMI left + UMI right). Finally, molecules with different Insert Sequences are linked together when they share the same UMI sequence. One example is shown in FIG. 11B. If the sequence of the UMI right of a molecule with Insert Sequence 1 is the same as the sequence of the UMI left of another molecule with Insert Sequence 2, then the two molecules originated from the same Target and can be linked together.
[00107] The workflow for paired-end sequencing is shown in FIGS. 12A-B. Suppose the length of the Sequencing-Insert of the library, i.e., the total length of the library minus the length of the Sequencing adaptor, is more than the sequencing length (typically 75nt to 600nt). In that case, paired-end sequencing is necessary to sequence the UMIs at both sides of the library. When N pairs of reads are sequenced and generated by the sequencer, each R1 should be composed of the UMI left and a partial Insert Sequence, and each R2 should be composed of the UMI right and another part of the Insert Sequence. Each UMI set is a UMI left and a UMI right from a pair of R1 and R2. Each pair of reads is then clustered based on the result of alignment; that is, two pairs of reads are considered to have the same Insert Sequence if they can be aligned to the identical sub-region of the Target. Each pair of reads is then further collapsed based on the same UMI set. Finally, molecules with different Insert Sequences are linked together when they share the same UMI sequence. For example, as shown in FIG. 12B, if the sequence of the UMI right of a molecule with Insert Sequence 1 is the same as the sequence of the UMI left of another molecule with Insert Sequence 2, then the two molecules originated from the same Target and can be linked together.
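The single-end steps above (trim, group by Insert Sequence, collapse by UMI set, link by shared UMI) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the figures: UMIs are assumed to have a fixed length `umi_len`, only reads containing the complete adaptor are kept, and the function names are hypothetical. A production workflow would normally use a dedicated adaptor trimmer.

```python
from collections import defaultdict


def parse_read(read, umi_len, adaptor):
    """Trim the adaptor and split a read into (UMI left, Insert Sequence, UMI right).

    Assumption: the complete adaptor is present; reads without it are skipped.
    """
    cut = read.find(adaptor)
    if cut == -1:
        return None
    trimmed = read[:cut]
    if len(trimmed) <= 2 * umi_len:
        return None
    return trimmed[:umi_len], trimmed[umi_len:-umi_len], trimmed[-umi_len:]


def collapse(reads, umi_len, adaptor):
    """Group by Insert Sequence and collapse on the UMI set.

    Returns {(insert, umi_left, umi_right): read count}.
    """
    molecules = defaultdict(int)
    for read in reads:
        parsed = parse_read(read, umi_len, adaptor)
        if parsed is not None:
            left, insert, right = parsed
            molecules[(insert, left, right)] += 1
    return dict(molecules)


def link(molecules):
    """Pair each molecule with molecules of a different insert whose UMI left equals its UMI right."""
    by_left = defaultdict(list)
    for mol in molecules:
        by_left[mol[1]].append(mol)
    links = []
    for mol in molecules:
        insert, _, right = mol
        for other in by_left.get(right, []):
            if other[0] != insert:
                links.append((mol, other))
    return links


# Toy reads with a hypothetical 4-nt UMI and a hypothetical adaptor.
ADAPTOR = "AGATCGGA"
toy_reads = [
    "AAAA" + "CCCCGGGG" + "TTTT" + ADAPTOR,
    "AAAA" + "CCCCGGGG" + "TTTT" + ADAPTOR,  # duplicate read, collapsed
    "TTTT" + "GATTACAG" + "CACA" + ADAPTOR,
]
toy_molecules = collapse(toy_reads, umi_len=4, adaptor=ADAPTOR)
toy_links = link(toy_molecules)
```

In the toy data, the first molecule's UMI right (`TTTT`) equals the second molecule's UMI left, so the two inserts are linked as neighbors on the same Target.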
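For the paired-end case, the collapse step can be sketched as below. The sketch assumes that each read pair has already been assigned to a Target sub-region by an external aligner (the `region` label is a hypothetical stand-in for that result), that UMIs have a fixed length, and that the UMI right is read at the start of R2 on the opposite strand, so it is reverse-complemented to compare it with UMI left sequences. These orientation details are assumptions and depend on the actual library design.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def umi_set(r1, r2, umi_len):
    """UMI left from the start of R1; UMI right from the start of R2, flipped to the R1 strand."""
    return r1[:umi_len], reverse_complement(r2[:umi_len])


def collapse_pairs(pairs, umi_len):
    """Collapse read pairs that share both the aligned sub-region and the UMI set.

    `pairs` is an iterable of (r1, r2, region). Pairs with N in either UMI are discarded.
    Returns a Counter keyed by (region, umi_left, umi_right).
    """
    families = Counter()
    for r1, r2, region in pairs:
        left, right = umi_set(r1, r2, umi_len)
        if "N" in left or "N" in right:
            continue
        families[(region, left, right)] += 1
    return families


# Toy pairs with a hypothetical 4-nt UMI.
toy_pairs = [
    ("AAAACCCC", "GGTTAAAC", "amplicon_1"),
    ("AAAACCCC", "GGTTAAAC", "amplicon_1"),
    ("AACCTTTT", "CAGTGGGG", "amplicon_2"),
]
toy_families = collapse_pairs(toy_pairs, umi_len=4)
```

Here the UMI right of `amplicon_1` (`AACC`) equals the UMI left of `amplicon_2`, so the two families would be linked in the final step.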
Example 4 - Next Generation Sequencing (NGS) Library Preparation Protocols
[00108] The data for the NGS experiments summarized in FIG. 7C were collected using an Illumina NextSeq instrument and a NextSeq 500/550 Mid Output Kit v2.5 2x150 cycle paired-end kit. Each library input was 60,000 copies of Target 1 DNA oligonucleotide (Integrated DNA Technology). The library preparation process is briefly summarized below:
[00109] 1. Prepare Mixture 1: mix 0.5 µl 100 fM Target 1 DNA oligonucleotide, 1 µl 1 µM 1st Stopper 1a, 1 µl 1 µM 2nd Stopper 2a, and 1.17 µl water with 1.34 µl 5X T4 DNA Polymerase Reaction buffer (Thermo Fisher), and perform the annealing step in an Eppendorf Mastercycler. The thermal cycling protocol was as follows: (1) 95 °C for 2 minutes; (2) cool from 95 °C to 20 °C at a ramp speed of 0.01 °C per 6 seconds. The lid temperature was set at 105 °C.
[00110] 2. Prepare Mixture 2: mix 0.5 µl 100 fM Target 1 DNA oligonucleotide, 1 µl 1 µM 1st Stopper 1a, 1 µl 1 µM 2nd Stopper 2a, and 1.17 µl water with 1.34 µl 5X T4 DNA Polymerase Reaction buffer, and perform the annealing step in an Eppendorf Mastercycler with the same protocol as above.
[00111] 2.1. For libraries with only mixture 1 or mixture 2, perform isothermal PCR (37 °C for 35 minutes, 75 °C for 10 minutes, lid set at 90 °C) by adding 1 µl T4 DNA Polymerase (Thermo Fisher), 10 µl 5X T4 DNA reaction buffer (Thermo Fisher), 1 µl 10 mM dNTP (New England Biolabs) mixture and 33 µl nuclease-free water into the 5 µl annealed mixture for a final 50 µl reaction volume.
[00112] 2.2. For libraries with the 1:1 mixture of mixture 1 and mixture 2, perform isothermal PCR (37 °C for 35 minutes, 75 °C for 10 minutes, lid set at 90 °C) by adding 1 µl T4 DNA Polymerase (Thermo Fisher), 10 µl 5X T4 DNA reaction buffer (Thermo Fisher), 1 µl 10 mM dNTP (New England Biolabs) mixture and 33 µl nuclease-free water into the rationally mixed products from annealing (2.5 µl mixture 1 and 2.5 µl mixture 2) for a final 50 µl reaction volume.
[00113] 3. Perform DNA purification using 1.4x SPRI beads.
[00114] 4. Perform 28 cycles of PCR (95 °C for 10 seconds, 60 °C for 30 seconds) using iTaq mastermix (Bio-Rad) and 400 nM of each primer, amplifying the generated chimeric amplicons.
[00115] 5. Perform DNA purification using 1.35x SPRI beads.
[00116] 6. Perform 2 cycles of adapter PCR (95 °C for 10 seconds, 60 °C for 30 seconds) using iTaq mastermix (Bio-Rad) and 400 nM of each adaptor primer.
[00117] 7. Perform DNA purification using 1x SPRI beads.
[00118] 8a. For libraries with only mixture 1 or mixture 2, perform 8 cycles of index PCR (95 °C for 10 seconds, 60 °C for 30 seconds) using iTaq mastermix (Bio-Rad) and 400 nM index primers.
[00119] 8b. For libraries with both mixture 1 and mixture 2, perform 10 cycles of index PCR (95 °C for 10 seconds, 60 °C for 30 seconds) using iTaq mastermix (Bio-Rad) and 400 nM index primers.
[00120] 9. Perform DNA purification using 0.9x SPRI beads.
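Two quantities implied by the protocol above are easy to check with a few lines of arithmetic: the duration of the annealing ramp and the reaction volumes. The sketch below uses only the numbers stated in steps 1 and 2.1; the variable names are illustrative.

```python
# Annealing ramp: 95 °C down to 20 °C in 0.01 °C steps, 6 seconds per step.
start_c, end_c = 95.0, 20.0
step_c, seconds_per_step = 0.01, 6
ramp_steps = round((start_c - end_c) / step_c)
ramp_hours = ramp_steps * seconds_per_step / 3600

# Mixture 1 annealing volumes in µl: Target, 1st Stopper, 2nd Stopper, water, 5X buffer.
annealing_ul = [0.5, 1, 1, 1.17, 1.34]
annealed_total_ul = sum(annealing_ul)

# Step 2.1 volumes in µl: polymerase, 5X buffer, dNTP, water, annealed mixture.
extension_ul = [1, 10, 1, 33, 5]
extension_total_ul = sum(extension_ul)

print(f"ramp: {ramp_steps} steps, {ramp_hours} hours")
print(f"annealed mixture: {annealed_total_ul:.2f} µl; extension reaction: {extension_total_ul} µl")
```

The ramp therefore takes 7,500 steps, or 12.5 hours, and the annealing components sum to 5.01 µl, consistent with the 5 µl of annealed mixture carried into the 50 µl reaction.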
Example 5 - NGS Bioinformatic Analysis Methods
[00121] The method for analyzing NGS reads from NGS FASTQ files in FIG. 7C is summarized below:
[00122] 1. Trim adapter sequences from each read.
[00123] 2. Count the number of insert reads that perfectly match the 1st UMI Sequence 1a or 1st UMI Sequence 1b in Read 1 (R1). Record the corresponding read ID. Any degenerate nucleotides in the reads, such as N, are considered mismatched and are discarded.
[00124] 3. Count the number of insert reads that perfectly match the 2nd UMI Sequence 2a or 2nd UMI Sequence 2b in Read 2. Record the corresponding read ID. Any degenerate nucleotides in the reads, such as N, are considered mismatched and are discarded.
[00125] 4. Pair each 1st UMI Sequence and each 2nd UMI Sequence based on the same read ID.
[00126] 5. Count the number of UMI Set A (1st UMI Sequence 1a and 2nd UMI Sequence 2a) and the number of UMI Set B (1st UMI Sequence 1b and 2nd UMI Sequence 2b).
Example 6 - Next Generation Sequencing (NGS) Library Preparation Protocols
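The counting steps above can be sketched as follows. This is a minimal illustration rather than the analysis code used for FIG. 7C: a "perfect match" is assumed to mean that the known sequence occurs verbatim in the trimmed read, the sequences in the usage example are hypothetical placeholders (not those of Table 1), and the function names are illustrative.

```python
def match_label(read, known):
    """Return the label of the known sequence found verbatim in the read, else None.

    Reads containing N are treated as mismatched and discarded.
    """
    if "N" in read:
        return None
    for label, seq in known.items():
        if seq in read:
            return label
    return None


def count_umi_sets(r1_reads, r2_reads, known_first, known_second):
    """Pair R1 and R2 matches by read ID and count UMI Set A, UMI Set B, and mispairings.

    r1_reads / r2_reads map read ID -> trimmed read sequence.
    """
    counts = {"A": 0, "B": 0, "mispaired": 0}
    for read_id, r1 in r1_reads.items():
        first = match_label(r1, known_first)
        second = match_label(r2_reads.get(read_id, "N"), known_second)
        if first is None or second is None:
            continue
        if (first, second) == ("1a", "2a"):
            counts["A"] += 1
        elif (first, second) == ("1b", "2b"):
            counts["B"] += 1
        else:
            counts["mispaired"] += 1
    return counts


# Hypothetical placeholder sequences for illustration only.
known_first = {"1a": "ACGTAC", "1b": "TGCATG"}
known_second = {"2a": "GGATCC", "2b": "CCTAGG"}
toy_r1 = {"r1": "TTACGTACTT", "r2": "AATGCATGAA", "r3": "TTACGTACTT", "r4": "ACGTNACGTAC"}
toy_r2 = {"r1": "AAGGATCCAA", "r2": "AACCTAGGAA", "r3": "AACCTAGGAA", "r4": "GGATCC"}
toy_counts = count_umi_sets(toy_r1, toy_r2, known_first, known_second)
```

A nonzero `mispaired` count would indicate exchange between UMI Set A and UMI Set B; the experiment in Example 1 reports no such reads.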
[00127] The data for the NGS experiments summarized in FIG. 8 were collected using an Illumina NextSeq instrument and a NextSeq 500/550 Mid Output Kit v2.5 2x150 cycle paired-end kit (FIG. 8) or an Illumina MiSeq instrument and a MiSeq micro v2-300 paired-end kit. Each library had an input of 7,500 copies of Target. The library shown in FIG. 8 and the wildtype Target shown in FIG. 7 used an input of 7,500 copies of M13mp18 single-stranded DNA oligonucleotide (New England Biolabs). The variant Target in FIG. 9 is a rationally designed oligonucleotide (Integrated DNA Technology). The library preparation process is briefly summarized below:
[00128] 1. Mix 1 µl 500 nM 1st Stopper, 1 µl 500 nM 2nd Stopper, 1 µl 500 nM 3rd Stopper, 3 µl M13mp18 ssDNA (2,500 copies/µl) or 2 µl variant Target (3,750 copies/µl), and 1.4 µl water or 2.4 µl water, respectively, with 2.6 µl 5X T4 DNA Polymerase Reaction buffer (Thermo Fisher), and perform the annealing step in an Eppendorf Mastercycler. The thermal cycling protocol is as follows: (1) 95 °C for 2 minutes; (2) cool from 95 °C to 20 °C at a ramp speed of 0.01 °C per 6 seconds. The lid temperature is set at 105 °C.
[00129] 2.1. For the library shown in FIG. 7, perform isothermal PCR (37 °C for 35 minutes, 75 °C for 10 minutes, lid set at 90 °C) by adding 1 µl T4 DNA Polymerase (Thermo Fisher), 10 µl 5X T4 DNA reaction buffer (Thermo Fisher), 1 µl 10 mM dNTP (New England Biolabs) mixture, 2 µl 1000 nM each Inhibitor and 22 µl nuclease-free water into the 10 µl annealed mixture for a final 50 µl reaction volume.
[00130] 2.2. For the libraries shown in FIG. 8, perform isothermal PCR (37 °C for 35 minutes, 75 °C for 10 minutes, lid set at 90 °C) by adding 1 µl T4 DNA Polymerase (Thermo Fisher), 10 µl 5X T4 DNA reaction buffer (Thermo Fisher), 1 µl 10 mM dNTP (New England Biolabs) mixture, and 28 µl nuclease-free water into the 10 µl annealed mixture for a final 50 µl reaction volume.
[00131] 3. Perform DNA purification using 1.3x SPRI beads.
[00132] 4. Perform 23 cycles of PCR (95 °C for 10 seconds, 60 °C for 30 seconds) using iTaq mastermix (Bio-Rad) and 300 nM of each primer, amplifying the generated chimeric amplicons.
[00133] 5. Perform DNA purification using 1.3x SPRI beads.
[00134] 6. Perform 2 cycles of adapter PCR (95 °C for 10 seconds, 60 °C for 30 seconds) using iTaq mastermix (Bio-Rad) and 150 nM of each adaptor primer.
[00135] 7. Perform DNA purification using 1x SPRI beads.
[00136] 8.1. For the library shown in FIG. 7, perform 9 cycles of index PCR (95 °C for 10 seconds, 60 °C for 30 seconds) using iTaq mastermix (Bio-Rad) and 400 nM index primers.
[00137] 8.2. For the libraries shown in FIG. 8, perform 6 cycles of index PCR (95 °C for 10 seconds, 60 °C for 30 seconds) using iTaq mastermix (Bio-Rad) and 400 nM index primers.
[00138] 9. Perform DNA purification using 0.7x SPRI beads.
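The input copy numbers and reaction volumes in this protocol can be cross-checked with simple arithmetic. The sketch below uses the volumes stated in steps 1, 2.1, and 2.2. One assumption is labeled in the code: step 2.1 states "2 µl 1000 nM each Inhibitor" without giving the number of Inhibitors, and three Inhibitors (one per Stopper) is assumed here because that makes the stated 50 µl total add up.

```python
# Input copies per library: volume (µl) x concentration (copies/µl).
m13_copies = 3 * 2500
variant_copies = 2 * 3750

# Annealing volumes in µl: three Stoppers, Target, water, 5X buffer.
anneal_m13_ul = sum([1, 1, 1, 3, 1.4, 2.6])
anneal_variant_ul = sum([1, 1, 1, 2, 2.4, 2.6])

# ASSUMPTION: three Inhibitors (one per Stopper); the text does not state the count.
n_inhibitors = 3

# Extension volumes in µl: polymerase, 5X buffer, dNTP, (Inhibitors), water, annealed mixture.
with_inhibitors_ul = sum([1, 10, 1, 2 * n_inhibitors, 22, 10])
without_inhibitors_ul = sum([1, 10, 1, 28, 10])

print(m13_copies, variant_copies)
print(round(anneal_m13_ul, 2), round(anneal_variant_ul, 2))
print(with_inhibitors_ul, without_inhibitors_ul)
```

Both Target options give 7,500 input copies and a 10 µl annealing mixture, and both extension recipes reach the stated 50 µl final volume under the assumption above.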
Example 7 - NGS Bioinformatic Analysis Methods
[00139] The methods used to analyze NGS reads from NGS FASTQ files in FIG. 8 and FIG. 9 are summarized below:
[00140] 1. Trim adapter sequences from each read.
[00141] 2. Count the number of insert paired-end reads that perfectly match the 1st-2nd Amplicon (1st-2nd Amplicon Reads), 2nd-3rd Amplicon (2nd-3rd Amplicon Reads) or 1st-3rd Var Amplicon (1st-3rd Var Amplicon Reads). Any degenerate nucleotides in the reads, such as N, are considered mismatched and are discarded.
[00142] 3. Count the number of Unique Molecular Pairs (UMP) of the 1st-2nd Amplicon. For each pair of reads that perfectly matches the sequence of the 1st-2nd Amplicon, there is a 1st UMI Sequence and a 2nd UMI Sequence. For each unique 1st UMI Sequence, search through all the 2nd UMI Sequences and find all the 2nd UMI Sequences related to the same 1st UMI Sequence. Keep the pair as one Unique Molecular Pair, and collect the most dominant 2nd UMI Sequence, only when a single 2nd UMI Sequence accounts for over 90% of the reads. Count the number of UMP 12.
[00143] 4. Count the number of Unique Molecular Pairs (UMP) of the 2nd-3rd Amplicon. For each pair of reads that perfectly matches the sequence of the 2nd-3rd Amplicon, there is a 2nd UMI Sequence and a 3rd UMI Sequence. For each unique 2nd UMI Sequence, search through all the 3rd UMI Sequences and find all the 3rd UMI Sequences related to the same 2nd UMI Sequence. Keep the pair as one Unique Molecular Pair, and collect the most dominant 3rd UMI Sequence, only when a single 3rd UMI Sequence accounts for over 90% of the reads. Count the number of UMP 23.
[00144] 5. Count the number of Unique Molecular Pairs (UMP) of the 1st-3rd Var Amplicon. For each pair of reads that perfectly matches the sequence of the 1st-3rd Var Amplicon, there is a 1st UMI Sequence and a 3rd UMI Sequence. For each unique 1st UMI Sequence, search through all the 3rd UMI Sequences and find all the 3rd UMI Sequences related to the same 1st UMI Sequence. Keep the pair as one Unique Molecular Pair, and collect the most dominant 3rd UMI Sequence, only when a single 3rd UMI Sequence accounts for over 90% of the reads. Count the number of UMP 13.
[00145] 6. Link the UMP 12 and UMP 23 by the same UMI 2 sequence.
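The UMP calling and linking steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used for FIG. 8 and FIG. 9: the input is assumed to be a list of (upstream UMI, downstream UMI) tuples, one per perfectly matching read pair, "over 90%" is read as a strict threshold on the fraction of reads, and the function names and toy UMI labels are hypothetical.

```python
from collections import Counter, defaultdict


def call_umps(read_pairs, dominance=0.9):
    """Call Unique Molecular Pairs from (upstream_umi, downstream_umi) read pairs.

    An upstream UMI is kept only when one downstream UMI accounts for more than
    `dominance` of its reads. Returns {upstream_umi: dominant downstream_umi}.
    """
    partners = defaultdict(Counter)
    for upstream, downstream in read_pairs:
        partners[upstream][downstream] += 1
    umps = {}
    for upstream, counts in partners.items():
        downstream, n = counts.most_common(1)[0]
        if n / sum(counts.values()) > dominance:
            umps[upstream] = downstream
    return umps


def link_umps(ump12, ump23):
    """Join UMP 12 (1st UMI -> 2nd UMI) with UMP 23 (2nd UMI -> 3rd UMI) on the shared 2nd UMI."""
    return [(u1, u2, ump23[u2]) for u1, u2 in ump12.items() if u2 in ump23]


# Toy read pairs with placeholder UMI labels.
pairs12 = (
    [("A1", "B1")] * 10                          # single partner: kept
    + [("A2", "B2")] * 5 + [("A2", "B3")] * 5    # 50/50 split: dropped
    + [("A3", "B4")] * 19 + [("A3", "B5")]       # 95% dominant: kept
)
pairs23 = [("B1", "C1")] * 3 + [("B9", "C9")] * 2

ump12 = call_umps(pairs12)
ump23 = call_umps(pairs23)
linked = link_umps(ump12, ump23)
linked_fraction = len(linked) / len(ump12)
```

The linked fraction computed this way corresponds to the figure reported in Example 1, where 2119 of 3545 1st-2nd unique molecular pairs (59.77%) were linked to a 2nd-3rd pair.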
Table 1. Sequences
[Table 1 appears as images in the source: Figure imgf000044_0001 through Figure imgf000049_0001.]
[00146] All of the methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the compositions and methods of this invention have been described in terms of preferred embodiments, it will be apparent to those of skill in the art that variations may be applied to the methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the invention. More specifically, it will be apparent that certain agents which are both chemically and physiologically related may be substituted for the agents described herein while the same or similar results would be achieved. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope and concept of the invention as defined by the appended claims.
REFERENCES
The following references, to the extent that they provide exemplary procedural or other details supplementary to those set forth herein, are specifically incorporated herein by reference.
U.S. Provisional Application No. 63/182,154, filed April 30, 2021

Claims

WHAT IS CLAIMED IS
1. A composition comprising: a 1st Stopper oligonucleotide, wherein the 1st Stopper comprises from 5' to 3' a 1st Identity Sequence with a length between 5nt and 200nt, a 1st C Sequence with a length between 6nt and 500nt, and a 2nd Stopper oligonucleotide, wherein the 2nd Stopper comprises from 5' to 3' a 2nd Identity Sequence with a length between 5nt and 200nt, wherein the 2nd Identity Sequence comprises a 2nd UMI Sequence, a 2nd A Sequence with a length between 3nt and 50nt, a 2nd Loop Sequence with a length between 3nt and 70nt, and a 2nd B Sequence with a length between 3nt and 50nt, wherein the 2nd B Sequence is the reverse complement of the 2nd A Sequence, and a 2nd C Sequence with a length between 6nt and 500nt, and a 3rd Stopper oligonucleotide, wherein the 3rd Stopper comprises from 5' to 3' a 3rd Identity Sequence with a length between 5nt and 200nt, a 3rd A Sequence with a length between 3nt and 50nt, a 3rd Loop Sequence with a length between 3nt and 70nt, and a 3rd B Sequence with a length between 3nt and 50nt, wherein the 3rd B Sequence is the reverse complement of the 3rd A Sequence, and a 3rd C Sequence with a length between 6nt and 500nt, and a template-dependent polymerase enzyme, and reagents and buffers needed for polymerase function, wherein the 1st C Sequence is complementary to a 1st Binding Region sequence on a potential Target nucleic acid, wherein the 2nd C Sequence is complementary to a 2nd Binding Region sequence on the potential Target nucleic acid, wherein the 2nd B Sequence is complementary to a 2nd Match Region sequence to the 3' of the 2nd Binding Region of the potential Target nucleic acid, wherein the 3rd C Sequence is complementary to a 3rd Binding Region sequence on a potential Target nucleic acid, wherein the 3rd B Sequence is complementary to a 3rd Match Region sequence to the 3' of the 3rd Binding Region of the potential Target nucleic acid, wherein the 2nd UMI Sequence comprises a set of designed DNA sequences, and wherein the 2nd UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
2. The composition of claim 1, further comprising said Target nucleic acid.
3. The composition of claim 1 or 2, wherein the 1st Identity Sequence comprises a 1st UMI Sequence.
4. The composition of claim 1 or 2, wherein the 3rd Identity Sequence comprises a 3rd UMI Sequence.
5. The composition of claim 3 or 4, wherein each UMI Sequence comprises a set of designed DNA sequences.
6. The composition of claim 3 or 4, wherein the UMI Sequences comprise degenerate nucleotides, selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
7. The composition of claim 1, 3, or 4, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Hamming distance of 2, 3, 4, or 5.
8. The composition of claim 1, 3, or 4, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Levenshtein distance of 2, 3, 4, or 5.
9. The composition of any one of claims 1-8, wherein the 1st Stopper further comprises, from 5' to 3', a 1st A Sequence with a length between 3nt and 50nt, a 1st Loop Sequence with a length between 3nt and 70nt, and a 1st B Sequence with a length between 3nt and 50nt, wherein the 1st B Sequence is the reverse complement of the 1st A Sequence.
10. The composition of claim 9, wherein the potential Target comprises a 1st Match Region to the 3' of the 1st Binding Region and is complementary to the 1st B Sequence.
11. The composition of any one of claims 1-10, wherein the polymerase is a thermostable polymerase.
12. The composition of any one of claims 1-10, wherein the polymerase is not a thermostable polymerase.
13. The composition of any one of claims 1-12, further comprising at least one Inhibitor DNA oligonucleotide, wherein the Inhibitor has a subsequence that is reverse complementary to a subsequence of the C Sequence of a corresponding Stopper.
14. The composition of claim 13, wherein the Inhibitor DNA oligonucleotide has a subsequence at the 3' end at least 3nt long that is not reverse complementary to the corresponding Stopper.
15. The composition of claim 14, wherein the subsequence at the 3' end of Inhibitor oligonucleotide forms at least one hairpin structure.
16. The composition of claim 13, wherein the Inhibitor DNA oligonucleotide comprises nonnatural nucleotides.
17. The composition of claim 16, wherein the Inhibitor DNA oligonucleotide has a chemical functionalization at the 3' end that prevents polymerase extension, including but not limited to a 3-carbon spacer, an inverted nucleotide, or a minor groove binder.
18. The composition of any one of claims 13-17, wherein the stoichiometric ratio of the corresponding Stopper and the Inhibitor oligonucleotide is between 0.8 and 100.
19. The composition of any one of claims 1-18, wherein the mixture comprises at least four Stoppers, wherein each Stopper, except the 1st Stopper, comprises, from 5' to 3': an Identity Sequence with a length between 5nt and 200nt, an A Sequence with a length between 3nt and 50nt, a Loop Sequence with a length between 3nt and 70nt, and, a B Sequence with a length between 3nt and 50nt, wherein the B Sequence is the reverse complement of the A Sequence, and a C Sequence with a length between 6nt and 500nt.
20. The composition of claim 19, wherein each Identity Sequence comprises a UMI Sequence comprising a set of designed DNA sequences, wherein the UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
21. The composition of claim 19 or 20, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with length between 5nt and 200nt, and a C Sequence of between 6nt and 500nt.
22. The composition of claim 19 or 20, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a 1st A Sequence with a length between 3nt and 50nt, a Loop Sequence with a length between 3nt and 70nt, and, a B Sequence with a length between 3nt and 50nt, wherein the B Sequence is the reverse complement of the A Sequence, a C Sequence with a length between 6nt and 500nt.
23. The composition of claim 22, wherein the 1st Identity Sequence comprises a 1st UMI Sequence comprising a set of designed DNA sequences, wherein the 1st UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
24. The composition of claim 20 or 23, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Hamming distance of 2, 3, 4, or 5.
25. The composition of claim 20 or 23, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Levenshtein distance of 2, 3, 4, or 5.
26. The composition of any one of claims 19-25, wherein the potential Target comprises a Binding Region that is reverse complementary to a corresponding C Sequence for each of the at least four Stoppers.
27. The composition of claim 26, wherein the potential Target comprises a Match Region that is reverse complementary to a corresponding B sequence for each of the at least four Stoppers, wherein each Match Region is located to the 3' of the Binding Region corresponding to the same Stopper.
28. The composition of claim 26, wherein the potential Target comprises a Match Region that is reverse complementary to a corresponding B Sequence.
29. The composition of any one of claims 1-28, wherein, at a salinity of 0.2M sodium and a temperature of 60 °C, the C Sequence of the Stopper and the potential Target has a standard free energy of hybridization ΔG°1 between -7 kcal/mol and -50 kcal/mol, the A Sequence and the B Sequence of each Stopper has a standard free energy of hybridization ΔG°2 between -2 kcal/mol and -50 kcal/mol, and/or the Inhibitor and the C Sequence of the Stopper has a standard free energy of hybridization ΔG°3 between -7 kcal/mol and -50 kcal/mol.
30. A method for high throughput DNA sequencing, the method comprising:
(1) mixing a Sample comprising a Target molecule comprising, from 5' to 3', a 3rd Binding Region, a 3rd Match Region, a 2nd Binding Region, a 2nd Match Region and a 1st Binding Region with a template-dependent polymerase, reagents and buffers needed for enzymatic function, a 1st Stopper oligonucleotide, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a 1st C Sequence with a length between 6nt and 500nt, wherein the C Sequence is reverse complementary to the 1st Binding Region, a 2nd Stopper oligonucleotide, wherein the 2nd Stopper comprises, from 5' to 3', a 2nd Identity Sequence with a length between 5nt and 200nt, wherein the 2nd Identity Sequence comprises a 2nd UMI Sequence, a 2nd A Sequence with a length between 3nt and 50nt, a 2nd Loop Sequence with a length between 3nt and 70nt, and a 2nd B Sequence with a length between 3nt and 50nt, wherein the 2nd B Sequence is the reverse complement of the 2nd A Sequence, the 2nd B Sequence is reverse complementary to the 2nd Match Region, and a 2nd C Sequence with a length between 6nt and 500nt, wherein the 2nd C Sequence is the reverse complement of the 2nd Binding Region, and a 3rd Stopper oligonucleotide, wherein the 3rd Stopper comprises, from 5' to 3', a 3rd Identity Sequence with a length between 5nt and 200nt, a 3rd A Sequence with a length between 3nt and 50nt, a 3rd Loop Sequence with a length between 3nt and 70nt, and a 3rd B Sequence with a length between 3nt and 50nt, wherein the 3rd B Sequence is the reverse complement of the 3rd A Sequence, and the 3rd B Sequence is reverse complementary to the 3rd Match Region, and a 3rd C Sequence with a length between 6nt and 500nt, wherein the 3rd C Sequence is the reverse complement of the 3rd Binding Region, wherein the 2nd UMI Sequence comprises a set of designed DNA sequences, wherein the 2nd UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C);
(2) incubating the mixture at a temperature conducive to polymerase activity, thereby generating Amplicons, wherein each Amplicon comprises, from 5' to 3', a sequence derived from an upstream Stopper, an Insert Sequence that is the reverse complement of a subsequence between the Match Region corresponding to the downstream Stopper and the Binding Region corresponding to an upstream Stopper, a Match-Complement Sequence that is reverse complementary to the Match Region corresponding to the upstream Stopper, and an Identity-Complement Sequence that is the reverse complement of the Identity Sequence of the downstream Stopper;
(3) purifying the Amplicons;
(4) appending a sequencing adaptor or sequencing index to one side or both sides of the purified Amplicons, thereby generating a library; and
(5) subjecting the library to high-throughput sequencing.
31. The method of claim 30, wherein the method further comprises:
(6) clustering reads with the same Insert Sequence and collapsing reads with both the same 5' and 3' Identity Sequences; and
(7) concatenating neighboring reads through the same Identity Sequence, wherein: if the 3' Identity Sequence of read a is the same as the 5' Identity Sequence of read b, then the reads are concatenated by putting read b downstream of read a; if the 5' Identity Sequence of read a is the same as the 3' Identity Sequence of read b, then the reads are concatenated by putting read b upstream of read a.
32. The method of claim 30 or 31, wherein the 1st Identity Sequence comprises a 1st UMI Sequence.
33. The method of claim 30 or 31, wherein the 3rd Identity Sequence comprises a 3rd UMI Sequence.
34. The method of claim 32 or 33, wherein each UMI Sequence comprises a set of designed DNA sequences.
35. The method of any one of claims 32-34, wherein the UMI Sequences comprise degenerate nucleotides, selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
36. The method of claim 30, 32, or 33, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Hamming distance of 2, 3, 4, or 5.
37. The method of claim 30, 32, or 33, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Levenshtein distance of 2, 3, 4, or 5.
38. The method of any one of claims 30-37, wherein the 1st Stopper further comprises, from 5' to 3', a 1st A Sequence with a length between 3nt and 50nt, a 1st Loop Sequence with a length between 3nt and 70nt, and a 1st B Sequence with a length between 3nt and 50nt, wherein the 1st B Sequence is the reverse complement of the 1st A Sequence.
39. The method of claim 38, wherein the Target comprises a 1st Match Region to the 3' of the 1st Binding Region, wherein the 1st Match Region is complementary to the 1st B Sequence.
40. The method of any one of claims 30-39, wherein the Sample is further mixed with at least one Inhibitor DNA oligonucleotide, wherein the Inhibitor has a subsequence that is reverse complementary to a subsequence of the C Sequence of a corresponding Stopper.
41. The method of claim 40, wherein the Inhibitor DNA oligonucleotide has a subsequence at its 3' end that is at least 3nt long and that is not reverse complementary to the corresponding Stopper.
42. The method of claim 41, wherein the subsequence at the 3' end of the Inhibitor DNA oligonucleotide forms at least one hairpin structure.
43. The method of claim 40, wherein the Inhibitor DNA oligonucleotide comprises nonnatural nucleotides.
44. The method of claim 43, wherein the Inhibitor DNA oligonucleotide has a chemical functionalization at its 3' end that prevents polymerase extension, including but not limited to a 3-carbon spacer, an inverted nucleotide, or a minor groove binder.
45. The method of any one of claims 30-44, wherein the incubation occurs at a temperature between 10 °C and 74 °C for between 1 second and 20 hours.
46. The method of any one of claims 30-44, wherein the incubation comprises thermal cycling alternating between a temperature higher than 78 °C for between 1 second and 30 minutes and a temperature not higher than 75 °C for between 1 second and 20 hours.
47. The method of any one of claims 30-46, wherein the method comprises at least 6 thermal cycles.
48. The method of any one of claims 30-47, wherein the sequencing adapters and/or sequencing indexes are appended via ligation.
49. The method of any one of claims 30-47, wherein the sequencing adapters and/or sequencing indexes are appended via PCR.
50. The method of any one of claims 30-49, wherein the high-throughput sequencing is performed via sequencing-by-synthesis.
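The read processing recited in steps (6) and (7) of claim 31 can be illustrated with a minimal Python sketch. It is not the applicants' implementation. The `Read` record, the function names, and the majority-vote consensus are hypothetical, and the sketch assumes that the 5' and 3' Identity Sequences have already been extracted from each read and normalized to the same strand, and that each Identity Sequence marks at most one junction.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical parsed read: identities already extracted and strand-normalized."""
    id5: str     # 5' Identity Sequence
    insert: str  # Insert Sequence
    id3: str     # 3' Identity Sequence


def collapse(reads):
    """Step (6), simplified: keep one read per (5' identity, 3' identity) pair,
    taking the most frequent Insert Sequence as a crude consensus (an assumption)."""
    groups = defaultdict(Counter)
    for r in reads:
        groups[(r.id5, r.id3)][r.insert] += 1
    return [Read(id5, inserts.most_common(1)[0][0], id3)
            for (id5, id3), inserts in groups.items()]


def concatenate(reads):
    """Step (7): place read b downstream of read a when a.id3 == b.id5."""
    by_id5 = {r.id5: r for r in reads}
    ids_seen_at_3prime = {r.id3 for r in reads}
    chains = []
    # A chain starts at a read that has no upstream neighbor.
    for start in (r for r in reads if r.id5 not in ids_seen_at_3prime):
        chain, visited, current = [], set(), start
        while current is not None and current.id5 not in visited:
            visited.add(current.id5)
            chain.append(current)
            current = by_id5.get(current.id3)
        chains.append(chain)
    return chains
```

Reads that share both Identity Sequences collapse into one record, and the surviving records are then ordered by walking from each 3' Identity Sequence to the read that carries the same sequence at its 5' end.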
51. The method of any one of claims 30-49, wherein the high-throughput sequencing is performed via electrical current measurements in conjunction with a nanopore.
52. The method of any one of claims 30-51, wherein, at a salinity of 0.2 M sodium and a temperature of 60 °C, the C Sequence of the Stopper and the Target has a standard free energy of hybridization ΔG°1 between -7 kcal/mol and -50 kcal/mol, the A Sequence and the B Sequence of each Stopper has a standard free energy of hybridization ΔG°2 between -2 kcal/mol and -50 kcal/mol, and/or the Inhibitor and the C Sequence of the Stopper has a standard free energy of hybridization ΔG°3 between -7 kcal/mol and -50 kcal/mol.
53. The method of any one of claims 30-52, wherein the polymerase is a thermostable polymerase.
54. The method of any one of claims 30-52, wherein the polymerase is not a thermostable polymerase.
55. The method of any one of claims 30-54, wherein the Sample comprising the Target nucleic acid is mixed with at least four Stoppers, wherein each Stopper, except the 1st Stopper, comprises, from 5' to 3': an Identity Sequence with a length between 5nt and 200nt, an A Sequence with a length between 3nt and 50nt, a Loop Sequence with a length between 3nt and 70nt, and, a B Sequence with a length between 3nt and 50nt, wherein the B Sequence is the reverse complement of the A Sequence, and a C Sequence with a length between 6nt and 500nt.
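The structural constraints that claim 55 places on each Stopper (segment lengths, and the B Sequence being the reverse complement of the A Sequence) can be checked mechanically. The following Python sketch is illustrative only: the `Stopper` dataclass and `check_stopper` function are hypothetical names, the length ranges are treated as inclusive (an assumption), and the A and B Sequences are assumed to contain only A, C, G, and T.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Stopper:
    """Hypothetical container for the five segments, listed 5' to 3'."""
    identity: str
    a: str
    loop: str
    b: str
    c: str


# Length ranges in nt as recited in claim 55, assumed inclusive.
LENGTH_RANGES = {
    "identity": (5, 200),
    "a": (3, 50),
    "loop": (3, 70),
    "b": (3, 50),
    "c": (6, 500),
}


def check_stopper(stopper: Stopper) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for name, (low, high) in LENGTH_RANGES.items():
        length = len(getattr(stopper, name))
        if not low <= length <= high:
            problems.append(f"{name} length {length} nt is outside {low}-{high} nt")
    if stopper.b != reverse_complement(stopper.a):
        problems.append("B Sequence is not the reverse complement of the A Sequence")
    return problems
```

Because B is the reverse complement of A and the two flank the Loop Sequence, a Stopper that passes this check can fold into the A-Loop-B hairpin that the claims describe.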
56. The method of claim 55, wherein each Identity Sequence comprises a UMI Sequence comprising a set of designed DNA sequences, wherein the UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
57. The method of claim 55 or 56, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with length between 5nt and 200nt, and a C Sequence of between 6nt and 500nt.
58. The method of claim 55 or 56, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a 1st A Sequence with a length between 3nt and 50nt, a Loop Sequence with a length between 3nt and 70nt, and, a B Sequence with a length between 3nt and 50nt, wherein the B Sequence is the reverse complement of the A Sequence, a C Sequence with a length between 6nt and 500nt.
59. The method of claim 58, wherein the 1st Identity Sequence comprises a 1st UMI Sequence comprising a set of designed DNA sequences, wherein the 1st UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
60. The method of claim 56 or 59, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Hamming distance of 2, 3, 4, or 5.
61. The method of claim 56 or 59, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Levenshtein distance of 2, 3, 4, or 5.
62. The method of any one of claims 55-61, wherein the Target comprises a Binding Region that is reverse complementary to a corresponding C Sequence for each of the at least four Stoppers.
63. The method of claim 62, wherein the Target comprises a Match Region that is reverse complementary to a corresponding B sequence for each of the at least four Stoppers, wherein each Match Region is located to the 3' of the Binding Region corresponding to the same Stopper.
64. The method of claim 62, wherein the Target comprises a Match Region that is reverse complementary to a corresponding B Sequence.
65. A method for high throughput DNA sequencing, the method comprising:
(1) mixing a Sample comprising a Target molecule comprising, from 5' to 3', a 3rd Binding Region, a 3rd Match Region, a 2nd Binding Region, a 2nd Match Region and a 1st Binding Region with an annealing buffer, a 1st Stopper oligonucleotide, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a 1st C Sequence with a length between 6nt and 500nt, wherein the C Sequence is reverse complementary to the 1st Binding Region, a 2nd Stopper oligonucleotide, wherein the 2nd Stopper comprises, from 5' to 3', a 2nd Identity Sequence with a length between 5nt and 200nt, wherein the 2nd Identity Sequence comprises a 2nd UMI Sequence, a 2nd A Sequence with a length between 3nt and 50nt, a 2nd Loop Sequence with a length between 3nt and 70nt, and a 2nd B Sequence with a length between 3nt and 50nt, wherein the 2nd B Sequence is the reverse complement of the 2nd A Sequence, the 2nd B Sequence is reverse complementary to the 2nd Match Region, and a 2nd C Sequence with a length between 6nt and 500nt, wherein the 2nd C Sequence is the reverse complement of the 2nd Binding Region, and a 3rd Stopper oligonucleotide, wherein the 3rd Stopper comprises, from 5' to 3', a 3rd Identity Sequence with a length between 5nt and 200nt, a 3rd A Sequence with a length between 3nt and 50nt, a 3rd Loop Sequence with a length between 3nt and 70nt, and a 3rd B Sequence with a length between 3nt and 50nt, wherein the 3rd B Sequence is the reverse complement of the 3rd A Sequence, and the 3rd B Sequence is reverse complementary to the 3rd Match Region, and a 3rd C Sequence with a length between 6nt and 500nt, wherein the 3rd C Sequence is the reverse complement of the 3rd Binding Region, wherein the 2nd UMI Sequence comprises a set of designed DNA sequences, wherein the 2nd UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C);
(2) thermal annealing the mixture;
(3) adding a template-dependent polymerase, reagents, and buffers needed for enzymatic function;
(4) incubating the mixture at a temperature conducive to polymerase activity, thereby generating Amplicons, wherein each Amplicon comprises, from 5' to 3', a sequence derived from an upstream Stopper, an Insert Sequence that is the reverse complement of a subsequence between the Match Region corresponding to the downstream Stopper and the Binding Region corresponding to an upstream Stopper, a Match-Complement Sequence that is reverse complementary to the Match Region corresponding to the upstream Stopper, and an Identity-Complement Sequence that is the reverse complement of the Identity Sequence of the downstream Stopper;
(5) purifying the Amplicons;
(6) appending a sequencing adaptor or sequencing index to one side or both sides of the purified Amplicons, thereby generating a library; and
(7) subjecting the library to high-throughput sequencing.
66. The method of claim 65, wherein the method further comprises:
(8) clustering reads with the same Insert Sequence and collapsing reads with both the same 5' and 3' Identity Sequences; and
(9) concatenating neighboring reads through the same Identity Sequence, wherein: if the 3' Identity Sequence of read a is the same as the 5' Identity Sequence of read b, then the reads are concatenated by putting read b downstream of read a; if the 5' Identity Sequence of read a is the same as the 3' Identity Sequence of read b, then the reads are concatenated by putting read b upstream of read a.
67. The method of claim 65 or 66, wherein the 1st Identity Sequence comprises a 1st UMI Sequence.
68. The method of claim 65 or 66, wherein the 3rd Identity Sequence comprises a 3rd UMI Sequence.
69. The method of claim 67 or 68, wherein each UMI Sequence comprises a set of designed DNA sequences.
70. The method of any one of claims 67-69, wherein the UMI Sequences comprise degenerate nucleotides, selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
71. The method of claim 65, 67, or 68, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Hamming distance of 2, 3, 4, or 5.
72. The method of claim 65, 67, or 68, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Levenshtein distance of 2, 3, 4, or 5.
73. The method of any one of claims 65-72, wherein the 1st Stopper further comprises from 5' to 3', a 1st A Sequence of between 3nt and 50nt, a 1st Loop Sequence of between 3nt and 70nt, and a 1st B Sequence of between 3nt and 50nt, wherein the 1st B Sequence is the reverse complement of the 1st A Sequence.
74. The method of claim 73, wherein the Target comprises a 1st Match Region to the 3' of the 1st Binding Region and is complementary to the 1st B Sequence.
75. The method of any one of claims 65-74, wherein the system further comprises at least one Inhibitor DNA oligonucleotide, the Inhibitor has a subsequence that is reverse complementary to a subsequence of the C Sequence of the corresponding Stopper.
76. The method of claim 75, wherein the Inhibitor DNA oligonucleotide has a subsequence at the 3' end at least 3nt long that is not reverse complementary to the corresponding Stopper.
77. The method of claim 76, wherein the subsequence at the 3' end of Inhibitor DNA oligonucleotide forms at least one hairpin structure.
78. The method of claim 75, wherein the Inhibitor DNA oligonucleotide comprises nonnatural nucleotides.
79. The method of claim 78, wherein the Inhibitor DNA oligonucleotide has a chemical functionalization at the 3' end that prevents polymerase extension, including but not limited to a 3-carbon spacer, an inverted nucleotide, or a minor groove binder.
80. The method of any one of claims 65-79, wherein the annealing step comprises a thermocycling program cooling from a temperature not lower than 78 °C to a temperature not higher than 25 °C.
81. The method of claim 80, wherein the thermocycling program comprises steps that cool from 78 °C to 25 °C, wherein the solution is held at each 5 °C temperature window for at least 5 minutes.
82. The method of any one of claims 65-79, wherein the annealing step comprises subjecting the mixture to a temperature for between 10 minutes and 24 hours.
83. The method of claim 82, wherein the annealing step comprises subjecting the mixture to room temperature for between 10 minutes and 24 hours.
84. The method of any one of claims 65-83, wherein the incubation occurs at a temperature between 10 °C and 74 °C for between 1 second and 20 hours.
85. The method of any one of claims 65-83, wherein the incubation comprises thermal cycling alternating between a temperature higher than 78 °C for between 1 second and 30 minutes and a temperature not higher than 75 °C for between 1 second and 20 hours.
86. The method of claim 85, wherein the method comprises at least 6 thermal cycles.
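One possible reading of the annealing program in claim 81 is a staircase that starts at 78 °C, drops in 5 °C steps, and holds each step for 5 minutes until 25 °C is reached. The Python sketch below generates such a schedule. The function name, the default parameters, and the choice to end with a final hold at exactly 25 °C (since 78 − 25 is not a multiple of 5) are assumptions made for illustration; the claim itself only requires holds of at least 5 minutes.

```python
def annealing_program(start_c=78.0, end_c=25.0, step_c=5.0, hold_min=5.0):
    """Return a list of (temperature in °C, hold time in minutes) steps.

    The temperature decreases by step_c from start_c; a final step at end_c
    is always appended so the program finishes at the target temperature.
    """
    steps = []
    temperature = start_c
    while temperature > end_c:
        steps.append((temperature, hold_min))
        temperature -= step_c
    steps.append((end_c, hold_min))
    return steps


if __name__ == "__main__":
    for temperature, hold in annealing_program():
        print(f"{temperature:5.1f} °C for {hold:.0f} min")
```

With the defaults this yields 12 holds (78, 73, ..., 28, 25 °C) and a total annealing time of 60 minutes.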
87. The method of any one of claims 65-86, wherein the sequencing adapters and/or sequencing indexes are appended via ligation.
88. The method of any one of claims 65-86, wherein the sequencing adapters and/or sequencing indexes are appended via PCR.
89. The method of any one of claims 65-88, wherein the high-throughput sequencing is performed via sequencing-by-synthesis.
90. The method of any one of claims 65-88, wherein the high-throughput sequencing is performed via electrical current measurements in conjunction with a nanopore.
91. The method of any one of claims 65-90, wherein, at a salinity of 0.2 M sodium and a temperature of 60 °C, the C Sequence of the Stopper and the Target has a standard free energy of hybridization ΔG°1 between -7 kcal/mol and -50 kcal/mol, the A Sequence and the B Sequence of each Stopper has a standard free energy of hybridization ΔG°2 between -2 kcal/mol and -50 kcal/mol, and/or the Inhibitor and the C Sequence of the Stopper has a standard free energy of hybridization ΔG°3 between -7 kcal/mol and -50 kcal/mol.
92. The method of any one of claims 65-91, wherein the polymerase is a thermostable polymerase.
93. The method of any one of claims 65-91, wherein the polymerase is not a thermostable polymerase.
94. The method of any one of claims 65-93, wherein the Sample comprising the Target nucleic acid is mixed with at least four Stoppers, wherein each Stopper, except the 1st Stopper, comprises, from 5' to 3': an Identity Sequence with a length between 5nt and 200nt, an A Sequence with a length between 3nt and 50nt, a Loop Sequence with a length between 3nt and 70nt, and, a B Sequence with a length between 3nt and 50nt, wherein the B Sequence is the reverse complement of the A Sequence, and a C Sequence with a length between 6nt and 500nt.
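The degenerate nucleotide symbols recited in the UMI claims (N, B, D, H, V, S, W, R, Y, K, and M) coincide with the standard IUPAC ambiguity codes. As a minimal Python sketch, with hypothetical function names, the table below can be used to count how many distinct sequences a degenerate UMI pattern encodes, to enumerate them, and to test whether an observed sequence is consistent with the pattern.

```python
from itertools import product
from math import prod

# Base sets for each symbol, matching the mixtures listed in the claims.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "S": "CG", "W": "AT", "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
}


def diversity(pattern: str) -> int:
    """Number of distinct sequences encoded by a degenerate pattern."""
    return prod(len(IUPAC[symbol]) for symbol in pattern)


def expand(pattern: str):
    """Yield every concrete sequence encoded by the pattern."""
    for bases in product(*(IUPAC[symbol] for symbol in pattern)):
        yield "".join(bases)


def matches(pattern: str, sequence: str) -> bool:
    """True if the sequence could have been synthesized from the pattern."""
    return len(pattern) == len(sequence) and all(
        base in IUPAC[symbol] for symbol, base in zip(pattern, sequence)
    )
```

For example, a pattern of four N positions encodes 4^4 = 256 sequences, while three H positions encode 3^3 = 27.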
95. The method of claim 94, wherein each Identity Sequence comprises a UMI Sequence comprising a set of designed DNA sequences, wherein the UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
96. The method of claim 94 or 95, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with length between 5nt and 200nt, and a C Sequence of between 6nt and 500nt.
97. The method of claim 94 or 95, wherein the 1st Stopper comprises, from 5' to 3', a 1st Identity Sequence with a length between 5nt and 200nt, and a 1st A Sequence with a length between 3nt and 50nt, a Loop Sequence with a length between 3nt and 70nt, and, a B Sequence with a length between 3nt and 50nt, wherein the B Sequence is the reverse complement of the A Sequence, a C Sequence with a length between 6nt and 500nt.
98. The method of claim 97, wherein the 1st Identity Sequence comprises a 1st UMI Sequence comprising a set of designed DNA sequences, wherein the 1st UMI Sequence comprises degenerate nucleotides selected from N (mixture of A, C, G, and T), B (mixture of C, G, and T), D (mixture of A, G, and T), H (mixture of C, A, and T), V (mixture of A, C, and G), S (mixture of C and G), W (mixture of A and T), R (mixture of A and G), Y (mixture of T and C), K (mixture of G and T), and M (mixture of A and C).
99. The method of claim 95 or 98, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Hamming distance of 2, 3, 4, or 5.
100. The method of claim 95 or 98, wherein the UMI sequence is a mixture of 10 to 1000 defined DNA sequences with a minimum pairwise Levenshtein distance of 2, 3, 4, or 5.
101. The method of any one of claims 94-100, wherein the Target comprises a Binding Region that is reverse complementary to a corresponding C Sequence for each of the at least four Stoppers.
102. The method of claim 101, wherein the Target comprises a Match Region that is reverse complementary to a corresponding B sequence for each of the at least four Stoppers, wherein each Match Region is located to the 3' of the Binding Region corresponding to the same Stopper.
103. The method of claim 101, wherein the Target comprises a Match Region that is reverse complementary to a corresponding B Sequence.
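The minimum pairwise Hamming and Levenshtein distances recited for the defined UMI mixtures (for example in claims 99 and 100) can be computed for a candidate UMI set as sketched below in Python. The function names are hypothetical and the example set is invented for illustration; Hamming distance counts substitutions between equal-length sequences, while Levenshtein distance also counts insertions and deletions.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion from a
                current[j - 1] + 1,          # insertion into a
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def min_pairwise_distance(sequences, distance):
    """Smallest distance over all unordered pairs in the set."""
    return min(distance(a, b) for a, b in combinations(sequences, 2))


if __name__ == "__main__":
    umis = ["AAAA", "CCCC", "GGGG", "AACC"]  # invented example set
    print(min_pairwise_distance(umis, hamming))
    print(min_pairwise_distance(umis, levenshtein))
```

A set with minimum pairwise Hamming distance d tolerates up to d − 1 substitution errors before one UMI can be mistaken for another; the Levenshtein criterion extends the same protection to insertion and deletion errors.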
PCT/US2023/066813 2022-05-10 2023-05-10 Long-range dna sequencing through concatenating chimeric amplicon reads WO2023220621A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263340250P 2022-05-10 2022-05-10
US63/340,250 2022-05-10

Publications (1)

Publication Number Publication Date
WO2023220621A1 true WO2023220621A1 (en) 2023-11-16

Family

ID=88731094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/066813 WO2023220621A1 (en) 2022-05-10 2023-05-10 Long-range dna sequencing through concatenating chimeric amplicon reads

Country Status (1)

Country Link
WO (1) WO2023220621A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160068903A1 (en) * 2013-11-26 2016-03-10 Xiaochuan Zhou Selective Amplification of Nucleic Acid Sequences
US20210277461A1 (en) * 2020-03-06 2021-09-09 Singular Genomics Systems, Inc. Linked paired strand sequencing

Similar Documents

Publication Publication Date Title
US20220073909A1 (en) Methods and compositions for rapid nucleic library preparation
CN110191961B (en) Method for preparing asymmetrically tagged sequencing library
JP2020521486A (en) Single cell transcriptome amplification method
CN117778527A (en) Compositions and methods for identifying nucleic acid molecules
JP2016513461A (en) Prenatal genetic analysis system and method
JP2020517298A (en) Compositions and methods for library construction and sequence analysis
CA3128098A1 (en) Haplotagging - haplotype phasing and single-tube combinatorial barcoding of nucleic acid molecules using bead-immobilized tn5 transposase
EP3408406B1 (en) A novel y-shaped adaptor for nucleic acid sequencing and method of use
JP6924779B2 (en) Preparation of DNA sample by transposase random priming method
WO2019046768A1 (en) Symbolic squencing of dna and rna via sequence encoding
EP3004367A2 (en) Molecular barcoding for multiplex sequencing
US20170175182A1 (en) Transposase-mediated barcoding of fragmented dna
WO2020142631A2 (en) Quantitative amplicon sequencing for multiplexed copy number variation detection and allele ratio quantitation
US20220267848A1 (en) Detection and quantification of rare variants with low-depth sequencing via selective allele enrichment or depletion
WO2023220621A1 (en) Long-range dna sequencing through concatenating chimeric amplicon reads
WO2021222798A1 (en) Quantitative blocker displacement amplification (qbda) sequencing for calibration-free and multiplexed variant allele frequency quantitation
US20230250470A1 (en) Amplicon comprehensive enrichment
US20230340581A1 (en) Non-extensible oligonucleotides in dna amplification reactions
EP4330433A1 (en) Compositions and methods for chimeric amplicon formation
WO2024050553A1 (en) Methods for measuring telomere length

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23804471

Country of ref document: EP

Kind code of ref document: A1