WO2018031588A1 - Nucleic acid adaptors with molecular identification sequences and use thereof - Google Patents

Nucleic acid adaptors with molecular identification sequences and use thereof

Publication number
WO2018031588A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleic acid
population
adaptor
stem
double
Prior art date
Application number
PCT/US2017/045976
Other languages
French (fr)
Inventor
Fang Sun
Dmitry GORYUNOV
Konstantinos Charizanis
John LANGMORE
Emmanuel Kamberov
Original Assignee
Takara Bio Usa, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Takara Bio Usa, Inc.
Publication of WO2018031588A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844: Nucleic acid amplification reactions
    • C12Q1/6853: Nucleic acid amplification reactions using modified primers or templates
    • C12Q1/6855: Ligating adaptors

Definitions

  • Tagging DNA ends with additional short polynucleotide sequences is used in many areas of molecular biology, such as whole genome amplification and sequencing.
  • Barcodes can be used to identify nucleic acid molecules, for example, where sequencing can reveal a certain barcode coupled to a nucleic acid molecule of interest.
  • a sequence-specific event can be used to identify a nucleic acid molecule, where at least a portion of the barcode is recognized in the sequence-specific event, e.g., at least a portion of the barcode can participate in a ligation or extension reaction.
  • the barcode can therefore allow identification, selection or amplification of DNA molecules that are coupled thereto.
  • gDNA: genomic DNA
  • adaptors where at least one end of each fragment of genomic DNA is ligated to an adaptor including a barcode.
  • the ligated adaptors and gDNA fragments may be nick repaired, size selected, and amplified by PCR with primers directed to the adaptors to produce an amplified library.
  • ligation adaptors each including one of 16 different barcodes can be used to prepare 16 different gDNA samples, each with a unique barcode, such that either each sample can be amplified separately by PCR using the same PCR primers and then pooled (mixed together) or each sample can be pooled first and then simultaneously amplified using the same PCR primers.
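The pooling scheme described above relies on reading the sample barcode back out of each sequenced molecule. A minimal sketch of that demultiplexing step follows; it assumes the barcode sits at the start of the read and that all barcodes have the same length. The function name, the barcode sequences, and the sample names are hypothetical and are not taken from the disclosure.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Assign pooled reads to samples by their leading barcode.

    Assumption: every barcode has the same length and occupies the first
    bases of the read. Reads with an unknown barcode go to "unassigned".
    """
    barcode_len = len(next(iter(barcode_to_sample)))
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical 4-nt barcodes for two of the pooled samples.
pooled = ["ACGTTTTT", "TGCAGGGG", "NNNNAAAA"]
print(demultiplex(pooled, {"ACGT": "sample_1", "TGCA": "sample_2"}))
```

Because the barcode is shared by every molecule in a sample, this step separates samples but cannot separate individual molecules within one sample.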
  • Although there are many methods that use barcodes to tag samples, there remains an unmet need to tag individual molecules within the input DNA sample.
  • UMI: unique molecular identification
  • molecular tags are essential to identify and quantify different molecules in the same input sample that otherwise would be indistinguishable on the basis of sequence or other properties. Multiple sequences that have been amplified with the same molecular tag can be grouped together to remove artifacts that are created during the library amplification and sequencing processes.
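The grouping idea in the preceding paragraph can be sketched as follows: reads that carry the same molecular tag are collected into a family, and a per-position majority vote yields one consensus sequence per family, so an error present in a minority of the reads is discarded. This is only an illustrative sketch. The function names are hypothetical, reads within a family are assumed to be aligned and of equal length, and majority voting is one common consensus rule, not a method stated in the disclosure.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Per-position majority vote over equal-length sequences."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


def collapse_by_tag(tagged_reads):
    """Group (tag, sequence) pairs by tag and return one consensus per tag."""
    families = defaultdict(list)
    for tag, seq in tagged_reads:
        families[tag].append(seq)
    return {tag: consensus(seqs) for tag, seqs in families.items()}


reads = [
    ("AAC", "ACGT"),
    ("AAC", "ACGT"),
    ("AAC", "ACTT"),  # a single read with an amplification or sequencing error
    ("GGT", "ACGA"),
]
print(collapse_by_tag(reads))
```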
  • double-stranded nucleic acid adaptors comprising molecular identification sequences. Further provided are methods of using the double-stranded nucleic acid adaptors for generating nucleic acid libraries, e.g., for amplification and sequencing.
  • a first embodiment of the present disclosure provides a population of double-stranded nucleic acid adaptors for ligating to a population of nucleic acid target molecules, the double-stranded adaptors comprising a ligateable stem region having a terminal 5' end strand and a terminal 3' end strand; a non-ligateable stem region having a terminal 5' end strand and a terminal 3' end strand; and an asymmetric loop region between the ligateable stem region and the non-ligateable stem region, wherein the asymmetric loop region comprises a molecular identification sequence (MIS).
  • MIS: molecular identification sequence
  • the double-stranded nucleic acid adaptors are further defined as a single-stranded nucleic acid molecule that under ligation conditions forms a stem-loop adaptor having a distal loop region attached to the non-ligateable stem region.
  • the distal loop region comprises a non-replicable base.
  • the non-replicable base comprises an abasic site.
  • the abasic site comprises a 1',2'-dideoxyribose.
  • the non-replicable base comprises a deoxyuridine or a ribonucleotide base.
  • the non-ligateable stem region comprises a primer binding site.
  • the double-stranded adaptors within the population comprise a mixture of double-stranded adaptors with a first primer binding site and double-stranded adaptors with a second primer binding site.
  • the non-ligateable stem region comprises one or more mismatched bases.
  • the ligateable stem region further comprises a variable stem region defined as a region whose length varies among the members of the population.
  • the variable stem comprises 4-15 nucleotides.
  • the variable stem comprises 8-11 nucleotides.
  • the variable stem comprises 8, 9, 10, or 11 nucleotides.
  • the molecular identification sequence is unique to a subset of the population. In certain aspects, the molecular identification sequence is degenerate to a subset of the population. In further aspects, the molecular identification sequence is unique to a subset of the population and degenerate to another subset of the population.
  • In certain aspects, the asymmetric loop region is formed between the terminal 3' end strand of the ligateable strand and the terminal 5' end strand of the non-ligateable strand. In other aspects, the asymmetric loop region is formed between the terminal 5' end strand of the ligateable strand and the terminal 3' end strand of the non-ligateable strand.
  • the double-stranded nucleic acid adaptors comprise DNA. In certain aspects, the double-stranded nucleic acid adaptors comprise RNA. In some aspects, the double-stranded nucleic acid adaptors comprise DNA and RNA.
  • the population of nucleic acid target molecules comprise genomic DNA, fragmented DNA, cDNA, amplified DNA, or a nucleic acid library.
  • a gap region on the strand opposite the asymmetric loop region has a length of at least one nucleotide shorter than that of the asymmetric loop region. In some aspects, the length of the gap region is at least 2 nucleotides shorter than that of the asymmetric loop region. In certain aspects, the length of the gap region is less than 5 nucleotides. In particular aspects, the length of the gap region is 1 nucleotide. In specific aspects, the gap region has a length of a bond between two adjacent nucleotides. In some aspects, the gap region comprises a spacer incapable of base-pairing. In certain aspects, the spacer comprises an abasic site. In particular aspects, the abasic site comprises a 1',2'-dideoxyribose.
  • the asymmetric loop region is 4 to 16 nucleotides in length. In certain aspects, the asymmetric loop region is 5 to 8 nucleotides in length. In particular aspects, the asymmetric loop region is 6 nucleotides in length.
  • the molecular identification sequence comprises 5-10 nucleotides. In particular aspects, the molecular identification sequence comprises 6, 7, or 8 nucleotides. In specific aspects, the molecular identification sequence comprises 6 nucleotides. In some aspects, the molecular identification sequence is unique throughout the population. In further aspects, the molecular identification sequence is partially degenerate within the population. In certain aspects, the molecular identification sequence is degenerate within the population.
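To give a sense of scale for the MIS lengths listed above, the sketch below counts the distinct sequences available to a fully degenerate MIS and estimates how likely it is that at least two molecules receive the same MIS by chance. It assumes each position is drawn uniformly and independently from A, C, G, and T, which is an idealization; the function names are hypothetical.

```python
def mis_diversity(length):
    """Number of distinct sequences for a fully degenerate MIS of this length."""
    return 4 ** length


def collision_probability(n_molecules, length):
    """Probability that at least two of n molecules share an MIS.

    Assumption: every molecule draws its MIS uniformly at random and
    independently (the classic birthday-problem calculation).
    """
    diversity = mis_diversity(length)
    p_all_unique = 1.0
    for i in range(n_molecules):
        p_all_unique *= max(diversity - i, 0) / diversity
    return 1.0 - p_all_unique


for length in (5, 6, 8, 10):
    print(length, mis_diversity(length), round(collision_probability(50, length), 4))
```

Under these assumptions a 6-nucleotide MIS offers 4,096 sequences. Only molecules that would otherwise look identical (for example, fragments with the same endpoints) need to be told apart, so the relevant number of molecules is usually far smaller than the whole library.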
  • a 5' terminal end and/or 3' terminal end of the ligateable stem region comprise nucleotides having phosphorothioate linkages.
  • a 5' terminal end of the ligateable stem comprises a ligation block.
  • the ligation block is a dephosphorylated nucleotide, a 5' hydroxyl group, a dideoxy nucleotide, or an inverted dT.
  • the ligateable stem region further comprises one or more replication blocks or cleavable bases between a 3' terminal end or the 5' terminal end and the asymmetric loop or gap region.
  • the non-ligateable stem region further comprises one or more replication blocks or cleavable bases between the asymmetric loop or gap region and the distal loop region.
  • the cleavable base is inosine, uracil, or ribonucleotide.
  • a method for producing a library of adaptor-bound target nucleic acids comprising providing a population of target nucleic acid molecules and attaching to each end a double-stranded nucleic acid adaptor according to the embodiments, thereby generating a population of adapter-bound target nucleic acid molecules; and replacing one strand of the adaptor-bound target nucleic acid molecules by strand displacement or nick translation to make an exact copy of the asymmetric loop.
  • the resultant library of adaptor-bound target nucleic acids has an MIS domain on each strand of the resultant double stranded molecule, which results in the ability to determine, during amplification, which strands came from the same original molecule.
  • This ability may increase sequencing accuracy and the ability to identify/alleviate bias as compared to conventional methods in which a UMI (i.e., unique molecular identifier) is attached to only one strand.
  • the UMI containing strand gets amplified and if errors arise, it is not possible to determine whether such errors arise from all Crick strand errors or also Watson strand errors.
  • bias may be identified and, if identified, compensation for such bias may be made.
  • the methods may include determining amplification, e.g., PCR bias, e.g., by counting molecules, based on two MIS sequences from the same double stranded molecule (or two complementary MIS sequences from the same double-stranded molecule).
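One way to picture the strand-pairing step is sketched below. It assumes that reads derived from one strand report the MIS as written and reads derived from the other strand report its reverse complement, and that read counts per MIS have already been tallied for each strand. Those assumptions, the function names, and the example counts are hypothetical illustrations, not details taken from the disclosure.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def pair_strands(watson_counts, crick_counts):
    """Match each MIS seen on one strand to its reverse complement on the other.

    Returns {mis: (watson_reads, crick_reads)} for molecules observed on
    both strands. A strongly skewed pair of counts hints at amplification bias.
    """
    pairs = {}
    for mis, n_watson in watson_counts.items():
        partner = reverse_complement(mis)
        if partner in crick_counts:
            pairs[mis] = (n_watson, crick_counts[partner])
    return pairs


watson = {"AAGCTG": 10, "TTTTTT": 3}
crick = {"CAGCTT": 2, "GGGGGG": 5}
print(pair_strands(watson, crick))
```

In this toy input the first molecule is seen 10 times on one strand and twice on the other, which is the kind of imbalance that counting per strand is meant to expose.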
  • the adaptor-bound nucleic acid molecule comprises a nick having a 3' hydroxyl group.
  • the strand displacement or nick translation polymerization is further defined as polymerization that ceases at a non-replicable base or region in the asymmetric loop or in a region of the non-ligateable stem region adjacent to the asymmetric loop.
  • a further embodiment provides a library of adaptor-bound target nucleic acids produced by the methods of the embodiments.
  • the population of adaptor-bound target nucleic acid molecules comprises a first double-stranded nucleic acid adaptor with a first primer binding site attached on one end and a second double-stranded nucleic acid adaptor with a second primer binding site attached on the other end.
  • attaching is further defined as ligating. In some aspects, attaching is further defined as double strand ligation. In other aspects, attaching is further defined as single strand ligation. In particular aspects, attaching is further defined as ligating a double-stranded nucleic acid adaptor to complementary single strands. In specific aspects, attaching is further defined as blunt end ligation. In one particular aspect, attaching is further defined as ligation to an overhang.
  • a method for producing a library of adaptor-bound target nucleic acids comprising providing a population of target nucleic acid molecules and attaching to one end a double-stranded nucleic acid adaptor according to the embodiments, thereby generating a population of adapter-bound target nucleic acid molecules; and displacing one strand of the adaptor-bound target nucleic acid molecules by strand displacement or nick translation, such that a complementary copy of the asymmetric loop is incorporated into the replaced strand.
  • the adaptor-bound nucleic acid molecule comprises a nick having a 3' hydroxyl group.
  • the strand displacement or nick translation polymerization is further defined as polymerization that ceases at a non-replicable base or region in the asymmetric loop or in a region of the non-ligateable stem adjacent to the asymmetric loop.
  • a further embodiment provides a library of adaptor-bound target nucleic acids produced by the methods of the embodiments.
  • attaching is further defined as ligating. In some aspects, attaching is further defined as double strand ligation. In other aspects, attaching is further defined as single strand ligation. In particular aspects, attaching is further defined as ligating a double-stranded nucleic acid adaptor to complementary single strands. In specific aspects, attaching is further defined as blunt end ligation. In one particular aspect, attaching is further defined as ligation to an overhang.
  • Another embodiment provides a method for producing a library of adaptor-bound target nucleic acids comprising providing a population of target nucleic acid molecules, attaching to one end a first double-stranded nucleic acid adaptor according to the embodiments, and attaching to the other end a second double-stranded nucleic acid adaptor optionally comprising a MIS, thereby generating a population of adapter-bound target nucleic acid molecules; and replacing one strand of the adaptor-bound target nucleic acid molecules by strand displacement or nick translation such that a complementary copy of the second strand is incorporated into the replaced strand.
  • the adaptor-bound nucleic acid molecule comprises a nick having a 3' hydroxyl group.
  • the strand displacement or nick translation polymerization is further defined as polymerization that ceases at a non-replicable base or region in the loop or in a region of the stem adjacent to the loop.
  • the second double-stranded nucleic acid adaptor does not comprise a MIS. In certain aspects, the second double-stranded nucleic acid adaptor does not comprise an asymmetric loop.
  • the first double-stranded nucleic acid adaptor and/or second double-stranded nucleic acid adaptor are stem-loop adaptors.
  • the first double-stranded nucleic acid adaptor comprises a first primer binding site in the non-ligateable stem region and the second double-stranded nucleic acid adaptor comprises a second primer binding site in the non-ligateable stem region.
  • attaching is further defined as ligating. In some aspects, attaching is further defined as double strand ligation. In other aspects, attaching is further defined as single strand ligation. In particular aspects, attaching is further defined as ligating a double-stranded nucleic acid adaptor to complementary single strands. In specific aspects, attaching is further defined as blunt end ligation. In one particular aspect, attaching is further defined as ligation to an overhang.
  • the method further comprises preventing MIS switching.
  • the method further comprises preventing MIS switching by contacting the library of adaptor-bound target nucleic acids and excess double-stranded nucleic acid adaptors with terminal deoxyribonucleotidyl transferase (TdT).
  • the method further comprises preventing MIS switching by performing PCR purification on the adaptor-bound target nucleic acids and excess double-stranded nucleic acid adaptors.
  • the first or second double-stranded nucleic acid adaptors comprise one or more uracils within the ligateable stem region.
  • the method further comprises contacting the library of adaptor-bound target nucleic acids and excess double-stranded nucleic acid adaptors with a uracil excision reagent, e.g., a combination of Uracil DNA glycosylase (UDG) and the DNA glycosylase-lyase Endonuclease VIII (such as the USER™ enzyme).
  • the method further comprises contacting the library and excess adaptors with exonuclease I and incubating at a non-denaturing temperature prior to performing PCR.
  • FIG. 1A-1D Schematic depicting designs of non-degradable bud adaptors (YB) comprising a molecular identification sequence (MIS) in an asymmetric loop, an abasic site (e.g., 1',2'-Dideoxyribose (idSp)) opposite the MIS to facilitate correct adaptor folding, abasic sites in the distal loop to function as polymerase terminators, a variable-length stem to mediate ligation to nucleic acid molecules and provide base diversity, and phosphorothioate bonds (*) to prevent degradation through exonuclease activity and decrease adaptor dimer formation.
  • FIG. 1B Schematic depicting design of non-degradable bubble design (BB) comprising a MIS in a symmetric loop, abasic site(s) opposite the MIS, abasic sites in the main loop, and phosphorothioate bonds.
  • FIG. 1C Schematic depicting design of degradable bubble with a self-complementary byproduct (RB) formed by "fold-back" synthesis that replicates the non-degraded bubble.
  • FIG. 1D Exemplary sequences of dephased stem region.
  • FIG. 3A-3B Data from a NextSeq™ sequencing run of libraries prepared with 6 different designs of bud adaptors with a 5' phosphorothioate bond. Sequencing was analyzed with the Illumina® sample barcode trimmed and the additional adaptor sequence including the molecular barcode remaining. Run480 Trimmed was mapped to the human genome and metrics measured after manually trimming the entire molecular barcode and adaptor sequence.
  • FIG. 3B Sequencing results as Q scores of the individual nucleotides starting with the MIS, proceeding across the ligatable stem and into the gDNA region, using bud adaptors comprised of two different ligatable stem lengths.
  • FIG. 4 Amplification curves of libraries using the bud adaptor with or without a 5' phosphorothioate bond (5PT) and various stem lengths (e.g., 8, 10, and 12 nucleotides). (NTC: No template control).
  • FIG. 5A-5C Sequencing data from a run including bud adaptors with variable length stems.
  • FIG. 5B-5C Input titration of 4-stem dephased bud adaptors (trimmed (FIG. 5B) and untrimmed (FIG. 5C)).
  • a significant problem in next generation sequencing (NGS) is distinguishing duplicate sequencing reads.
  • Duplicates can be broadly categorized into two families: amplification duplicates and biological duplicates.
  • Amplification duplicates also known as PCR duplicates, arise as a consequence of amplification during library preparation prior to sequencing, or are generated on the flow cell of the sequencer.
  • Biological duplicates or molecular duplicates may be the result of the generation of two identical DNA or RNA fragments arising from biological means such as the generation of multiple short, identical mRNA molecules, or as a result of random fragmentation or enzymatic fragmentation of a number of copies of the genome.
  • individual source molecules may be distinguished from one another, as each will have different unique identifiers.
  • errors arising from PCR or sequencing can be informatically determined, and differentiated from true biological mutations. Both types of data can then be used for interpretation of sequencing data.
  • the present disclosure overcomes challenges associated with current technologies by providing double-stranded nucleic acid adaptors, such as stem-loop nucleic acid adaptors, with molecular identification sequences (i.e., MIS).
  • These double-stranded nucleic acid adaptors can be used to tag individual template fragments with an MIS to create libraries which can be amplified and sequenced.
  • the MIS can be used to identify, distinguish, and use duplicated sequences arising from both the biological source and subsequent amplifications.
  • the MIS allow for the differentiation between PCR duplicates and true biological duplicates, enabling PCR error correction and quantitative detection of low-frequency alleles with high statistical confidence.
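The distinction between the two duplicate families can be sketched with a simple rule: reads that share both a mapping position and an MIS are treated as amplification duplicates of one source molecule, while reads that share a position but carry different MIS are treated as distinct source molecules. The sketch assumes exact MIS matching (no tolerance for sequencing errors in the MIS itself) and a single integer position per read; the function name and data are hypothetical.

```python
from collections import defaultdict


def classify_duplicates(reads):
    """Count source molecules per position and amplification duplicates overall.

    reads: iterable of (position, mis) pairs.
    Returns ({position: distinct source molecules}, amplification duplicate count).
    """
    seen = defaultdict(set)
    pcr_duplicates = 0
    for position, mis in reads:
        if mis in seen[position]:
            pcr_duplicates += 1  # same position and same MIS: amplification copy
        else:
            seen[position].add(mis)  # new MIS at this position: new source molecule
    source_molecules = {position: len(tags) for position, tags in seen.items()}
    return source_molecules, pcr_duplicates


reads = [(100, "AAC"), (100, "AAC"), (100, "GGT"), (250, "AAC")]
print(classify_duplicates(reads))
```

Without the MIS, the three reads at position 100 would all look like copies of one fragment; with it, two source molecules are counted there.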
  • the adaptors may comprise replication stops such as non-replicable bases which can be used to stop replication at specific locations in the adaptor.
  • the unreacted stem-loop adaptors are self-complementary and therefore usually unable to participate in unwanted priming reactions that might otherwise lead to MIS switching.
  • Adaptors provided herein may be degradable or non-degradable adaptors.
  • the adaptors have an MIS, which may be located in an asymmetric loop (also referred to herein as a "bud") on either stem strand.
  • the nucleotides between the loops may comprise a mismatched base.
  • Across from the asymmetric loop may be an unpaired gap region which can comprise only the bond between adjacent nucleotides, unpaired nucleotides, or at least one non-replicable base or spacer, e.g., to reduce barcode bias, allow for correct folding of the adaptor and/or to prevent collapse of the asymmetric loop structure.
  • non-replicable bases include, but are not limited to: abasic lesions (e.g., a tetrahydrofuran derivative, 1',2'-Dideoxyribose (idSp), etc.); nucleotide adducts; iso-nucleotide bases (e.g., isocytosine, isoguanine, and the like), and any combination thereof, etc.
  • the non-replicable bases can also be present in the distal loop of the stem-loop adaptor, e.g., to function as polymerase terminators.
  • the stem-loop adaptor can also have phosphorothioate bonds on the 3' terminal end and/or the 5' terminal end to protect the adaptor from degradation by proofreading enzymes and prevent adaptor dimers to optimize the signal-to-noise ratio.
  • the adaptors can have variable stem lengths to provide sufficient base diversity in the library at the beginning of the read for cluster detection and intensity correction, such as by RTA software, without the use of a control nucleic acid library, such as the PhiX control nucleic acid library.
  • the variable stems can add further unique data which can add to the level of barcoding, e.g., where the variable stem may, where desired, be employed in combination with an MIS domain to provide a unique molecular barcode. Diversity in the stems allows for low amplification background and the generation of fewer unmapped reads during sequencing.
  • the variable stem lengths also provide extra unique sequence information for informatic analysis of sequencing data.
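The effect of variable stem lengths on base diversity can be illustrated by counting how many distinct bases appear at each sequencing cycle. In the sketch below, a fixed stem gives a single base at every early cycle, while stems of staggered length shift ("dephase") the sequence so that several bases are seen per cycle. The stem sequences are made up for illustration and the function name is hypothetical; real reads would also contain the MIS and the insert.

```python
def bases_per_cycle(reads, n_cycles):
    """Number of distinct bases observed at each of the first n_cycles positions."""
    return [
        len({read[i] for read in reads if i < len(read)})
        for i in range(n_cycles)
    ]


# Hypothetical stems: one fixed design versus four staggered lengths.
fixed_stem_reads = ["GATC"] * 4
staggered_stem_reads = ["GATC", "AGATC", "CAGATC", "TCAGATC"]

print(bases_per_cycle(fixed_stem_reads, 4))      # one base per cycle
print(bases_per_cycle(staggered_stem_reads, 4))  # several bases per cycle
```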
  • a population of adaptors may include a common barcode domain, e.g., a region or sequence of nucleic acids that is common or the same among the population of adaptors and serves as a barcode or identifier for a source of target nucleic acids to which the adaptors are ligated during use.
  • a barcode domain may serve as an identifier of a sample from which the target nucleic acids are obtained, such that it may be viewed as a sample barcode.
  • the barcode domain may be positioned at any convenient location in the adaptor, such as the non-ligateable stem region of the adaptor, the asymmetric loop of the adaptor, etc.
  • the barcode domain may be combined with the MIS domain, e.g., such that the adaptors include a barcode/MIS domain.
  • a barcode/MIS domain is made up of a series of interspersed barcode and MIS bases.
  • interspersed is meant that the bases which are barcode bases (i.e., the bases that collectively make up the barcode component of a barcode/MIS domain) are distributed or positioned among MIS bases (i.e., the bases that collectively make up the MIS domain of a barcode/MIS domain).
  • a given barcode/MIS domain is one that includes at least one MIS base positioned adjacent to at least one barcode base, where in those instances in which the barcode/MIS domain is made up of 3 or more bases, at least two bases of a first type (e.g., MIS or barcode) may be separated by at least one base of another type (e.g., MIS or barcode).
  • the length of a given barcode/MIS domain may vary, ranging in some instances from 4 to 50 nts, where in some instances the length ranges from 5 to 25 nts, e.g., 6 to 20 nts, where specific lengths of interest include, but are not limited to: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16 nts.
  • the barcode/MIS domain may be positioned at any convenient location in the adaptor, such as the non-ligateable stem region of the adaptor, the asymmetric loop of the adaptor, etc.
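An interspersed barcode/MIS domain can be pictured as a sequence plus a layout that says which positions are barcode bases and which are MIS bases. The sketch below separates the two components given such a layout. The layout notation ("B" for a barcode position, "M" for an MIS position), the function name, and the example sequence are hypothetical conventions introduced only for illustration.

```python
def split_domain(domain, layout):
    """Split a barcode/MIS domain into its barcode and MIS components.

    layout uses "B" for barcode positions and "M" for MIS positions and
    must be the same length as the domain.
    """
    if len(domain) != len(layout):
        raise ValueError("domain and layout must have the same length")
    if set(layout) - {"B", "M"}:
        raise ValueError("layout may contain only 'B' and 'M'")
    barcode = "".join(base for base, kind in zip(domain, layout) if kind == "B")
    mis = "".join(base for base, kind in zip(domain, layout) if kind == "M")
    return barcode, mis


# A 6-nt domain with alternating barcode and MIS positions.
print(split_domain("ACGTAC", "BMBMBM"))
```

The barcode component is then compared against the known sample barcodes, while the MIS component is used to tell individual molecules apart.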
  • the present disclosure provides methods for the production of libraries using the nucleic acid adaptors disclosed herein which are compatible with the major sequencing platforms including, but not limited to, Illumina's MiSeq™, NextSeq, and HiSeq™ using conventional flow cells. Libraries are also compatible with hybridization capture target enrichment platforms, such as those manufactured by Agilent and NimbleGen. Libraries produced with the non-degradable bud stem-loop adaptors performed well across a wide target input range, including low inputs which can be difficult to amplify and sequence.
  • the template fragments can include DNA, such as cell-free DNA (cfDNA) isolated from plasma, urine, or cerebrospinal fluid samples or isolated genomic DNA (gDNA) which has been subjected to fragmentation, or RNA.
  • the template fragments are end repaired and at least one end of each template fragment is ligated to a stem-loop adaptor.
  • the template fragment may have an adaptor with a MIS on one end and an adaptor without an MIS on the other end.
  • the template fragment may have an adaptor with an MIS on both ends.
  • a stem-loop adaptor comprising a first primer binding site in the non-ligateable stem is ligated to one end of a template nucleic acid and a stem-loop adaptor comprising a second primer binding site in the non-ligateable stem is ligated to the other end of the template nucleic acid.
  • the two stem-loop adaptors may comprise different sequences or comprise a common sequence to promote suppression of amplification of short ligation products, i.e., adaptor dimers or short inserts that will be suppressed to reduce background from the unwanted adaptor dimers or short gDNA fragments.
  • the use of two stem-loop adaptors prevents suppression of amplification that is inherent to the use of a single adaptor, which facilitates sequencing on sequencing platforms such as those manufactured by Illumina.
  • the suppression prevents molecules which have two copies of the first adapter with the first primer binding site or two copies of the second adapter with the second primer binding site from amplification.
  • MIS switching is non-specific replacement of a MIS, originally assigned to a target molecule by the attachment chemistry of a library preparation. MIS switching may occur during PCR amplification, where residual adaptors or their byproducts carried over into the PCR step act as primers that randomly replace the original MIS with a different (non-specific) MIS.
  • TdT: terminal deoxyribonucleotidyl transferase
  • TdT is then inactivated and will not interfere with PCR.
  • AMPure® clean-up can be used after ligation and prior to PCR to remove unreacted adaptors.
  • Another method to reduce MIS switching involves inactivation of unligated stem-loop adaptors.
  • the stem-loop adaptor comprises one or more uracil residues within the terminal 5' end strand of the ligateable stem region or within the terminal 3' end strand of the non-ligateable stem region which are converted to abasic sites and degraded by a suitable enzyme, such as USER™ (Uracil-Specific Excision Reagent), to produce several short, single-stranded oligonucleotide products from the terminal 5' end strand of the ligateable stem region or the terminal 3' end strand of the non-ligateable stem region as well as an intact single-stranded oligonucleotide from the terminal 3' end strand of the ligateable stem region or the terminal 5' end strand of the non-ligateable stem region.
  • Exonuclease I is then added to the PCR premix, which is incubated at a temperature too low to denature the DNA; thus, the short fragments from the unreacted stem-loop adaptors and adaptor dimers are degraded.
  • MIS switching is reduced by adding one or more uracils, such as 2, 3, or 4 uracils, near the 3' terminal end of the ligateable stem region of a first adaptor (e.g., the adaptor with a first primer binding site) having the MIS.
  • the second adaptor does not have a MIS and has the second primer binding site.
  • the residual activities of the repair enzymes will extend both 3' ends of the gDNA to make copies of the 5' ends of the adaptors even in the presence of uracil in one of the adaptors.
  • Upon addition of the PCR polymerase, the 5' end of the second adaptor will not be replicated due to the uracil in the template; however, the 5' end of the first adaptor will be replicated to make a copy of the MIS.
  • One strand will be replicated normally by PCR as it does not have uracil at its 5' end, and the second strand will not be PCR amplified as the PCR polymerase cannot read through uracil.
  • This method of making a nucleic acid library from one of the two strands overcomes challenges associated with duplex sequencing. Duplex sequencing requires enough reads from the forward and reverse strand to get accurate sequencing from both.
  • the sequencing resources can instead be devoted to sequencing the single strand more deeply or sequencing a strand from a different gDNA molecule.
  • this method can be used in cases where very deep sequencing is not possible or when small insertions or deletions and translocations are being detected.
  • un-ligated adaptor is removed by a 3' exonuclease active on both blunt and recessed 3' ends (e.g., E. coli exonuclease III), or a combination of a 5' exonuclease active on blunt and recessed 5' ends (e.g., E. coli exonuclease VIII) and a 3' exonuclease active on 3' protruding ends (e.g., exonuclease T).
  • the 5' exonuclease can expose a 3' extension that can be a substrate for the 3' exonuclease.
  • when adaptors of the disclosure are fully ligated, they may not have any free ends and therefore may be protected from cleavage by the exonucleases.
  • a 3'-protected, un-extendable blocker oligonucleotide can be added to the amplification reaction.
  • the blocker oligonucleotide can be at least 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 or more nucleotides in length.
  • the blocker oligonucleotide can, in some instances, be at most 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 or more nucleotides in length.
  • the blocker oligonucleotide can be fully complementary to the 3' end of the 3' strand of the adaptor.
  • the blocker oligonucleotide is fully complementary to the adaptor stem sequence (e.g., dephase stem). An excess of such blocker may prevent priming by the free adaptor yet may not prevent priming by the PCR primer, which does not contain the stem sequence and is therefore still able to anneal to and prime target nucleic acid molecules.
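As a minimal sketch of the blocker design described above, the following Python derives a sequence fully complementary to an adaptor stem by taking its reverse complement. The function names are hypothetical, the stem `TGAGCTAC` is borrowed from the example dephased stems listed elsewhere in this document, and the 3' protecting group is a chemical modification that the code does not model.

```python
# Sketch only: names are illustrative and not taken from the patent.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def design_blocker(stem: str) -> str:
    """Return a blocker sequence (5'->3') fully complementary to the stem.

    The 3' end of a real blocker would additionally carry a protecting
    group so that it cannot be extended; that is not represented here.
    """
    return reverse_complement(stem)


if __name__ == "__main__":
    stem = "TGAGCTAC"  # example stem sequence, assumed for illustration
    print(design_blocker(stem))  # GTAGCTCA
```

Because the blocker is defined purely by the stem sequence, a PCR primer that lacks the stem sequence would not be matched by it, which is consistent with the behaviour the text describes.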
  • the 3' terminal end of the ligateable stem region of the double-stranded nucleic acid adaptor is ligated to the 5' phosphate of the target fragment leaving a nick between the 3' end of target fragment and the 5' terminal end of the ligateable region of the double-stranded nucleic acid adaptor.
  • Polymerase extension is then performed on the adaptor-bound template by extending the 3' end of the template fragment end toward the end of the double-stranded nucleic acid adaptor, copying the molecular identification sequence, during strand displacement or nick translation.
  • the adaptor-bound target fragments are then amplified to create libraries and may then be sequenced.
  • the MIS adaptors described herein can be used for amplification and/or sequencing, such as to generate nucleic acid libraries for sequencing.
I. Definitions
  • Nucleotide is a term of art that refers to a base-sugar-phosphate combination. Nucleotides are the monomeric units of nucleic acid polymers, i.e., of DNA and RNA. The term includes ribonucleotide triphosphates, such as rATP, rCTP, rGTP, or rUTP, and deoxyribonucleotide triphosphates, such as dATP, dCTP, dUTP, dGTP, or dTTP.
  • a “nucleoside” is a base-sugar combination, i.e., a nucleotide lacking a phosphate. It is recognized in the art that there is a certain interchangeability in usage of the terms nucleoside and nucleotide.
  • the nucleotide deoxyuridine triphosphate, dUTP, is a deoxyribonucleoside triphosphate. After incorporation into DNA, it serves as a DNA monomer, formally being deoxyuridylate, i.e., dUMP or deoxyuridine monophosphate.
  • one may say that one incorporates deoxyuridine into DNA even though that is only a part of
  • nucleic acid or “polynucleotide” will generally refer to at least one molecule or strand of DNA, RNA, DNA-RNA chimera or a derivative or analog thereof, comprising at least one nucleobase, such as, for example, a naturally occurring purine or pyrimidine base found in DNA (e.g. adenine "A,” guanine “G,” thymine “T” and cytosine “C”) or RNA (e.g. A, G, uracil "U” and C).
  • nucleic acid encompasses the terms “oligonucleotide” and “polynucleotide.”
  • oligonucleotide refers to at least one molecule of between about 3 and about 100 nucleobases in length.
  • polynucleotide refers to at least one molecule of greater than about 100 nucleobases in length.
  • a nucleic acid may encompass at least one double-stranded molecule or at least one triple-stranded molecule that comprises one or more complementary strand(s) or "complement(s)" of a particular sequence comprising a strand of the molecule.
  • a single stranded nucleic acid may be denoted by the prefix "ss”, a double-stranded nucleic acid by the prefix "ds”, and a triple stranded nucleic acid by the prefix "ts.”
  • nucleic acid molecule or “nucleic acid target molecule” refers to any single-stranded or double-stranded nucleic acid molecule including standard canonical bases, hypermodified bases, non-natural bases, or any combination of the bases thereof.
  • the nucleic acid molecule contains the four canonical DNA bases - adenine, cytosine, guanine, and thymine, and/or the four canonical RNA bases - adenine, cytosine, guanine, and uracil. Uracil can be substituted for thymine when the nucleoside contains a 2'-deoxyribose group.
  • the nucleic acid molecule can be transformed from RNA into DNA and from DNA into RNA.
  • mRNA can be converted into complementary DNA (cDNA) using reverse transcriptase and DNA can be transcribed into RNA using RNA polymerase.
  • a nucleic acid molecule can be of biological or synthetic origin. Examples of nucleic acid molecules include genomic DNA, cDNA, RNA, a DNA/RNA hybrid, amplified DNA, a pre-existing nucleic acid library, etc.
  • a nucleic acid may be obtained from a human sample, such as blood, serum, plasma, cerebrospinal fluid, cheek scrapings, biopsy, semen, urine, feces, saliva, sweat, etc.
  • a nucleic acid molecule may be subjected to various treatments, such as repair treatments and fragmenting treatments. Fragmenting treatments include mechanical, sonic, and hydrodynamic shearing. Repair treatments include nick repair via extension and/or ligation, polishing to create blunt ends, removal of damaged bases, such as deaminated, derivatized, abasic, or crosslinked nucleotides, etc.
  • a nucleic acid molecule of interest may also be subjected to chemical modification (e.g., bisulfite conversion, methylation / demethylation), extension, amplification (e.g., PCR, isothermal, etc.), etc.
  • Analogous forms of purines and pyrimidines are well known in the art, and include, but are not limited to aziridinylcytosine, 4-acetylcytosine, 5-fluorouracil, 5-bromouracil, 5-carboxymethylaminomethyl-2-thiouracil, 5-carboxymethylaminomethyluracil, inosine, N6-isopentenyladenine, 1-methyladenine, 1-methylpseudouracil, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-methyladenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5-methoxyuracil, 2-methylthio-N6-isopentenyladen
  • the nucleic acid molecule can also contain one or more hypermodified bases, for example and without limitation, 5-hydroxymethyluracil, 5-hydroxyuracil, α-putrescinylthymine, 5-hydroxymethylcytosine, 5-hydroxycytosine, 5-methylcytosine, N4-methylcytosine, 2-aminoadenine, α-carbamoylmethyladenine, N6-methyladenine, inosine, xanthine, hypoxanthine, 2,6-diaminopurine, and N7-methylguanine.
  • the nucleic acid molecule can also contain one or more non-natural bases, for example and without limitation, 7-deaza-7-hydroxymethyladenine, 7-deaza-7-hydroxymethylguanine, isocytosine (isoC), 5-methylisocytosine, and isoguanine (isoG).
  • the nucleic acid molecule containing only canonical, hypermodified, or non-natural bases, or any combination of the bases thereof, can also contain, for example and without limitation, linkages between nucleotide residues that each consist of a standard phosphodiester linkage and, in addition, may contain one or more modified linkages, for example and without limitation, substitution of the non-bridging oxygen atom with a nitrogen atom (i.e., a phosphoramidate linkage), a sulfur atom (i.e., a phosphorothioate linkage), or an alkyl or aryl group (i.e., alkyl or aryl phosphonates), substitution of the bridging oxygen atom with a sulfur atom (i.e., phosphorothiolate), substitution of the phosphodiester bond with a peptide bond (i.e., peptide nucleic acid or PNA), or formation of one or more additional covalent bonds (i.e., locked nucleic acid or LNA), which has an
  • Nucleic acid(s) that are “complementary” or “complement(s)” are those that are capable of base-pairing according to the standard Watson-Crick, Hoogsteen or reverse Hoogsteen binding complementarity rules.
  • the term “complementary” or “complement(s)” may refer to nucleic acid(s) that are substantially complementary, as may be assessed by the same nucleotide comparison set forth above.
  • substantially complementary may refer to a nucleic acid comprising at least one sequence of consecutive nucleobases, or semiconsecutive nucleobases if one or more nucleobase moieties are not present in the molecule, that is capable of hybridizing to at least one nucleic acid strand or duplex even if fewer than all nucleobases base pair with a counterpart nucleobase.
  • a "substantially complementary" nucleic acid contains at least one sequence in which about 70%, about 71%, about 72%, about 73%, about 74%, about 75%, about 76%, about 77%, about 78%, about 79%, about 80%, about 81%, about 82%, about 83%, about 84%, about 85%, about 86%, about 87%, about 88%, about 89%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, to about 100%, and any range therein, of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization.
  • the term “substantially complementary” refers to at least one nucleic acid that may hybridize to at least one nucleic acid strand or duplex in stringent conditions.
  • a “partially complementary” nucleic acid comprises at least one sequence that may hybridize in low stringency conditions to at least one single or double-stranded nucleic acid, or contains at least one sequence in which less than about 70% of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization.
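The percentage-based part of these definitions can be illustrated with a short Python sketch. It assumes two equal-length strands aligned antiparallel without gaps and counts only Watson-Crick pairs; Hoogsteen pairing and hybridization stringency, which the definitions also mention, are not modelled. The function names and the exact 70% cut-off (the text says "about 70%") are illustrative assumptions.

```python
# Sketch only: a gapless, equal-length comparison using Watson-Crick pairs.
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"),
                ("A", "U"), ("U", "A")}


def percent_paired(strand: str, other: str) -> float:
    """Percent of positions in `strand` that pair with antiparallel `other`."""
    if not strand or len(strand) != len(other):
        raise ValueError("strands must be non-empty and of equal length")
    partner = other.upper()[::-1]  # read the second strand 3'->5'
    paired = sum((a, b) in WATSON_CRICK for a, b in zip(strand.upper(), partner))
    return 100.0 * paired / len(strand)


def classify(strand: str, other: str, threshold: float = 70.0) -> str:
    """Label a pair of strands using the approximate 70% boundary."""
    if percent_paired(strand, other) >= threshold:
        return "substantially complementary"
    return "partially complementary"


if __name__ == "__main__":
    print(percent_paired("TGAGCTAC", "GTAGCTCA"))  # 100.0
    print(classify("AAAAAAAAAA", "TTTTTTCCCC"))    # partially complementary
```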
  • Oligonucleotide refers collectively and interchangeably to two terms of art, “oligonucleotide” and “polynucleotide.” Note that although oligonucleotide and polynucleotide are distinct terms of art, there is no exact dividing line between them and they are used interchangeably herein.
  • the term “adaptor” may also be used interchangeably with the terms “oligonucleotide” and “polynucleotide.”
  • Amplification refers to any in vitro process for increasing the number of copies of a nucleotide sequence or sequences. Nucleic acid amplification results in the incorporation of nucleotides into DNA or RNA. As used herein, one amplification reaction may consist of many rounds of DNA replication. For example, one PCR reaction may consist of 30-100 "cycles" of denaturation and replication.
  • “Polymerase chain reaction,” or “PCR,” means a reaction for the in vitro amplification of specific DNA sequences by the simultaneous primer extension of complementary strands of DNA.
  • PCR is a reaction for making multiple copies or replicates of a target nucleic acid flanked by primer binding sites, such reaction comprising one or more repetitions of the following steps: (i) denaturing the target nucleic acid, (ii) annealing primers to the primer binding sites, and (iii) extending the primers by a nucleic acid polymerase in the presence of nucleoside triphosphates.
  • the reaction is cycled through different temperatures optimized for each step in a thermal cycler instrument.
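To give a feel for what repeated cycles of denaturation, annealing, and extension imply, here is a minimal Python sketch of the textbook amplification model, in which each cycle multiplies the copy number by (1 + efficiency). The model, the function name, and the `efficiency` parameter are assumptions added for illustration and are not stated in this document; real reactions plateau well before high cycle numbers.

```python
# Sketch only: idealized exponential model, ignoring plateau effects.
def amplified_copies(initial_copies: float, cycles: int,
                     efficiency: float = 1.0) -> float:
    """Expected copies after `cycles`, assuming constant per-cycle efficiency."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # One template molecule, 30 cycles of perfect doubling.
    print(amplified_copies(1, 30))
```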
  • Primer means an oligonucleotide, either natural or synthetic that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3' end along the template so that an extended duplex is formed.
  • the sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide.
  • primers are extended by a DNA polymerase.
  • Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on, more typically in the range of 18-40, 20-35, or 21-30 nucleotides long, and any length between the stated ranges.
  • Typical primers can be in the range of between 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges.
  • the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length.
  • stem-loop oligonucleotide refers to a structure formed by an oligonucleotide comprised of 5' and 3' terminal regions, which are intramolecular inverted repeats that form a double-stranded stem, and a non-self-complementary central region, which forms a single-stranded loop.
  • the stem-loop oligonucleotide further comprises a second or third single-stranded loop, such as within the 5' stem and/or the 3' stem.
  • An "asymmetric loop” refers to a single-stranded loop on only one stem strand with a "gap region" of unpaired bases across from the asymmetric loop.
  • non-complementary refers to nucleic acid sequence that lacks the ability to form at least one Watson-Crick base pair through specific hydrogen bonds.
  • “Cleavable base,” as used herein, refers to a nucleotide that is generally not found in a sequence of DNA.
  • deoxyuridine is an example of a cleavable base.
  • dUTP triphosphate form of deoxyuridine
  • the resulting deoxyuridine is promptly removed in vivo by normal processes, e.g., processes involving the enzyme uracil-DNA glycosylase (UDG) (U.S. Patent No. 4,873,192; Duncan, 1981; both references incorporated herein by reference in their entirety).
  • deoxyuridine occurs rarely or never in natural DNA.
  • Non-limiting examples of other cleavable bases include deoxyinosine, bromodeoxyuridine, 7-methylguanine, 5,6-dihydro-5,6-dihydroxydeoxythymidine, 3-methyldeoxyadenosine, etc. (see, Duncan, 1981).
  • Other cleavable bases will be evident to those skilled in the art.
  • degenerate refers to a nucleotide or series of nucleotides wherein the identity can be selected from a variety of choices of nucleotides, as opposed to a defined sequence. In specific embodiments, there can be a choice from two or more different nucleotides. In further specific embodiments, the selection of a nucleotide at one particular position comprises selection from only purines, only pyrimidines, or from non-pairing purines and pyrimidines.
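One common way to write degenerate positions is with IUPAC ambiguity codes, which this document does not itself use; the sketch below adopts them as an assumption. R (A or G) restricts a position to purines, Y (C or T) to pyrimidines, K (G or T) and M (A or C) to a purine and a pyrimidine that do not pair with each other, and N allows any base. The function names are hypothetical.

```python
# Sketch only: a subset of the standard IUPAC nucleotide ambiguity codes.
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purines only
    "Y": "CT",    # pyrimidines only
    "K": "GT",    # non-pairing purine/pyrimidine
    "M": "AC",    # non-pairing purine/pyrimidine
    "N": "ACGT",  # any base
}


def expand(pattern: str) -> list[str]:
    """List every defined sequence encoded by a degenerate pattern."""
    choices = (IUPAC[code] for code in pattern.upper())
    return ["".join(bases) for bases in product(*choices)]


def count_sequences(pattern: str) -> int:
    """Number of defined sequences, without enumerating them."""
    total = 1
    for code in pattern.upper():
        total *= len(IUPAC[code])
    return total


if __name__ == "__main__":
    print(expand("RY"))               # ['AC', 'AT', 'GC', 'GT']
    print(count_sequences("NNNNNN"))  # 4096
```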
  • non-replicable base refers to a position at which polymerization ceases.
  • the non-replicable base or sequence may comprise an abasic site or sequence, hexaethylene glycol, and/or a bulky chemical moiety attached to the sugar-phosphate backbone or the base.
  • an "abasic site” lacks a base at a position in the oligonucleotide, i.e., the sugar residue is present at the position in the probe, but the purine or pyrimidine (nucleobase) group has been removed or replaced.
  • One or more abasic sites may become incorporated into one or more locations in an oligonucleotide.
  • ligase refers to an enzyme that is capable of joining the 3' hydroxyl terminus of one nucleic acid molecule to a 5' phosphate terminus of a second nucleic acid molecule to form a single molecule.
  • the ligase may be a DNA ligase or RNA ligase.
  • DNA ligases include E. coli DNA ligase, T4 DNA ligase, and mammalian DNA ligases.
  • MIS molecular identifier sequence(s)
  • a MIS can be added to a target nucleic acid by including the sequence in the adaptor to be ligated to the target.
  • a MIS can also be added to a target nucleic acid of interest during amplification by carrying out reverse transcription with a primer that contains a region comprising the barcode sequence and a region that is complementary to the target nucleic acid such that the barcode sequence is incorporated into the final amplified target nucleic acid product (i.e., amplicon).
  • the MIS may be any number of nucleotides of sufficient length to distinguish the MIS from other MIS.
  • a MIS may be anywhere from 4 to 20 nucleotides long, such as 5 to 11, or 12 to 20.
  • the MIS has a length of 6 random nucleotides.
  • the term "molecular identifier sequence,” "MIS,” “unique molecular identifier,” “UMI,” “molecular barcode,” “molecular identifier sequence”, “molecular tag sequence” and “barcode” are used interchangeably herein.
  • sample means a material obtained or isolated from a fresh or preserved biological sample or synthetically-created source that contains nucleic acids of interest.
  • a sample is the biological material that contains the variable immune region(s) for which data or information are sought.
  • Samples can include at least one cell, fetal cell, cell culture, tissue specimen, blood, serum, plasma, saliva, urine, tear, vaginal secretion, sweat, lymph fluid, cerebrospinal fluid, mucosa secretion, peritoneal fluid, ascites fluid, fecal matter, body exudates, umbilical cord blood, chorionic villi, amniotic fluid, embryonic tissue, multicellular embryo, lysate, extract, solution, or reaction mixture suspected of containing immune nucleic acids of interest. Samples can also include non-human sources, such as non-human primates, rodents and other mammals, other animals, plants, fungi, bacteria, and viruses.
  • the present disclosure provides synthetic oligonucleotides which form double-stranded adaptors for use in the generation of nucleic acid libraries.
  • the double-stranded adaptors are stem-loop adaptors comprising a distal loop.
  • the synthetic oligonucleotides which form the double-stranded adaptors can have a length of 20 to 100 nucleotides, such as 50 to 80 nucleotides, such as between 60 and 70 nucleotides.
  • Exemplary structures of the double-stranded nucleic acid adaptors, such as a bud adaptor are provided in FIGs. 1A-1C.
  • the synthetic oligonucleotides which form a bud adaptor comprise a double- stranded ligateable stem region and a double stranded non-ligateable stem region, separated by a bud (also referred to herein as an asymmetric loop region).
  • Each double-stranded region has a 5' end stem strand and a 3' end stem strand. The 3' end and the 5' end can form a blunt end or a staggered end.
  • the double-stranded regions have blunt ends.
  • the asymmetric loop i.e., bud
  • MIS molecular identification sequence
  • the double-stranded nucleic acid adaptor may further comprise a gap region on the strand opposite of the bud.
  • the gap region may only comprise the bond between adjacent nucleotides or a region of non-paired nucleotides.
  • the gap region can be on either strand between the ligateable and non-ligateable stem regions.
  • the gap region comprises a non-replicable base, such as an abasic site or spacer.
  • the double-stranded nucleic acid adaptor may further comprise the non-ligateable stem region, which in a stem-loop will be between the distal loop region and the asymmetric loop region.
  • this region comprises one or more mismatched bases.
  • the double-stranded nucleic acid adaptor may further comprise a primer binding site with a known sequence.
  • the primer binding site may be located in the non-ligateable stem region or the distal loop.
  • a forward primer binding site is located between the gap region and distal loop
  • a reverse primer binding site is located between the asymmetric loop and the main loop.
  • the adaptor may comprise flow cell binding sequences, such as P5 and/or P7, or fragments thereof.
  • a first adaptor comprises a P5 sequence and a second adaptor comprises a P7 sequence.
  • the adaptor can comprise part or all of sequencing primer sequences or their binding sites such as index sequencing primers for particular sequencing platforms (e.g., Illumina index primers).
  • an adaptor may include a barcode domain, e.g., a region or sequence of nucleic acids that serves as a barcode or identifier for a source of target nucleic acids to which the adaptor is ligated during use.
  • a barcode domain may serve as an identifier of a sample from which the target nucleic acids are obtained, such that it may be viewed as a sample barcode.
  • the barcode domain may be positioned at any convenient location in the adaptor, such as the non-ligateable stem region of the adaptor, the asymmetric loop of the adaptor, etc.
  • the barcode domain may be combined with the MIS domain, e.g., such that the adaptors include a barcode/MIS domain.
  • a barcode/MIS domain is made up of a series of interspersed barcode and MIS bases.
  • interspersed is meant that the bases which are barcode bases (i.e., the bases that collectively make up the barcode component of a barcode/MIS domain) are distributed or positioned among MIS bases (i.e., the bases that collectively make up the MIS domain of a barcode/MIS domain).
  • a given barcode/MIS domain is one that includes at least one MIS base positioned adjacent to at least one barcode base, where in those instances in which the barcode/MIS domain is made up of 3 or more bases, at least two bases of a first type (e.g., MIS or barcode) may be separated by at least one base of another type (e.g., MIS or barcode).
  • the length of a given barcode/MIS domain may vary, ranging in some instances from 4 to 50 nts, where in some instances the length ranges from 5 to 25 nts, e.g., 6 to 20 nts, where specific lengths of interest include, but are not limited to: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16 nts.
  • the barcode/MIS domain may be positioned at any convenient location in the adaptor, such as the non-ligateable stem region of the adaptor, the asymmetric loop of the adaptor, etc.
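The idea of an interspersed barcode/MIS domain can be sketched in Python by describing the domain with a layout string, where `B` marks a barcode position and `M` marks a MIS position. The layout notation, the function names, and the example sequences are all hypothetical and serve only to show how the two components can be woven together and later separated.

```python
# Sketch only: the 'B'/'M' layout string is an illustrative convention.
def build_domain(barcode: str, mis: str, layout: str) -> str:
    """Interleave barcode and MIS bases according to `layout`."""
    if set(layout) - {"B", "M"}:
        raise ValueError("layout may contain only 'B' and 'M'")
    if layout.count("B") != len(barcode) or layout.count("M") != len(mis):
        raise ValueError("layout does not match barcode/MIS lengths")
    barcode_bases, mis_bases = iter(barcode), iter(mis)
    return "".join(next(barcode_bases) if slot == "B" else next(mis_bases)
                   for slot in layout)


def split_domain(domain: str, layout: str) -> tuple[str, str]:
    """Recover (barcode, MIS) from a domain read with a known layout."""
    if len(domain) != len(layout):
        raise ValueError("domain and layout must have the same length")
    barcode = "".join(b for b, slot in zip(domain, layout) if slot == "B")
    mis = "".join(b for b, slot in zip(domain, layout) if slot == "M")
    return barcode, mis


if __name__ == "__main__":
    layout = "BMBMBMBM"  # alternating sample-barcode and MIS positions
    domain = build_domain("ACGT", "TTGG", layout)
    print(domain)                        # ATCTGGTG
    print(split_domain(domain, layout))  # ('ACGT', 'TTGG')
```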
  • Embodiments of the methods described herein employ populations or collections of stem loop adaptors, e.g., as described above. Such populations or collections may be made up of a plurality of different, i.e., distinct, stem loop adaptors that differ from each other in terms of sequence.
  • the plurality of different stem loop adaptors is made up of adaptors that have regions of common sequence (e.g., the stem regions, the loop regions) and regions of differing sequence (e.g., the MIS containing regions, the dephased stem region, etc.).
  • a population of stem loop adaptors may be made up of stem loop adaptors in which the only region that differs among the population is the MIS containing region, e.g., the disparate distinct adaptor members of the population only differ from each other in terms of their MIS sequences.
  • the number of distinct stem loop adaptors in a given population that is employed in embodiments of the invention may vary, where in some instances the amount is 10 or more, such as 50 or more, 100 or more, 500 or more, 1,000 or more, 5,000 or more, 10,000 or more, 50,000 or more, 100,000 or more, 250,000 or more, 500,000 or more, 1,000,000 or more, 5,000,000 or more, 10,000,000 or more, 20,000,000 or more, where in some instances the number is 50,000,000 or less, such as 25,000,000 or less, including 20,000,000 or less, where in some instances the number is 10,000,000 or less, 5,000,000 or less, 1,000,000 or less, 500,000 or less, 100,000 or less, including 50,000 or less.
  • a molecular identification sequence within the stem-loop adaptors, particularly in an asymmetric loop of the adaptor, allows for the tagging of individual source molecules for subsequent informatic analysis, and provides diversity and balance to analyze samples of high complexity.
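The kind of informatic analysis that MIS tagging enables can be sketched as follows: reads carrying the same MIS are grouped into a family presumed to derive from one source molecule, and a per-position majority vote gives a consensus for that family. This is a simplified illustration, not the analysis pipeline of this disclosure. It assumes, hypothetically, that the MIS occupies the first six bases of each read, and the function names are invented for the example.

```python
# Sketch only: assumes the MIS is the first `mis_length` bases of each read.
from collections import Counter, defaultdict


def group_by_mis(reads: list[str], mis_length: int = 6) -> dict[str, list[str]]:
    """Group the insert portion of each read under its MIS."""
    families: dict[str, list[str]] = defaultdict(list)
    for read in reads:
        families[read[:mis_length]].append(read[mis_length:])
    return dict(families)


def consensus(sequences: list[str]) -> str:
    """Majority base at each position, over the shortest sequence length."""
    length = min(len(seq) for seq in sequences)
    return "".join(
        Counter(seq[i] for seq in sequences).most_common(1)[0][0]
        for i in range(length)
    )


if __name__ == "__main__":
    reads = [
        "AACCGGTTAGC",  # family AACCGG
        "AACCGGTTAGC",
        "AACCGGTTCGC",  # same family, one base differs
        "GGTTAACCGTA",  # family GGTTAA
    ]
    families = group_by_mis(reads)
    print(len(families))                  # 2
    print(consensus(families["AACCGG"]))  # TTAGC
```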
  • the barcode or molecular identifier sequence within the asymmetric loop can have a length of 4 to 15 nucleotides, such as 5 to 10 nucleotides, such as 5, 6, 7, 8, 9, or 10 nucleotides.
  • the asymmetric loop has a length of 6 nucleotides resulting in 16.8 × 10^6 total possible combinations of MIS adaptors within a library.
  • the random barcode sequence is generated by using a mixture of A, G, C, and/or T for incorporation of nucleotides into the MIS of the double-stranded nucleic acid adaptor.
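The arithmetic behind MIS diversity can be checked with a few lines of Python. A random sequence of n positions drawn from four bases has 4^n possibilities, so a single 6-nucleotide MIS gives 4,096. The quoted figure of about 16.8 million equals 4^12, which would correspond to pairing two independent 6-nucleotide MIS sequences, for example one on the adaptor at each end of a fragment; that reading is an assumption and is not stated in this passage. The function names are hypothetical.

```python
# Sketch only: counts and samples random MIS sequences from an equal base mix.
import random


def mis_diversity(length: int, alphabet: str = "ACGT") -> int:
    """Number of distinct sequences of `length` over `alphabet`."""
    return len(alphabet) ** length


def random_mis(length: int, rng: random.Random, alphabet: str = "ACGT") -> str:
    """Draw one random MIS, mimicking synthesis from a mixture of bases."""
    return "".join(rng.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    print(mis_diversity(6))       # 4096
    print(mis_diversity(6) ** 2)  # 16777216
    print(random_mis(6, random.Random()))
```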
  • the gap region across from the asymmetric loop has a length at least 1 nucleotide less than the length of the asymmetric loop. In some aspects, the gap region is at least 2, or up to at least 5, nucleotides shorter than the length of the asymmetric loop.
  • an adaptor with an asymmetric loop of 6 nucleotides would have a gap region of less than 6 nucleotides, such as 5, 4, 3, 2, 1, or 0 (e.g., nucleotide bond) nucleotides in length.
  • the gap region has a length of 1 nucleotide, such as one non-replicable base, particularly one abasic site.
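The length relationship between the asymmetric loop and the gap region can be expressed as a small helper, sketched here in Python. The function name and the `min_difference` parameter are illustrative; a gap length of 0 stands for the case where the gap is only the bond between adjacent nucleotides.

```python
# Sketch only: enumerates gap lengths permitted for a given loop length.
def allowed_gap_lengths(loop_length: int, min_difference: int = 1) -> list[int]:
    """Gap lengths at least `min_difference` shorter than the loop."""
    if min_difference < 1:
        raise ValueError("the gap must be shorter than the loop")
    if loop_length < min_difference:
        raise ValueError("loop is too short for the requested difference")
    return list(range(loop_length - min_difference, -1, -1))


if __name__ == "__main__":
    print(allowed_gap_lengths(6))                    # [5, 4, 3, 2, 1, 0]
    print(allowed_gap_lengths(6, min_difference=2))  # [4, 3, 2, 1, 0]
```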
  • barcodes may be employed. Barcoding is described, e.g., in U.S. Pat. 7,902,122. Methods of using stem loop adaptor ligation and primer extension or PCR to add additional sequences are described, e.g., in U.S. Pat. 7,803,550, which is incorporated by reference herein in its entirety. Barcode incorporation by primer extension, for example via PCR, may be performed using methods described in U.S. 5,935,793 or US 2010/0227329. In some embodiments, a barcode may be incorporated into a nucleic acid via ligation, which can then be followed by amplification; for example, methods described in U.S. Pat. 5,858,656, U.S.
  • U.S. Pat. Publn. 2011/0319290, or U.S. Pat. Publn. 2012/0028814 may be used with the present invention.
  • one or more barcodes may be used, e.g., as described in U.S. Pat. Publn. 2007/0020640, U.S. Pat. Publn. 2009/0068645, U.S. Pat. Publn. 2010/0273219, U.S. Pat. Publn. 2011/0015096, or U.S. Pat. Publn. 2011/0257031.
  • the double-stranded nucleic acid adaptors further comprise a variable length stem region in the ligateable stem region between the terminal end (e.g., the 5' terminal end and/or the 3' terminal end) and the bud or gap region.
  • the variable stem provides sufficient diversity in the library at the beginning of the read for cluster detection and intensity correction without a control nucleic acid library, such as the PhiX control nucleic acid library.
  • the variable stems also provide more unique information for distinguishing between sequences bioinformatically.
  • the variable stems (also referred to herein as the dephased stems) can differ by a single nucleotide (e.g., FIG.
  • a population of double-stranded nucleic acid adaptors can comprise a mixture of double-stranded nucleic acid adaptors having a stem length of n, n+1, n+2, and n+3, such as 8, 9, 10, and 11.
  • the variable stems have a length n between 3-20 nucleotides, particularly 6-15 nucleotides, such as 6, 7, 8, 9, 10, 11, or 12 nucleotides.
  • dephased stems are described in Lundberg et al., 2013, and Wu et al., 2015.
  • variable-length stem sequences within one subset of double-stranded nucleic acid adaptors include TGAGCTAC, TGAGCTACT, TGAGCTACTG, and TGAGCTACTGA as well as the sequences disclosed in FIG. 1D.
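The effect of dephased stems on base diversity can be illustrated with a small Python sketch. It builds the n, n+1, n+2, and n+3 stems as prefixes of the longest example stem and then tallies which bases appear at a given read position when every read carries the same insert. The insert sequence, the assumption that each read starts with the stem followed directly by the insert, and the function names are hypothetical.

```python
# Sketch only: shows how staggered stem lengths shift an identical insert.
from collections import Counter


def dephased_stems(longest_stem: str, count: int = 4) -> list[str]:
    """Return `count` stems of increasing length ending with `longest_stem`."""
    shortest = len(longest_stem) - (count - 1)
    if shortest < 1:
        raise ValueError("stem is too short for the requested count")
    return [longest_stem[:shortest + i] for i in range(count)]


def bases_at_position(reads: list[str], position: int) -> Counter:
    """Tally the bases observed at a 0-based read position."""
    return Counter(read[position] for read in reads if len(read) > position)


if __name__ == "__main__":
    stems = dephased_stems("TGAGCTACTGA")
    print(stems)  # ['TGAGCTAC', 'TGAGCTACT', 'TGAGCTACTG', 'TGAGCTACTGA']

    insert = "ACGTACGTACGT"  # hypothetical low-diversity insert
    dephased = [stem + insert for stem in stems]
    in_phase = [stems[0] + insert for _ in stems]

    print(len(bases_at_position(dephased, 11)))  # 4 different bases
    print(len(bases_at_position(in_phase, 11)))  # 1 base
```

With a single stem length every read shows the same base at each position, whereas the staggered stems spread the same insert across neighbouring positions, which is the diversity benefit the text attributes to dephasing.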
  • the terminal end such as the 5' end, comprises a ligation block.
  • the ligation block is a dephosphorylated nucleotide, a 5' hydroxyl, or an inverted base, such as inverted dT.
  • a 5' end of the double-stranded nucleic acid adaptor oligonucleotide lacks a phosphate.
  • the 5' end and/or 3' end has at least one phosphorothioate bond.
  • the phosphorothioate bond can protect the adaptor from degradation by proofreading enzymes (e.g., 5'-3' exonuclease) and prevent unwanted ligation products or adaptor dimers.
  • the double-stranded nucleic acid adaptor has a phosphorothioate modification on the last 2 bases of the 3' terminal end, and the 1st base of the 5' terminal end to deter adapter dimer formation and optimize the signal-to-noise ratio.
  • exonuclease resistant modifications may include phosphorodithioates, methyl phosphonates and 2'-O-methyl sugars, either separately or in combination.
  • a number of other modifications are known to reduce the exonuclease degradation of single DNA strands, including phosphoramidates (P-NR2), phosphorofluoridates (P-F), boranophosphates (P-BH3) or phosphoroselenoates (P-Se), and modifications to the sugar rings, such as 2'-O-alkyl groups, 2'-fluoro groups, 2'-amino groups such as 2-amino propyl.
  • the double-stranded nucleic acid adaptor comprises a replication stop or non-replicable base.
  • the gap region may comprise a non-replicable base or spacer, such as an abasic site or cleavable base.
  • the distal loop of the stem-loop oligonucleotide adaptor may comprise a non-replicable base or spacer, such as an abasic site or cleavable base.
  • the replication stop may be at the 5' end of the stem, the 3' end of the stem, or proximal to the distal loop.
  • the non-replicable base can function as a polymerase terminator and facilitates correct adaptor folding. Correct adaptor folding, facilitated by the use of non-replicable bases, also prevents spurious priming by excess stem loop adaptors as the folded, stem-loop conformation is thermodynamically favored rather than hybridization to library molecules.
  • the adaptor may comprise at least 2, 3, 4, 5, 6 or more non-replicable bases depending on the length of the adaptor.
  • Non-replicable bases include, but are not limited to, 1',2'-dideoxyribose (idSp), and deoxyuridine.
  • Cleavable bases include, but are not limited to: uracil, inosine or a ribonucleotide.
  • spacer means a hydrocarbon residue with preferably one to six carbon atoms, preferably an alkdiyl group with 2 to 4 carbon atoms, most preferred linear C3 (5'-C3-spacer).
  • Double-stranded nucleic acid adaptors comprising cleavable bases can be cleaved by enzymes or chemical reagents.
  • cleaving agents include DNA repair enzymes, glycosylases, DNA cleaving endonucleases, ribonucleases and silver nitrate.
  • cleavage at dU may be achieved using uracil DNA glycosylase and endonuclease VIII (USER™, NEB, Ipswich, Mass.) (U.S. Pat. No. 7,435,572).
  • the modified nucleotide is a ribonucleotide
  • the adapter can be cleaved with an endoribonuclease.
  • Abasic sites can be recognized and cleaved by AP endonucleases and/or AP lyases.
  • Class II AP endonucleases cleave at AP sites to leave a 3' OH that can be used in polynucleotide polymerization.
  • AP endonucleases can remove moieties attached to the 3' OH that inhibit polynucleotide polymerization. For example a 3' phosphate can be converted to a 3' OH by E. coli endonuclease IV.
  • AP endonucleases can work in conjunction with glycosylases.
  • FIGS. 1A-1D provide depictions of illustrative embodiments of MIS containing adaptors according to certain embodiments of the invention.
  • a non-degradable bud adaptor is illustrated, where the adaptor comprises a molecular identification sequence (MIS) in an asymmetric loop (or bud), an abasic site (e.g., 1',2'-dideoxyribose (idSp)) opposite the MIS to facilitate correct adaptor folding, abasic sites in the distal loop to function as polymerase terminators, a variable length stem to mediate ligation to nucleic acid molecules and provide base diversity, and phosphorothioate bonds (*) to prevent degradation through exonuclease activity and decrease adaptor dimer formation.
  • FIG. 1B provides a schematic depicting a non-degradable bubble adaptor (BB) comprising a MIS in a symmetric loop, abasic site(s) opposite the MIS, abasic sites in the main loop, and phosphorothioate bonds.
  • FIG. 1C provides a schematic depicting a degradable bubble with a self-complementary byproduct (RB).
  • FIG. 1D provides exemplary sequences of de-phased stem regions that may be present in adaptors of the invention.
  • Double-stranded nucleic acid adaptors and stem-loop oligonucleotides can be used as adaptors for preparing libraries for whole genome or whole transcriptome amplification for PCR analysis, microarray analysis, conventional Sanger or next generation sequencing, e.g., as described in U.S. Pat. No. 7,803,550.
  • a whole genome is amplified from a single cell.
  • a method of preparing a library of nucleic acid molecules For example, libraries generated by DNA fragmentation and addition of a stem-loop adaptor to one or both DNA ends may be used to amplify (by PCR) and sequence DNA regions adjacent to a previously established DNA sequence (see, for example, U.S. Patent No. 6,777,187 and references therein, all of which are incorporated by reference herein in their entirety).
  • the double-stranded nucleic acid adaptor can be ligated to the 5' end, the 3' end, or both strands of DNA.
  • a plurality of nucleic acid molecules are amplified and sequenced by ligating the plurality of nucleic acid molecules to a population, e.g., as described above, of double-stranded nucleic acid adaptors.
  • One method comprises obtaining a population of target nucleic acid molecules and attaching at least one end of a double-stranded nucleic acid adaptor to at least one end of the target nucleic acid molecule and displacing one strand of the adaptor bound oligonucleotide by strand displacement or nick translation.
  • a bud stem-loop adaptor comprising a MIS is ligated to both ends of the target nucleic acid.
  • a bud stem-loop adaptor comprising a MIS is ligated to one end of the target nucleic acid and a stem-loop adaptor not comprising a MIS is ligated to the other end of the target nucleic acid.
  • the two adaptors ligated to each end of a target nucleic acid may comprise part or all of a first sequencing primer sequence or a second sequencing primer sequence, such that an adaptor on one end has part or all of a first sequencing primer sequence and an adaptor on the other end has part or all of a second sequencing primer sequence.
  • the adaptor may be ligated to one strand (i.e., single-stranded ligation) or to both strands (i.e., double-stranded ligation) of the target nucleic acid.
  • the target nucleic acid is a double-stranded DNA molecule.
  • the double-stranded DNA may be any type of DNA (or sub-type thereof) including, but not limited to, genomic DNA (e.g., prokaryotic genomic DNA (e.g., bacterial genomic DNA, archaea genomic DNA, etc.), eukaryotic genomic DNA (e.g., plant genomic DNA, fungi genomic DNA, animal genomic DNA (e.g., mammalian genomic DNA (e.g., human genomic DNA, rodent genomic DNA (e.g., mouse, rat, etc.), etc.), insect genomic DNA (e.g., drosophila), amphibian genomic DNA (e.g., Xenopus), etc.)), viral genomic DNA, mitochondrial DNA, cell-free DNA, such as NIPT DNA, including fetal and/or maternal cell free DNA, or any combination of DNA types thereof or subtypes thereof.
  • the method comprises attaching an adaptor to complementary single strands of the double-stranded DNA molecule.
  • the plurality of genomic DNA molecules are enzymatically digested or randomly fragmented to produce DNA fragments, a MIS stem-loop adaptor is ligated to at least one end of a plurality of the DNA fragments to produce adaptor-linked fragments, and the adaptor-linked fragments are then amplified.
  • the target nucleic acids are isolated from cell-free DNA (cfDNA), e.g., where the DNA is an NIPT DNA sample.
  • the isolated cfDNA may comprise fragments (e.g., of about 50 to 200 bp, particularly about 167 bp in length) and not need a fragmentation step prior to library preparation.
  • a MIS double-stranded nucleic acid adaptor may be coupled to one end of a target nucleic acid molecule or to both ends of a target nucleic acid molecule.
  • the double-stranded nucleic acid adaptor may be coupled to the nucleic acid molecule via ligation to the 5' end of the nucleic acid molecule, for example, by blunt-end ligation. Ligating the double-stranded nucleic acid adaptor to one or both ends of a target nucleic acid molecule may result in nick formation. Said one or more nicks may be removed from the ligated double-stranded nucleic acid adaptor and the nucleic acid target molecule.
  • the adaptor-bound nucleic acid molecule comprises a nick having a 3' hydroxyl group and strand displacement or nick translation polymerization is performed to extend the nucleic acid molecules to the adaptor.
  • the polymerization may cease at a non-replicable base, such as within the gap region.
  • polymerization may cease in the region between loops, and/or the main loop.
  • an extension reaction may extend the 3' end of the nucleic acid molecule through the stem-loop adaptor where the loop portion is cleaved at a cleavable replication stop.
  • methods of the present invention utilize a strand-displacing polymerase, such as Φ29 Polymerase, Bst Polymerase, Vent Polymerase, 9°Nm Polymerase, Klenow fragment of DNA Polymerase I, MMLV Reverse Transcriptase, AMV reverse transcriptase, HIV reverse transcriptase, a mutant form of T7 phage DNA polymerase that lacks 3'-5' exonuclease activity, or a mixture thereof.
  • Nucleic acids in a nucleic acid sample being analyzed (or processed) in accordance with the present invention can be from any nucleic acid source, e.g., as described above.
  • nucleic acids in a nucleic acid sample can be from virtually any nucleic acid source, including but not limited to genomic DNA, complementary DNA (cDNA), RNA (e.g., messenger RNA, ribosomal RNA, short interfering RNA, microRNA, etc.), plasmid DNA, mitochondrial DNA, cfDNA, etc.
  • any organism can be used as a source of nucleic acids to be processed in accordance with the present invention; no limitation in that regard is intended.
  • Exemplary organisms include, but are not limited to, plants, animals (e.g., reptiles, mammals, insects, worms, fish, etc.), bacteria, fungi (e.g., yeast), viruses, etc.
  • the nucleic acids in the nucleic acid sample are derived from a mammal, where in certain embodiments the mammal is a human.
  • a nucleic acid molecule of interest can be a single nucleic acid molecule or a plurality of nucleic acid molecules.
  • a nucleic acid molecule of interest can be of biological or synthetic origin. Examples of nucleic acid molecules include genomic DNA, cDNA, RNA, amplified DNA, a pre-existing nucleic acid library, etc.
  • a nucleic acid molecule of interest may be subjected to various treatments, such as repair treatments and fragmenting treatments. Fragmenting treatments include mechanical, sonic, chemical, enzymatic, degradation over time, etc. Repair treatments include nick repair via extension and/or ligation, polishing to create blunt ends, removal of damaged bases such as deaminated, derivatized, abasic, or crosslinked nucleotides, etc.
  • a nucleic acid molecule of interest may also be subjected to chemical modification (e.g., bisulfite conversion, methylation / demethylation), extension, amplification (e.g., PCR, isothermal, etc.), etc.
  • the reaction may or may not use a fragmentation step.
  • the plurality of nucleic acid molecules comprises nucleic acid fragments, such as gDNA subject to fragmentation.
  • the shear force may be a hydrodynamic shear force, such as those generated by acoustic or mechanical means.
  • Hydrodynamic shearing of a nucleic acid can occur by any method known in the art, including passing the nucleic acid through a narrow capillary or orifice, referred to as "point-sink" shearing (Oefner et al., 1996; Thorstenson et al., 1998; Quail, 2010), acoustic shearing, or sonication.
  • the commercially available focused-ultrasonicators in conjunction with miniTUBEs or microTUBEs (Covaris, Woburn, MA; U.S. Patent Nos. 8,459,121; 8,353,619; 8,263,005; 7,981,368; 7,757,561), can randomly fragment DNA with distributions centered between 2-5 kb and 0.1-1.5 kb, respectively.
  • Sonication subjects nucleic acid to hydrodynamic shearing forces (Grokhovsky, 2006; Sambrook et al, 2006).
  • the commercially available Bioruptor (Diagenode; Denville, NJ; U.S. Patent Publn. No. 2012/0264228) uses sonication to shear nucleic acids.
  • a nucleic acid fragment may have a size of about 100 bp, about 200 bp, about 300 bp, about 400 bp, about 500 bp, about 1000 bp, or about 2000 bp.
  • the nucleic acid fragments may have an average size of about 100 bp, about 200 bp, about 300 bp, about 400 bp, about 500 bp, about 1000 bp, or about 2000 bp.
  • a nucleic acid molecule may have a size of about 2000 bp, 5000 bp, 7500 bp, 10,000 bp, 20,000 bp, 30,000 bp, 40,000 bp, 50,000 bp, 60,000 bp, 70,000 bp, 80,000 bp, 90,000 bp, or 100,000 bp.
  • Nucleic acids may be, for example, RNA or DNA. Modified forms of RNA or DNA may also be used.
  • a given protocol may include a pooling step, e.g., where a first adaptor ligated composition is combined or pooled with the one or more additional adaptor ligated compositions.
  • nucleic acid fragments tagged according to aspects of the subject invention are pooled with nucleic acid fragments derived from a plurality of sources (e.g., a plurality of organisms, tissues, cells, or subjects), where by "plurality" is meant two or more.
  • the number of different tagged compositions produced from different sources that are combined or pooled in such embodiments may vary, where the number ranges in some instances from 2 to 50, such as 3 to 25, including 4 to 20 or 10,000, or more.
  • Prior to or after pooling, the different tagged compositions can be amplified, e.g., by polymerase chain reaction (PCR), such as described above.
  • the RNA molecule may be obtained from a sample, such as a sample comprising total cellular RNA, a transcriptome, or both; the sample may be obtained from one or more viruses; from one or more bacteria; or from a mixture of animal cells, bacteria, and/or viruses, for example.
  • the sample may comprise mRNA, such as mRNA that is obtained by affinity capture.
  • Obtaining nucleic acid molecules may comprise generation of the cDNA molecule by reverse transcribing the mRNA molecule with a reverse transcriptase, such as, for example Tth DNA polymerase, HIV Reverse Transcriptase, AMV Reverse Transcriptase, MMLV Reverse Transcriptase, or a mixture thereof.
  • two synthetic oligonucleotide primers which are complementary to two regions of the template DNA (one for each strand) to be amplified, are added to the template DNA (that need not be pure), in the presence of excess deoxynucleotides (dNTP's) and a thermostable polymerase, such as, for example, Taq (Thermus aquaticus) DNA polymerase.
  • the target DNA is repeatedly denatured (around 90°C), annealed to the primers (typically at 50-60°C) and a daughter strand extended from the primers (72°C). As the daughter strands are created they act as templates in subsequent cycles.
  • the template region between the two primers is amplified exponentially, rather than linearly.
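As a rough illustration of the exponential growth described above, the following Python sketch computes an idealized copy number after a given number of cycles. The function name `pcr_copies` and the `efficiency` parameter are assumptions introduced here for illustration and are not part of the source; real reactions plateau and rarely double perfectly in every cycle.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized number of template copies after `cycles` PCR cycles.

    `efficiency` is a hypothetical per-cycle efficiency: 1.0 means every
    template is copied in every cycle (perfect doubling), 0.0 means no
    amplification at all.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    # Each cycle multiplies the template count by (1 + efficiency),
    # so the growth is exponential in the number of cycles.
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    for n in (1, 10, 14):
        print(n, pcr_copies(1, n))
```

With perfect doubling the count grows as a power of two in the cycle number, whereas linear amplification would only add a fixed number of copies per cycle.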
  • a second barcode such as a sample barcode
  • One method involves annealing a primer to the first barcoded nucleic acid molecule, the primer including a first portion complementary to the first barcoded nucleic acid molecule and a second portion including a second barcode; and extending the annealed primer to form a dual barcoded nucleic acid molecule, the dual barcoded nucleic acid molecule including the second barcode, the first barcode, and at least a portion of the nucleic acid molecule.
  • the primer may include a 3' portion and a 5' portion, where the 3' portion may anneal to a portion of the first barcode and the 5' portion comprises the second barcode.
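The primer-extension step described above can be sketched in Python as follows. This is a minimal model under stated assumptions: sequences are plain uppercase A/C/G/T strings written 5' to 3', annealing requires an exact match, and the names `reverse_complement` and `extend_barcoding_primer`, as well as the example sequences, are hypothetical and chosen only for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extend_barcoding_primer(template: str, anneal_3p: str, second_barcode: str) -> str:
    """Model annealing and extension of a primer carrying a second barcode.

    template       -- first-barcoded strand, 5'->3'
    anneal_3p      -- 3' portion of the primer, complementary to the template
    second_barcode -- 5' portion of the primer (does not anneal)
    Returns the extended strand, 5'->3'.
    """
    # The primer's 3' portion pairs with its reverse complement on the template.
    site = reverse_complement(anneal_3p)
    start = template.find(site)
    if start < 0:
        raise ValueError("primer 3' portion does not anneal to the template")
    primer = second_barcode + anneal_3p
    # The polymerase copies the template bases lying 5' of the annealing site.
    return primer + reverse_complement(template[:start])


if __name__ == "__main__":
    # Hypothetical template: a target fragment followed by a first barcode.
    template = "GATTACA" + "CCGGTT"
    product = extend_barcoding_primer(template, "AACCGG", "TTAGGC")
    print(product)
```

In this model the product carries the second barcode at its 5' end, followed by the complement of the first barcode and of the target portion, which mirrors the dual-barcoded molecule described in the text.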
  • DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing-by-synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing-by-synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, and SOLiD sequencing.
  • the nucleic acid library may be generated with an approach compatible with Illumina sequencing, such as a Nextera™ DNA sample prep kit; additional approaches for generating Illumina next-generation sequencing libraries are described, e.g., in Oyola et al. (2012).
  • a nucleic acid library is generated with a method compatible with a SOLiD™ or Ion Torrent sequencing method (e.g., a SOLiD® Fragment Library Construction Kit, a SOLiD® Mate-Paired Library Construction Kit, a SOLiD® ChIP-Seq Kit, a SOLiD® Total RNA-Seq Kit, a SOLiD® SAGE™ Kit, an Ambion® RNA-Seq Library Construction Kit, etc.). Additional methods for next-generation sequencing, including various methods for library construction that may be used with embodiments of the present invention, are described, e.g., in Pareek (2011) and Thudi (2012).
  • the sequencing technologies used in the methods of the present disclosure include the HiSeq™ system (e.g., HiSeq™ 2000 and HiSeq™ 1000) and the MiSeq™ system from Illumina, Inc.
  • the HiSeq™ system is based on massively parallel sequencing of millions of fragments using attachment of randomly fragmented genomic DNA to a planar, optically transparent surface and solid phase amplification to create a high density sequencing flow cell with millions of clusters, each containing about 1,000 copies of template per sq. cm. These templates are sequenced using four-color DNA sequencing-by-synthesis technology.
  • the MiSeq™ system uses TruSeq™, Illumina's reversible terminator-based sequencing-by-synthesis.
  • 454 sequencing involves two steps. In the first step, DNA is sheared into fragments of approximately 300-800 base pairs, and the fragments are blunt ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads using, e.g., Adaptor B, which contains 5'-biotin tag.
  • the fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead.
  • the beads are captured in wells (pico-liter sized). Pyrosequencing is performed on each DNA fragment in parallel. Addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated.
  • SOLiD sequencing genomic DNA is sheared into fragments, and adaptors are attached to the 5' and 3' ends of the fragments to generate a fragment library.
  • internal adaptors can be introduced by ligating adaptors to the 5' and 3' ends of the fragments, circularizing the fragments, digesting the circularized fragment to generate an internal adaptor, and attaching adaptors to the 5' and 3' ends of the resulting fragments to generate a mate-paired library.
  • clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates are denatured and beads are enriched to separate the beads with extended templates. Templates on the selected beads are subjected to a 3' modification that permits bonding to a glass slide.
  • IonTorrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way. Each well holds a different DNA template. Beneath the wells is an ion-sensitive layer and beneath that a proprietary Ion sensor. If a nucleotide, for example a C, is added to a DNA template and is then incorporated into a strand of DNA, a hydrogen ion will be released. The charge from that ion will change the pH of the solution, which can be detected by the proprietary ion sensor.
  • the sequencer will call the base, going directly from chemical information to digital information.
  • the Ion Personal Genome Machine (PGM™) sequencer then sequentially floods the chip with one nucleotide after another. If the next nucleotide that floods the chip is not a match, no voltage change will be recorded and no base will be called. If there are two identical bases on the DNA strand, the voltage will be double, and the chip will record two identical bases called. Because this is direct detection (no scanning, no cameras, no light), each nucleotide incorporation is recorded in seconds.
  • SMRT™ single molecule, real-time
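The flow-based base calling described above can be illustrated with a short Python sketch. It is a simplified model, not the instrument's algorithm: the signal for each flow is taken to be exactly the number of bases incorporated, the sequence passed in is the strand being synthesized (5' to 3'), and the flow order "TACG", the `max_flows` cap, and the function names are assumptions made for illustration.

```python
def flow_signals(synthesized: str, flow_order: str = "TACG", max_flows: int = 400) -> list[int]:
    """Return the idealized signal for each nucleotide flow.

    A flow that does not match the next base gives 0 (no base called);
    a run of n identical bases gives a signal of n in a single flow.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(synthesized) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        run = 0
        # Incorporate every consecutive base that matches the flowed nucleotide.
        while pos < len(synthesized) and synthesized[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append(run)
        flow += 1
    return signals


def call_bases(signals: list[int], flow_order: str = "TACG") -> str:
    """Convert per-flow signals back into a base sequence."""
    return "".join(flow_order[i % len(flow_order)] * s for i, s in enumerate(signals))


if __name__ == "__main__":
    signals = flow_signals("TTAG")
    print(signals, call_bases(signals))
```

For the hypothetical strand "TTAG", the T flow gives a doubled signal, the C flow gives no signal, and decoding the signals recovers the original sequence.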
  • each of the four DNA bases is attached to one of four different fluorescent dyes. These dyes are phospholinked.
  • a single DNA polymerase is immobilized with a single molecule of template single stranded DNA at the bottom of a zero-mode waveguide (ZMW).
  • ZMW is a confinement structure which enables observation of incorporation of a single nucleotide by DNA polymerase against the background of fluorescent nucleotides that rapidly diffuse in and out of the ZMW (in microseconds). It takes several milliseconds to incorporate a nucleotide into a growing strand.
  • the fluorescent label is excited and produces a fluorescent signal, and the fluorescent tag is cleaved off. Detection of the corresponding fluorescence of the dye indicates which base was incorporated. The process is repeated.
  • a further sequencing platform includes the CGA Platform (Complete Genomics). The CGA technology is based on preparation of circular DNA libraries and rolling circle amplification (RCA) to generate DNA nanoballs that are arrayed on a solid support (Drmanac et al. 2010).
  • Complete Genomics' CGA Platform uses a novel strategy called combinatorial probe anchor ligation (cPAL) for sequencing. The process begins by hybridization between an anchor molecule and one of the unique adapters. Four degenerate 9-mer oligonucleotides are labeled with specific fluorophores that correspond to a specific nucleotide (A, C, G, or T) in the first position of the probe. Sequence determination occurs in a reaction where the correct matching probe is hybridized to a template and ligated to the anchor using T4 DNA ligase.
  • After imaging of the ligated products, the ligated anchor-probe molecules are denatured. The process of hybridization, ligation, imaging, and denaturing is repeated five times using new sets of fluorescently labeled 9-mer probes that contain known bases at the n + 1, n + 2, n + 3, and n + 4 positions.
  • FIG. 2 provides a schematic depiction of a process for library construction according to an embodiment of the invention.
  • fragmented double-stranded genomic DNA is combined with a population of non-degradable bud adaptors each comprising an MIS.
  • the 3' end of the genomic DNA is extended along the stem of the bud adaptor, through the MIS containing bud, until it reaches a non-replicable base in the loop.
  • Resultant primer binding sites initially provided in the non-ligateable stem region of the adaptors are then employed to amplify the DNA, where the amplified DNA includes sample barcode and P5/P7 domains, e.g., for Illumina NGS.
  • kits for creating libraries of target nucleic acids in a sample refers to a combination of physical elements.
  • a kit may include, for example, one or more components such as double-stranded nucleic acid adaptors or stem-loop adaptors, including without limitation specific primers, enzymes, reaction buffers, an instruction sheet, and other elements useful to practice the technology described herein.
  • These physical elements can be arranged in any way suitable for carrying out the invention.
  • the kit may further comprise a polymerase, such as a strand displacing polymerase, including, for example, Φ29 Polymerase, Bst Polymerase, Vent Polymerase, 9°Nm Polymerase, Klenow fragment of DNA Polymerase I, MMLV Reverse Transcriptase, a mutant form of T7 phage DNA polymerase that lacks 3'-5' exonuclease activity, or a mixture thereof.
  • kits may be packaged either in aqueous media or in lyophilized form.
  • the container means of the kits will generally include at least one vial, test tube, flask, bottle, syringe or other container means, into which a component may be placed, and preferably, suitably aliquoted (e.g., aliquoted into the wells of a microtiter plate). Where there is more than one component in the kit, the kit also will generally contain a second, third or other additional container into which the additional components may be separately placed. However, various combinations of components may be comprised in a single vial.
  • the kits of the present invention also will typically include a means for containing the nucleic acids, and any other reagent containers in close confinement for commercial sale. Such containers may include injection or blow molded plastic containers into which the desired vials are retained.
  • kits will also include instructions for employing the kit components as well as the use of any other reagent not included in the kit. Instructions may include variations that can be implemented. It is contemplated that such reagents are embodiments of kits of the invention. Such kits, however, are not limited to the particular items identified above and may include any reagent used for the manipulation or characterization of the methylation of a gene.
  • the container means of the kits will generally include at least one vial, test tube, flask, bottle, or other container means, into which a component may be placed, and preferably, suitably aliquoted. Where there is more than one component in the kit, the kit also will generally contain additional containers into which the additional components may be separately placed. However, various combinations of components may be comprised in a container.
  • the kits of the present invention also will typically include a means for packaging the component containers in close confinement for commercial sale. Such packaging may include injection or blow-molded plastic containers into which the desired component containers are retained.
  • Libraries were prepared from both individual and pooled plasma samples obtained from donors. Cell-free DNA was isolated from the pooled plasma samples using the Qiagen QIAamp Circulating Nucleic Acid kit. Libraries were prepared as in the ThruPLEX® Plasma-seq Kit (Rubicon Genomics®), including repairing the cfDNA to produce molecules with blunt ends, with the difference being ligation of the stem-loop adaptors depicted in FIG. 1A to the 5' end of the cfDNA, leaving a nick at the 3' end of the target fragment. Next, the 3' ends of the cfDNA were extended to complete library synthesis and Illumina-compatible indexes were added by amplification.
  • the library was then processed and sequenced on the Illumina MiSeq, NextSeq500 on both mid- and high-output flow cells, as well as the HiSeq2500 and HiSeq3000. Sequencing data was generated using PicardTools.
  • each of the three adaptor designs was evaluated using sequencing analysis metrics, particularly the percentage of unmapped reads. While the non-degradable bubble (BB) design (FIG. 1A) was found to lose diversity due to collapse of the structure, the non-degradable bud (YB) design (FIG. 1C) was shown to have a significant reduction in the percentage of unmapped reads (FIG. 3A). Further, the bud adaptor design with a 5' phosphorothioate bond in addition to the 3' phosphorothioate bond (5PTYB-idSp) showed a significant reduction in the percentage of unmapped reads compared to the bud adaptor design with only the 3' phosphorothioate (YBidSp).
  • the bud adaptor design (YB-5PT-idSp) with 2 abasic sites in the main loop, phosphorothioate modification on the last 2 of the 3' bases, and the 1st of the 5' base to deter adapter dimer formation was used for the subsequent studies.
  • the bud stem-loop adaptors with dephased stems of 8, 10, and 12 nucleotides were ligated to a 0.5 ng pooled plasma input DNA and amplified for 14 cycles. All of the adaptors produced similar amplification results with the 10 or 8 bp stem adaptors amplifying slightly better than the 12 bp stem adaptors (FIG. 4) possibly due to better strand displacement during PCR. All of the bud adaptors showed a nice delta Ct between the samples and NTC libraries.
  • a population of double-stranded nucleic acid adaptors for ligating to a population of nucleic acid target molecules comprising:
  • asymmetric loop region between the ligateable stem region and the non-ligateable stem region, wherein the asymmetric loop region comprises a molecular identification sequence (MIS).
  • double-stranded adaptors are further defined as a single-stranded nucleic acid molecule that under ligation conditions forms a stem-loop adaptor having a distal loop region attached to the non-ligateable stem region.
  • the double-stranded adaptors within the population comprise a mixture of double-stranded adaptors with a first primer binding site and double-stranded adaptors with a second primer binding site.
  • the ligateable stem region further comprises a variable stem region defined as a region whose length varies among the members of the population.
  • non-ligateable stem region comprises one or more mismatched bases.
  • nucleic acid target molecules comprises genomic DNA, fragmented DNA, cDNA, amplified DNA, or a nucleic acid library.
  • non-replicable base comprises a deoxyuridine or a ribonucleotide base.
  • a gap region on the strand opposite the asymmetric loop region has a length of at least one nucleotide shorter than that of the asymmetric loop region.
  • variable stem comprises 8-11 nucleotides.
  • variable stem comprises 8, 9, 10, or 11 nucleotides.
  • the ligateable stem region further comprises one or more replication blocks or cleavable bases between a 3' terminal end or the 5' terminal end and the asymmetric loop or gap region.
  • non-ligateable stem region further comprises one or more replication blocks or cleavable bases between the asymmetric loop or gap region and the distal loop region.
  • a method for producing a library of adaptor-bound target nucleic acids comprising:
  • strand displacement or nick translation polymerization is further defined as polymerization that ceases at a non-replicable base or region in the asymmetric loop or in a region of the non-ligateable stem region adjacent to the asymmetric loop.
  • a method for producing a library of adaptor-bound target nucleic acids comprising:
  • strand displacement or nick translation polymerization is further defined as polymerization that ceases at a non-replicable base or region in the asymmetric loop or in a region of the non-ligateable stem adjacent to the asymmetric loop.
  • a method for producing a library of adaptor-bound target nucleic acids comprising: (a) providing a population of target nucleic acid molecules, attaching to one end a first double-stranded nucleic acid adaptor according to any one of clauses 1-45, and attaching to the other end a second double-stranded nucleic acid adaptor optionally comprising a MIS, thereby generating a population of adapter-bound target nucleic acid molecules; and
  • first double-stranded nucleic acid adaptor comprises a first primer binding site in the non-ligateable stem region and the second double-stranded nucleic acid adaptor comprises a second primer binding site in the non-ligateable stem region.

Abstract

Provided herein are double-stranded nucleic acid adaptors comprising molecular identification sequences. Further provided are methods of using the double-stranded nucleic acid adaptors for generating nucleic acid libraries, e.g., for amplification and sequencing.

Description

NUCLEIC ACID ADAPTORS WITH MOLECULAR IDENTIFICATION
SEQUENCES AND USE THEREOF
CROSS-REFERENCE TO RELATED APPLICATION
[0001] Pursuant to 35 U.S.C. § 119 (e), this application claims priority to the filing date of the United States Provisional Patent Application Serial No. 62/372,543, filed August 9, 2016; the disclosure of which application is herein incorporated by reference.
INTRODUCTION
[0002] Supplementing DNA ends with additional short polynucleotide sequences, referred to as adaptors or linkers, is used in many areas of molecular biology such as whole genome amplification and sequencing. Barcodes can be used to identify nucleic acid molecules, for example, where sequencing can reveal a certain barcode coupled to a nucleic acid molecule of interest. In some instances, a sequence-specific event can be used to identify a nucleic acid molecule, where at least a portion of the barcode is recognized in the sequence-specific event, e.g., at least a portion of the barcode can participate in a ligation or extension reaction. The barcode can therefore allow identification, selection or amplification of DNA molecules that are coupled thereto.
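As a minimal sketch of how a barcode revealed by sequencing can identify the molecules coupled to it, the Python snippet below groups reads by a leading barcode. It assumes, purely for illustration, that the barcode occupies a fixed number of bases at the start of each read and is read without error; the function name `group_by_barcode` and the example reads are hypothetical.

```python
from collections import defaultdict


def group_by_barcode(reads: list[str], barcode_length: int) -> dict[str, list[str]]:
    """Group reads by the barcode found in their first `barcode_length` bases.

    Returns a mapping from barcode to the remaining (non-barcode) part of
    each read that carried it. Reads too short to hold a barcode plus at
    least one further base are skipped.
    """
    groups = defaultdict(list)
    for read in reads:
        if len(read) <= barcode_length:
            continue  # nothing left after removing the barcode
        barcode, insert = read[:barcode_length], read[barcode_length:]
        groups[barcode].append(insert)
    return dict(groups)


if __name__ == "__main__":
    reads = ["ACGTTTGA", "ACGTCCAA", "GGCATTGA"]
    for barcode, inserts in group_by_barcode(reads, 4).items():
        print(barcode, inserts)
```

Once reads are grouped this way, each group can be selected or analyzed separately, which is the identification role the barcode plays in the text.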
[0003] There are a variety of technologies to generate libraries of target nucleic acid samples tagged with barcodes. Generally, fragments of genomic DNA (gDNA) are ligated to adaptors, where at least one end of each fragment of genomic DNA is ligated to an adaptor including a barcode. The ligated adaptors and gDNA fragments may be nick repaired, size selected, and amplified by PCR with primers directed to the adaptors to produce an amplified library. For example, ligation adaptors each including one of 16 different barcodes can be used to prepare 16 different gDNA samples, each with a unique barcode, such that either each sample can be amplified separately by PCR using the same PCR primers and then pooled (mixed together) or each sample can be pooled first and then simultaneously amplified using the same PCR primers. Although there are many methods to use barcodes to tag samples, there are unmet needs to tag individual molecules within the input DNA sample. By tagging an input molecule with a unique molecular identification (i.e., UMI) sequence, the sequence and other properties of that single molecule can be distinguished from the sequences and other properties of other input molecules. These molecular tags are essential to identify and quantify different molecules in the same input sample that otherwise would be indistinguishable on the basis of sequence or other properties. Multiple sequences that have been amplified with the same molecular tag can be grouped together to remove artifacts that are created during the library amplification and sequencing processes.
SUMMARY
[0004] Provided herein are double-stranded nucleic acid adaptors comprising molecular identification sequences. Further provided are methods of using the double-stranded nucleic acid adaptors for generating nucleic acid libraries, e.g., for amplification and sequencing.
[0005] A first embodiment of the present disclosure provides a population of double-stranded nucleic acid adaptors for ligating to a population of nucleic acid target molecules, the double-stranded adaptors comprising a ligateable stem region having a terminal 5' end strand and a terminal 3' end strand; a non-ligateable stem region having a terminal 5' end strand and a terminal 3' end strand; and an asymmetric loop region between the ligateable stem region and the non-ligateable stem region, wherein the asymmetric loop region comprises a molecular identification sequence (MIS).
[0006] In some aspects, the double-stranded nucleic acid adaptors are further defined as a single-stranded nucleic acid molecule that under ligation conditions forms a stem-loop adaptor having a distal loop region attached to the non-ligateable stem region. In some aspects, the distal loop region comprises a non-replicable base. In certain aspects, the non-replicable base comprises an abasic site. In particular aspects, the abasic site comprises a 1',2'-dideoxyribose. In specific aspects, the non-replicable base comprises a deoxyuridine or a ribonucleotide base.
[0007] In certain aspects, the non-ligateable stem region comprises a primer binding site. In particular aspects, the double-stranded adaptors within the population comprise a mixture of double-stranded adaptors with a first primer binding site and double-stranded adaptors with a second primer binding site. In certain aspects, the non-ligateable stem region comprises one or more mismatched bases.
[0008] In additional aspects, the ligateable stem region further comprises a variable stem region defined as a region whose length varies among the members of the population. In some aspects, the variable stem comprises 4-15 nucleotides. In certain aspects, the variable stem comprises 8-11 nucleotides. In particular aspects, the variable stem comprises 8, 9, 10, or 11 nucleotides.
[0009] In some aspects, the molecular identification sequence is unique to a subset of the population. In certain aspects, the molecular identification sequence is degenerate to a subset of the population. In further aspects, the molecular identification sequence is unique to a subset of the population and degenerate to another subset of the population.
[0010] In certain aspects, the asymmetric loop region is formed between the terminal 3' end strand of the ligateable strand and the terminal 5' end strand of the non-ligateable strand. In other aspects, the asymmetric loop region is formed between the terminal 5' end strand of the ligateable strand and the terminal 3' end strand of the non-ligateable strand.
[0011] In some aspects, the double-stranded nucleic acid adaptors comprise DNA. In certain aspects, the double-stranded nucleic acid adaptors comprise RNA. In some aspects, the double-stranded nucleic acid adaptors comprise DNA and RNA.
[0012] In certain aspects, the population of nucleic acid target molecules comprise genomic DNA, fragmented DNA, cDNA, amplified DNA, or a nucleic acid library.
[0013] In further aspects, a gap region on the strand opposite the asymmetric loop region has a length of at least one nucleotide shorter than that of the asymmetric loop region. In some aspects, the length of the gap region is at least 2 nucleotides shorter than that of the asymmetric loop region. In certain aspects, the length of the gap region is less than 5 nucleotides. In particular aspects, the length of the gap region is 1 nucleotide. In specific aspects, the gap region has a length of a bond between two adjacent nucleotides. In some aspects, the gap region comprises a spacer incapable of base-pairing. In certain aspects, the spacer comprises an abasic site. In particular aspects, the abasic site comprises a 1',2'-dideoxyribose.
[0014] In some aspects, the asymmetric loop region is 4 to 16 nucleotides in length. In certain aspects, the asymmetric loop region is 5 to 8 nucleotides in length. In particular aspects, the asymmetric loop region is 6 nucleotides in length.
[0015] In certain aspects, the molecular identification sequence comprises 5-10 nucleotides. In particular aspects, the molecular identification sequence comprises 6, 7, or 8 nucleotides. In specific aspects, the molecular identification sequence comprises 6 nucleotides. In some aspects, the molecular identification sequence is unique throughout the population. In further aspects, the molecular identification sequence is partially degenerate within the population. In certain aspects, the molecular identification sequence is degenerate within the population.
[0016] In some aspects, a 5' terminal end and/or 3' terminal end of the ligateable stem region comprise nucleotides having phosphorothioate linkages.
[0017] In certain aspects, a 5' terminal end of the ligateable stem comprises a ligation block. In some aspects, the ligation block is a dephosphorylated nucleotide, a 5' hydroxyl group, a dideoxy nucleotide, or an inverted dT. In further aspects, the ligateable stem region further comprises one or more replication blocks or cleavable bases between a 3' terminal end or the 5' terminal end and the asymmetric loop or gap region. In additional aspects, the non-ligateable stem region further comprises one or more replication blocks or cleavable bases between the asymmetric loop or gap region and the distal loop region. In certain embodiments, the cleavable base is inosine, uracil, or ribonucleotide.
[0018] In another embodiment, there is provided a method for producing a library of adaptor-bound target nucleic acids comprising providing a population of target nucleic acid molecules and attaching to each end a double-stranded nucleic acid adaptor according to the embodiments, thereby generating a population of adaptor-bound target nucleic acid molecules; and replacing one strand of the adaptor-bound target nucleic acid molecules by strand displacement or nick translation to make an exact copy of the asymmetric loop. In certain instances, the resultant library of adaptor-bound target nucleic acids has an MIS domain on each strand of the resultant double-stranded molecule, which results in the ability to determine, during amplification, which strands came from the same original molecule. This ability may increase sequencing accuracy and the ability to identify/alleviate bias as compared to conventional methods in which a UMI (i.e., unique molecular identifier) is attached to only one strand. In such conventional methods, the UMI-containing strand gets amplified and, if errors arise, it is not possible to determine whether such errors arise solely from Crick strand errors or also from Watson strand errors. In contrast, in embodiments of the present invention where each strand has the same MIS before amplification, bias may be identified and, if identified, compensation for such bias may be made. In certain of such embodiments, the methods may include determining amplification bias, e.g., PCR bias, e.g., by counting molecules, based on two MIS sequences from the same double-stranded molecule (or two complementary MIS sequences from the same double-stranded molecule). In some aspects, the adaptor-bound nucleic acid molecule comprises a nick having a 3' hydroxyl group. In certain aspects, the strand displacement or nick translation polymerization is further defined as polymerization that ceases at a non-replicable base or region in the asymmetric loop or in a region of the non-ligateable stem region adjacent to the asymmetric loop. A further embodiment provides a library of adaptor-bound target nucleic acids produced by the methods of the embodiments.
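By way of illustration only, the strand comparison described in paragraph [0018] may be sketched informatically as follows. This is a minimal sketch and not a method recited in this disclosure: the function names, the `(mis, strand, base)` observation format, and the convention that bases are reported in reference orientation are all assumptions introduced for the example.

```python
from collections import Counter, defaultdict


def strand_consensus(observations):
    """Group base calls at one position by MIS and strand.

    observations: iterable of (mis, strand, base) tuples, where strand is
    "+" (Watson) or "-" (Crick) and base is given in reference orientation.
    This input format is hypothetical.
    Returns {mis: {strand: most frequent base on that strand}}.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    for mis, strand, base in observations:
        counts[mis][strand][base] += 1
    return {
        mis: {s: c.most_common(1)[0][0] for s, c in strands.items()}
        for mis, strands in counts.items()
    }


def duplex_call(consensus_by_strand):
    """Return the base only if both strands are present and agree, else None."""
    if set(consensus_by_strand) != {"+", "-"}:
        return None
    plus, minus = consensus_by_strand["+"], consensus_by_strand["-"]
    return plus if plus == minus else None


observations = [
    ("AACCGG", "+", "T"), ("AACCGG", "+", "T"), ("AACCGG", "-", "T"),
    ("TTGGCC", "+", "T"), ("TTGGCC", "+", "T"), ("TTGGCC", "-", "C"),
]
consensus = strand_consensus(observations)
calls = {mis: duplex_call(strands) for mis, strands in consensus.items()}
```

In this toy input, the molecule tagged `AACCGG` shows the same base on both strands, whereas the molecule tagged `TTGGCC` shows a base on one strand only, which is the pattern expected of a single-strand error rather than a change present in the original double-stranded molecule.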
[0019] In some aspects, the population of adaptor-bound target nucleic acid molecules comprises a first double-stranded nucleic acid adaptor with a first primer binding site attached on one end and a second double-stranded nucleic acid adaptor with a second primer binding site attached on the other end.
[0020] In certain aspects, attaching is further defined as ligating. In some aspects, attaching is further defined as double strand ligation. In other aspects, attaching is further defined as single strand ligation. In particular aspects, attaching is further defined as ligating a double-stranded nucleic acid adaptor to complementary single strands. In specific aspects, attaching is further defined as blunt end ligation. In one particular aspect, attaching is further defined as ligation to an overhang.
[0021] In yet another embodiment, there is provided a method for producing a library of adaptor-bound target nucleic acids comprising providing a population of target nucleic acid molecules and attaching to one end a double-stranded nucleic acid adaptor according to the embodiments, thereby generating a population of adaptor-bound target nucleic acid molecules; and displacing one strand of the adaptor-bound target nucleic acid molecules by strand displacement or nick translation, such that a complementary copy of the asymmetric loop is incorporated into the replaced strand. In some aspects, the adaptor-bound nucleic acid molecule comprises a nick having a 3' hydroxyl group. In certain aspects, the strand displacement or nick translation polymerization is further defined as polymerization that ceases at a non-replicable base or region in the asymmetric loop or in a region of the non-ligateable stem adjacent to the asymmetric loop. A further embodiment provides a library of adaptor-bound target nucleic acids produced by the methods of the embodiments.
[0022] In certain aspects, attaching is further defined as ligating. In some aspects, attaching is further defined as double strand ligation. In other aspects, attaching is further defined as single strand ligation. In particular aspects, attaching is further defined as ligating a double-stranded nucleic acid adaptor to complementary single strands. In specific aspects, attaching is further defined as blunt end ligation. In one particular aspect, attaching is further defined as ligation to an overhang.
[0023] Another embodiment provides a method for producing a library of adaptor-bound target nucleic acids comprising providing a population of target nucleic acid molecules, attaching to one end a first double-stranded nucleic acid adaptor according to the embodiments, and attaching to the other end a second double-stranded nucleic acid adaptor optionally comprising a MIS, thereby generating a population of adaptor-bound target nucleic acid molecules; and replacing one strand of the adaptor-bound target nucleic acid molecules by strand displacement or nick translation such that a complementary copy of the second strand is incorporated into the replaced strand. In some aspects, the adaptor-bound nucleic acid molecule comprises a nick having a 3' hydroxyl group. In certain aspects, the strand displacement or nick translation polymerization is further defined as polymerization that ceases at a non-replicable base or region in the loop or in a region of the stem adjacent to the loop.
[0024] In some aspects, the second double-stranded nucleic acid adaptor does not comprise a MIS. In certain aspects, the second double-stranded nucleic acid adaptor does not comprise an asymmetric loop.
[0025] In particular aspects, the first double-stranded nucleic acid adaptor and/or second double-stranded nucleic acid adaptor are stem-loop adaptors. In some aspects, the first double-stranded nucleic acid adaptor comprises a first primer binding site in the non-ligateable stem region and the second double-stranded nucleic acid adaptor comprises a second primer binding site in the non-ligateable stem region.
[0026] In certain aspects, attaching is further defined as ligating. In some aspects, attaching is further defined as double strand ligation. In other aspects, attaching is further defined as single strand ligation. In particular aspects, attaching is further defined as ligating a double-stranded nucleic acid adaptor to complementary single strands. In specific aspects, attaching is further defined as blunt end ligation. In one particular aspect, attaching is further defined as ligation to an overhang.
[0027] In additional aspects, the method further comprises preventing MIS switching. In some aspects, the method further comprises preventing MIS switching by contacting the library of adaptor-bound target nucleic acids and excess double-stranded nucleic acid adaptors with terminal deoxyribonucleotidyl transferase (TdT). In some aspects, the method further comprises preventing MIS switching by performing PCR purification on the adaptor-bound target nucleic acids and excess double-stranded nucleic acid adaptors. In certain aspects, the first or second double-stranded nucleic acid adaptors comprise one or more uracils within the ligateable stem region. In some aspects, the method further comprises contacting the library of adaptor-bound target nucleic acids and excess double-stranded nucleic acid adaptors with a uracil excision reagent, e.g., a combination of Uracil DNA glycosylase (UDG) and the DNA glycosylase-lyase Endonuclease VIII (such as the USER™ enzyme). In some aspects, the method further comprises contacting the library and excess adaptors with exonuclease I and incubating at a non-denaturing temperature prior to performing PCR.
[0028] Before various aspects of the present disclosure are described in greater detail, it is to be understood that the methods and compositions described herein are not limited to particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the methods will be limited only by the appended claims.
[0029] Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range, is encompassed within the methods. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges and are also encompassed within the methods, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the methods.
[0030] Certain ranges are presented herein with numerical values being preceded by the term "about." The term "about" is used herein to provide literal support for the exact number that it precedes, as well as a number that is near to or approximately the number that the term precedes. In determining whether a number is near to or approximately a specifically recited number, the near or approximating unrecited number may be a number which, in the context in which it is presented, provides the substantial equivalent of the specifically recited number.
[0031] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the methods belong. Although any methods similar or equivalent to those described herein can also be used in the practice or testing of the methods, representative illustrative methods and materials are now described.
[0032] All publications and patents cited in this specification are herein incorporated by reference as if each individual publication or patent were specifically and individually indicated to be incorporated by reference and are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited. The citation of any publication is for its disclosure prior to the filing date and should not be construed as an admission that the present methods are not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.
[0033] It is noted that, as used herein and in the appended claims, the singular forms "a", "an", and "the" include plural referents unless the context clearly dictates otherwise. It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as "solely," "only" and the like in connection with the recitation of claim elements, or use of a "negative" limitation.
[0034] It is appreciated that certain features of the methods, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the methods, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination. All combinations of the embodiments are specifically embraced by the present invention and are disclosed herein just as if each and every combination was individually and explicitly disclosed, to the extent that such combinations embrace operable processes and/or devices/systems/kits. In addition, all sub-combinations listed in the embodiments describing such variables are also specifically embraced by the present methods and are disclosed herein just as if each and every such subcombination was individually and explicitly disclosed herein.
[0035] As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present methods. Any recited method can be carried out in the order of events recited or in any other order which is logically possible.
BRIEF DESCRIPTION OF THE DRAWINGS
[0036] The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.
[0037] FIG. 1A-1D. (FIG. 1A) Schematic depicting designs of non-degradable bud adaptors (YB) comprising a molecular identification sequence (MIS) in an asymmetric loop, an abasic site (e.g., 1',2'-Dideoxyribose (idSp)) opposite the MIS to facilitate correct adaptor folding, abasic sites in the distal loop to function as polymerase terminators, a variable-length stem to mediate ligation to nucleic acid molecules and provide base diversity, and phosphorothioate bonds (*) to prevent degradation through exonuclease activity and decrease adaptor dimer formation. (FIG. 1B) Schematic depicting design of non-degradable bubble design (BB) comprising a MIS in a symmetric loop, abasic site(s) opposite the MIS, abasic sites in the main loop, and phosphorothioate bonds. (FIG. 1C) Schematic depicting design of degradable bubble with a self-complementary byproduct (RB) formed by "fold-back" synthesis that replicates the non-degraded bubble. (FIG. 1D) Exemplary sequences of dephased stem region.
[0038] FIG. 2. Schematic depicting process for library construction starting with ligation of non-degradable bud adaptors to gDNA and extension by strand displacement, followed by amplification resulting in $4^{(6+6)} = 16.8 \times 10^{6}$ combinations.
[0039] FIG. 3A-3B. (FIG. 3A) Data from a NextSeq™ sequencing run of libraries prepared with 6 different designs of bud adaptors with a 5' phosphorothioate bond. Sequencing was analyzed with the Illumina® sample barcode trimmed and the additional adaptor sequence including the molecular barcode remaining. Run480 Trimmed was mapped to the human genome and metrics measured after manually trimming the entire molecular barcode and adaptor sequence. (FIG. 3B) Sequencing results as Q scores of the individual nucleotides starting with the MIS, proceeding across the ligateable stem and into the gDNA region, using bud adaptors comprised of two different ligateable stem lengths.
[0040] FIG. 4. Amplification curves of libraries using the bud adaptor with or without a 5' phosphorothioate bond (5PT) and various stem lengths (e.g., 8, 10, and 12 nucleotides). (NTC: No template control).
[0041] FIG. 5A-5C. (FIG. 5A) Sequencing data from a run including bud adaptors with variable length stems. (FIG. 5B-5C) Input titration of 4-stem dephased bud adaptors (trimmed (FIG. 5B) and untrimmed (FIG. 5C)).
DETAILED DESCRIPTION
[0042] A significant problem in next generation sequencing (NGS) is distinguishing duplicate sequencing reads. Duplicates can be broadly categorized into two families: amplification duplicates and biological duplicates. Amplification duplicates, also known as PCR duplicates, arise as a consequence of amplification during library preparation prior to sequencing, or are generated on the flow cell of the sequencer. Biological duplicates or molecular duplicates, on the other hand, may be the result of the generation of two identical DNA or RNA fragments arising from biological means such as the generation of multiple short, identical mRNA molecules, or as a result of random fragmentation or enzymatic fragmentation of a number of copies of the genome. By tagging individual DNA or RNA molecules with random or unique sequences during ligation, prior to amplification, individual source molecules may be distinguished from one another, as each will have different unique identifiers. Following amplification, errors arising from PCR or sequencing can be informatically determined, and differentiated from true biological mutations. Both types of data can then be used for interpretation of sequencing data.
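By way of illustration only, the distinction drawn above between amplification duplicates and biological duplicates may be sketched as a simple grouping of reads. This is a minimal sketch under stated assumptions: the `(chrom, start, end, tag)` read representation and the function names are hypothetical and are not taken from this disclosure, and sequencing errors within the tag are ignored.

```python
from collections import Counter, defaultdict


def group_by_fragment(reads):
    """Group reads by mapped fragment coordinates, then count reads per tag.

    reads: iterable of (chrom, start, end, tag) tuples (hypothetical format),
    where tag is the random sequence attached during ligation.
    """
    families = defaultdict(Counter)
    for chrom, start, end, tag in reads:
        families[(chrom, start, end)][tag] += 1
    return families


def summarize(families):
    """Return {coords: (source_molecules, amplification_duplicates)}.

    Distinct tags at the same coordinates are treated as distinct source
    molecules (biological duplicates); extra reads that share both
    coordinates and tag are treated as amplification duplicates.
    """
    summary = {}
    for coords, tag_counts in families.items():
        source_molecules = len(tag_counts)
        amplification_duplicates = sum(tag_counts.values()) - source_molecules
        summary[coords] = (source_molecules, amplification_duplicates)
    return summary


reads = [
    ("chr1", 100, 250, "ACGTAC"),
    ("chr1", 100, 250, "ACGTAC"),
    ("chr1", 100, 250, "ACGTAC"),
    ("chr1", 100, 250, "TTGACA"),
    ("chr2", 5, 90, "GGCATA"),
]
summary = summarize(group_by_fragment(reads))
```

Here the four reads at the `chr1` coordinates resolve into two source molecules and two amplification duplicates, whereas without the tag all four would be indistinguishable.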
[0043] Accordingly, the present disclosure overcomes challenges associated with current technologies by providing double-stranded nucleic acid adaptors, such as stem-loop nucleic acid adaptors, with molecular identification sequences (i.e., MIS). These double-stranded nucleic acid adaptors can be used to tag individual template fragments with an MIS to create libraries which can be amplified and sequenced. In particular, the MIS can be used to identify, distinguish, and use duplicated sequences arising from both the biological source and subsequent amplifications. Accordingly, by tagging the molecules with a random sequence during ligation, rather than during amplification, the MIS allows for the differentiation between PCR duplicates and true biological duplicates, enabling PCR error correction and quantitative detection of low-frequency alleles with high statistical confidence. The adaptors with MIS provide the ability to distinguish between, and use the data from, amplification and biological duplicates in subsequent analyses as opposed to current methods of analysis which discard duplicates regardless of source. For example, the ligation of an adaptor with different molecular identification sequences of 6 base pairs each to each end of a template fragment would allow for $4^{6} \times 4^{6} = 16.8 \times 10^{6}$ total combinations of MIS adaptors within a library. Thus the likelihood of two identical sequences coming from different starting molecules having the same MIS is extremely low and biological duplicates could be parsed from subsequent amplification duplicates.
[0044] In further aspects, the adaptors may comprise replication stops such as non-replicable bases which can be used to stop replication at specific locations in the adaptor. The unreacted stem-loop adaptors are self-complementary and therefore usually unable to participate in unwanted priming reactions that might otherwise lead to MIS switching. Adaptors provided herein may be degradable or non-degradable adaptors. In some instances, the adaptors have an MIS, which may be located in an asymmetric loop (also referred to herein as a "bud") on either stem strand. The nucleotides between the loops may comprise a mismatched base. Across from the asymmetric loop may be an unpaired gap region which can comprise only the bond between adjacent nucleotides, unpaired nucleotides, or at least one non-replicable base or spacer, e.g., to reduce barcode bias, allow for correct folding of the adaptor and/or to prevent collapse of the asymmetric loop structure. Any convenient non-replicable base may be employed, where examples of non-replicable bases include, but are not limited to: abasic lesions (e.g., a tetrahydrofuran derivative, 1',2'-Dideoxyribose (idSp), etc.); nucleotide adducts; iso-nucleotide bases (e.g., isocytosine, isoguanine, and the like), and any combination thereof, etc. Where desired, the non-replicable bases can also be present in the distal loop of the stem-loop adaptor, e.g., to function as polymerase terminators. In addition, the stem-loop adaptor can also have phosphorothioate bonds on the 3' terminal end and/or the 5' terminal end to protect the adaptor from degradation by proofreading enzymes and prevent adaptor dimers to optimize the signal-to-noise ratio.
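By way of illustration only, the combinatorial figure given above may be checked with the short calculation below. This is a minimal sketch: the function names are hypothetical, and the collision estimate assumes that every MIS is drawn uniformly at random and independently at each end, which is an assumption of the example rather than a statement of this disclosure.

```python
def mis_combinations(mis_length, ends=2, alphabet_size=4):
    """Number of distinct MIS combinations when each of `ends` fragment ends
    carries an independent MIS of `mis_length` bases."""
    return alphabet_size ** (mis_length * ends)


def collision_probability(n_molecules, mis_length, ends=2):
    """Probability that at least two of `n_molecules` otherwise identical
    fragments receive the same MIS combination (birthday-problem estimate,
    assuming uniform, independent MIS assignment)."""
    space = mis_combinations(mis_length, ends)
    p_all_distinct = 1.0
    for i in range(n_molecules):
        p_all_distinct *= (space - i) / space
    return 1.0 - p_all_distinct


total = mis_combinations(6)            # 4**12 = 16,777,216, i.e. about 16.8 x 10^6
p_pair = collision_probability(2, 6)   # chance that two identical fragments share a tag pair
```

Under these assumptions, two identical fragments share the same pair of 6-base MIS with probability 1 in 16,777,216.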
[0045] In certain embodiments, the adaptors can have variable stem lengths to provide sufficient base diversity in the library at the beginning of the read for cluster detection and intensity correction, such as by RTA software, without the use of a control nucleic acid library, such as the PhiX control nucleic acid library. In addition, the variable stems can add further unique data which can add to the level of barcoding, e.g., where the variable stem may, where desired, be employed in combination with an MIS domain to provide a unique molecular barcode. Diversity in the stems allows for low amplification background and the generation of fewer unmapped reads during sequencing. The variable stem lengths also provide extra unique sequence information for informatic analysis of sequencing data.
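By way of illustration only, the effect of variable stem lengths on base diversity may be sketched by counting the distinct bases observed at each sequencing cycle across a set of stems. This is a minimal sketch: the stem sequences below are hypothetical placeholders chosen for the example and are not the exemplary sequences of FIG. 1D.

```python
def per_cycle_diversity(prefixes):
    """Number of distinct bases observed at each cycle across read prefixes.

    Only cycles covered by every prefix are reported, so the result has the
    length of the shortest prefix.
    """
    n_cycles = min(len(p) for p in prefixes)
    return [len({p[i] for p in prefixes}) for i in range(n_cycles)]


# Hypothetical stems: one fixed-length design versus a dephased design in
# which the same core is offset by 0, 1, 2, or 3 extra bases (8-11 nt).
fixed_stems = ["ACGTTGCA"] * 4
dephased_stems = ["ACGTTGCA", "TACGTTGCA", "GTACGTTGCA", "CGTACGTTGCA"]

fixed_diversity = per_cycle_diversity(fixed_stems)
dephased_diversity = per_cycle_diversity(dephased_stems)
```

With the fixed design every cycle across the stem presents a single base, whereas the dephased design presents at least two different bases at every cycle in this toy example.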
[0046] In certain embodiments, a population of adaptors may include a common barcode domain, e.g., a region or sequence of nucleic acids that is common or the same among the population of adaptors and serves as a barcode or identifier for a source of target nucleic acids to which the adaptors are ligated during use. Such a barcode domain may serve as an identifier of a sample from which the target nucleic acids are obtained, such that it may be viewed as a sample barcode. When present, the barcode domain may be positioned at any convenient location in the adaptor, such as the non-ligateable stem region of the adaptor, the asymmetric loop of the adaptor, etc.
[0047] In certain embodiments where the adaptors include a barcode domain, the barcode domain may be combined with the MIS domain, e.g., such that the adaptors include a barcode/MIS domain. In such instances, a barcode/MIS domain is made up of a series of interspersed barcode and MIS bases. By interspersed is meant that the bases which are barcode bases (i.e., the bases that collectively make up the barcode component of a barcode/MIS domain) are distributed or positioned among MIS bases (i.e., the bases that collectively make up the MIS domain of a barcode/MIS domain). As such, a given barcode/MIS domain is one that includes at least one MIS base positioned adjacent to at least one barcode base, where in those instances in which the barcode/MIS domain is made up of 3 or more bases, at least two bases of a first type (e.g., MIS or barcode) may be separated by at least one base of another type (e.g., MIS or barcode). The length of a given barcode/MIS domain may vary, ranging in some instances from 4 to 50 nts, where in some instances the length ranges from 5 to 25 nts, e.g., 6 to 20 nts, where specific lengths of interest include, but are not limited to: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16 nts. Further details regarding such domains may be found in United States Application Serial No. 62/401,676; the disclosure of which is herein incorporated by reference. When present, the barcode/MIS domain may be positioned at any convenient location in the adaptor, such as the non-ligateable stem region of the adaptor, the asymmetric loop of the adaptor, etc.
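By way of illustration only, an interspersed barcode/MIS domain of the kind described above may be separated into its two components with a per-position layout string. This is a minimal sketch: the layout notation ("B" for a barcode base, "M" for an MIS base), the example layout, and the function name are assumptions introduced for the example and are not defined in this disclosure.

```python
from typing import Tuple


def split_barcode_mis(domain: str, layout: str) -> Tuple[str, str]:
    """Split an interspersed barcode/MIS domain into (barcode, mis).

    layout marks each position as "B" (barcode base) or "M" (MIS base);
    this notation is hypothetical.
    """
    if len(domain) != len(layout):
        raise ValueError("domain and layout must have the same length")
    if set(layout) - {"B", "M"}:
        raise ValueError("layout may contain only 'B' and 'M'")
    barcode = "".join(base for base, kind in zip(domain, layout) if kind == "B")
    mis = "".join(base for base, kind in zip(domain, layout) if kind == "M")
    return barcode, mis


# Hypothetical 6-nt domain with three barcode bases and three MIS bases.
barcode, mis = split_barcode_mis("ACGTAC", "BMBMMB")
```

The barcode bases would be the same for every adaptor in a population used with one sample, while the MIS bases would differ between adaptors.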
[0048] In addition, the present disclosure provides methods for the production of libraries using the nucleic acid adaptors disclosed herein which are compatible with the major sequencing platforms including, but not limited to, Illumina's MiSeq, NextSeq, and HiSeq using conventional flow cells. Libraries are also compatible with hybridization capture target enrichment platforms, such as those manufactured by Agilent and NimbleGen. Libraries produced with the non-degradable bud stem-loop adaptors performed well across a wide target input range, including low inputs which can be difficult to amplify and sequence. The template fragments can include DNA, such as cell-free DNA (cfDNA) isolated from plasma, urine, or cerebrospinal fluid samples or isolated genomic DNA (gDNA) which has been subjected to fragmentation, or RNA. To create the libraries, the template fragments are end repaired and at least one end of each template fragment is ligated to a stem-loop adaptor. The template fragment may have an adaptor with a MIS on one end and an adaptor without an MIS on the other end. In other aspects, the template fragment may have an adaptor with an MIS on both ends.
[0049] In one method, a stem-loop adaptor comprising a first primer binding site in the non- ligateable stem is ligated to one end of a template nucleic acid and a stem-loop adaptor comprising a second primer binding site in the non-ligateable stem is ligated to the other end of the template nucleic acid. The two stem-loop adaptors may comprise different sequences or comprise a common sequence to promote suppression of amplification of short ligation products, i.e., adaptor dimers or short inserts that will be suppressed to reduce background from the unwanted adaptor dimers or short gDNA fragments. Accordingly, the use of two stem-loop adaptors prevents suppression of amplification that is inherent to the use of a single adaptor, which facilitates sequencing on sequencing platforms such as those manufactured by Illumina. In addition, the suppression prevents molecules which have two copies of the first adapter with the first primer binding site or two copies of the second adapter with the second primer binding site from amplification.
[0050] Further, methods of preventing MIS switching are provided herein. MIS switching is non-specific replacement of a MIS, originally assigned to a target molecule by the attachment chemistry of a library preparation. MIS switching may occur during PCR amplification, where residual adaptors or their byproducts carried over into the PCR step act as primers that randomly replace the original MIS with a different (non-specific) MIS. In one method, after completion of the ligation and nucleic acid extension steps, terminal deoxyribonucleotidyl transferase (TdT) is used to add 3' tails to the unreacted stem-loop adaptors and, therefore, block priming of the unligated adaptors to the library molecules. The TdT is then inactivated and will not interfere with PCR. In another method, AMPure® clean-up can be used after ligation and prior to PCR to remove unreacted adaptors. Another method to reduce MIS switching involves inactivation of unligated stem-loop adaptors. In this method, the stem-loop adaptor comprises one or more uracil residues within the terminal 5' end strand of the ligateable stem region or within the terminal 3' end strand of the non-ligateable stem region which are converted to abasic sites and degraded by a suitable enzyme, such as USER™ (Uracil-Specific Excision Reagent), to produce several short, single-stranded oligonucleotide products from the terminal 5' end strand of the ligateable stem region or the terminal 3' end strand of the non- ligateable stem region as well as an intact single-stranded oligonucleotide from the terminal 3' end strand of the ligateable stem region or the terminal 5' end strand of the non-ligateable stem region. Exonuclease I is then added to the PCR premix which is incubated a temperature too low to denature the DNA; thus, the short fragments from the unreacted stem-loop adaptors and adaptor dimers are degraded.
[0051] In another embodiment, MIS switching is reduced by adding one or more uracils, such as 2, 3, or 4 uracils, near the 3' terminal end of the ligateable stem region of a first adaptor (e.g., the adaptor with a first primer binding site) having the MIS. The second adaptor does not have a MIS and has the second primer binding site. After ligation, the residual activities of the repair enzymes will extend both 3' ends of the gDNA to make copies of the 5' ends of the adaptors even in the presence of uracil in one of the adaptors. Upon addition of the PCR polymerase, the 5' end of the second adaptor will not be replicated due to the uracil in the template; however, the 5' end of the first adaptor will be replicated to make a copy of the MIS. One strand will be replicated normally by PCR as it does not have uracil at its 5' end, and the second strand will not be PCR amplified as the PCR polymerase cannot read through uracil. This method of making a nucleic acid library from one of the two strands overcomes challenges associated with duplex sequencing. Duplex sequencing requires enough reads from the forward and reverse strand to get accurate sequencing from both. When only a single strand is sequenced, the sequencing resources can instead be devoted to sequencing the single strand more deeply or sequencing a strand from a different gDNA molecule. Thus, this method can be used in cases where very deep sequencing is not possible or when small insertions or deletions and translocations are being detected.
[0052] In some embodiments, un-ligated adaptor is removed by a 3' exonuclease active on both blunt and recessed 3' ends (e.g., E. coli exonuclease III), or a combination of a 5' exonuclease active on blunt and recessed 5' ends (e.g., E. coli exonuclease VIII) and a 3' exonuclease active on 3' protruding ends (e.g., exonuclease T). In the latter case, the 5' exonuclease can expose a 3' extension that can be a substrate for the 3' exonuclease. In all cases, when adaptors of the disclosure are fully ligated they may not have any free ends and therefore may be protected from cleavage by the exonucleases.
[0053] In some embodiments, a 3'-protected, un-extendable blocker oligonucleotide can be added to the amplification reaction. The blocker oligonucleotide can be at least 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 or more nucleotides in length. The blocker oligonucleotide can, in some instances, be at most 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 or more nucleotides in length. The blocker oligonucleotide can be fully complementary to the 3' end of the 3' strand of the adaptor. In some instances, the blocker oligonucleotide is fully complementary to the adaptor stem sequence (e.g., dephased stem). An excess of such blocker may prevent priming by the free adaptor yet may not prevent priming by the PCR primer, which does not contain the stem sequence and is therefore still able to anneal to and prime target nucleic acid molecules.
[0054] In certain aspects, the 3' terminal end of the ligateable stem region of the double-stranded nucleic acid adaptor is ligated to the 5' phosphate of the target fragment, leaving a nick between the 3' end of the target fragment and the 5' terminal end of the ligateable region of the double-stranded nucleic acid adaptor. Polymerase extension is then performed on the adaptor-bound template by extending the 3' end of the template fragment toward the end of the double-stranded nucleic acid adaptor, copying the molecular identification sequence, during strand displacement or nick translation. The adaptor-bound target fragments are then amplified to create libraries and may then be sequenced. Thus, the MIS adaptors described herein can be used for amplification and/or sequencing, such as to generate nucleic acid libraries for sequencing.
I. Definitions
[0055] "Nucleotide," as used herein, is a term of art that refers to a base-sugar-phosphate combination. Nucleotides are the monomeric units of nucleic acid polymers, i.e., of DNA and RNA. The term includes ribonucleotide triphosphates, such as rATP, rCTP, rGTP, or rUTP, and deoxyribonucleotide triphosphates, such as dATP, dCTP, dUTP, dGTP, or dTTP.
[0056] A "nucleoside" is a base-sugar combination, i.e., a nucleotide lacking a phosphate. It is recognized in the art that there is a certain inter-changeability in usage of the terms nucleoside and nucleotide. For example, the nucleotide deoxyuridine triphosphate, dUTP, is a deoxyribonucleoside triphosphate. After incorporation into DNA, it serves as a DNA monomer, formally being deoxyuridylate, i.e., dUMP or deoxyuridine monophosphate. One may say that one incorporates dUTP into DNA even though there is no dUTP moiety in the resultant DNA. Similarly, one may say that one incorporates deoxyuridine into DNA even though that is only a part of the substrate molecule.
[0057] The term "nucleic acid" or "polynucleotide" will generally refer to at least one molecule or strand of DNA, RNA, DNA-RNA chimera or a derivative or analog thereof, comprising at least one nucleobase, such as, for example, a naturally occurring purine or pyrimidine base found in DNA (e.g. adenine "A," guanine "G," thymine "T" and cytosine "C") or RNA (e.g. A, G, uracil "U" and C). The term "nucleic acid" encompasses the terms "oligonucleotide" and "polynucleotide." The term "oligonucleotide" refers to at least one molecule of between about 3 and about 100 nucleobases in length. The term "polynucleotide" refers to at least one molecule of greater than about 100 nucleobases in length. These definitions generally refer to at least one single-stranded molecule, but in specific embodiments will also encompass at least one additional strand that is partially, substantially, or fully complementary to at least one single-stranded molecule. Thus, a nucleic acid may encompass at least one double-stranded molecule or at least one triple-stranded molecule that comprises one or more complementary strand(s) or "complement(s)" of a particular sequence comprising a strand of the molecule. As used herein, a single stranded nucleic acid may be denoted by the prefix "ss", a double-stranded nucleic acid by the prefix "ds", and a triple stranded nucleic acid by the prefix "ts."
[0058] A "nucleic acid molecule" or "nucleic acid target molecule" refers to any single-stranded or double-stranded nucleic acid molecule including standard canonical bases, hypermodified bases, non-natural bases, or any combination of the bases thereof. For example and without limitation, the nucleic acid molecule contains the four canonical DNA bases - adenine, cytosine, guanine, and thymine, and/or the four canonical RNA bases - adenine, cytosine, guanine, and uracil. Uracil can be substituted for thymine when the nucleoside contains a 2'-deoxyribose group. The nucleic acid molecule can be transformed from RNA into DNA and from DNA into RNA. For example, and without limitation, mRNA can be converted into complementary DNA (cDNA) using reverse transcriptase and DNA can be transcribed into RNA using RNA polymerase. A nucleic acid molecule can be of biological or synthetic origin. Examples of nucleic acid molecules include genomic DNA, cDNA, RNA, a DNA/RNA hybrid, amplified DNA, a pre-existing nucleic acid library, etc. A nucleic acid may be obtained from a human sample, such as blood, serum, plasma, cerebrospinal fluid, cheek scrapings, biopsy, semen, urine, feces, saliva, sweat, etc. A nucleic acid molecule may be subjected to various treatments, such as repair treatments and fragmenting treatments. Fragmenting treatments include mechanical, sonic, and hydrodynamic shearing. Repair treatments include nick repair via extension and/or ligation, polishing to create blunt ends, removal of damaged bases, such as deaminated, derivatized, abasic, or crosslinked nucleotides, etc. A nucleic acid molecule of interest may also be subjected to chemical modification (e.g., bisulfite conversion, methylation / demethylation), extension, amplification (e.g., PCR, isothermal, etc.), etc.
[0059] "Analogous" forms of purines and pyrimidines are well known in the art, and include, but are not limited to aziridinylcytosine, 4-acetylcytosine, 5-fluorouracil, 5-bromouracil, 5-carboxymethylaminomethyl-2-thiouracil, 5-carboxymethylaminomethyluracil, inosine, N6-isopentenyladenine, 1-methyladenine, 1-methylpseudouracil, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-methyladenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5-methoxyuracil, 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid methylester, pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid, and 2,6-diaminopurine. The nucleic acid molecule can also contain one or more hypermodified bases, for example and without limitation, 5-hydroxymethyluracil, 5-hydroxyuracil, α-putrescinylthymine, 5-hydroxymethylcytosine, 5-hydroxycytosine, 5-methylcytosine, N4-methylcytosine, 2-aminoadenine, α-carbamoylmethyladenine, N6-methyladenine, inosine, xanthine, hypoxanthine, 2,6-diaminopurine, and N7-methylguanine. The nucleic acid molecule can also contain one or more non-natural bases, for example and without limitation, 7-deaza-7-hydroxymethyladenine, 7-deaza-7-hydroxymethylguanine, isocytosine (isoC), 5-methylisocytosine, and isoguanine (isoG).
In a nucleic acid molecule containing only canonical, hypermodified, or non-natural bases, or any combination thereof, each linkage between nucleotide residues can consist of a standard phosphodiester linkage; in addition, the molecule may contain one or more modified linkages, for example and without limitation, substitution of the non-bridging oxygen atom with a nitrogen atom (i.e., a phosphoramidate linkage), a sulfur atom (i.e., a phosphorothioate linkage), or an alkyl or aryl group (i.e., alkyl or aryl phosphonates), substitution of the bridging oxygen atom with a sulfur atom (i.e., phosphorothiolate), substitution of the phosphodiester bond with a peptide bond (i.e., peptide nucleic acid or PNA), or formation of one or more additional covalent bonds (i.e., locked nucleic acid or LNA, which has an additional bond between the 2'-oxygen and the 4'-carbon of the ribose sugar).
[0060] Nucleic acid(s) that are "complementary" or "complement(s)" are those that are capable of base-pairing according to the standard Watson-Crick, Hoogsteen or reverse Hoogsteen binding complementarity rules. As used herein, the term "complementary" or "complement(s)" may refer to nucleic acid(s) that are substantially complementary, as may be assessed by the same nucleotide comparison set forth above. The term "substantially complementary" may refer to a nucleic acid comprising at least one sequence of consecutive nucleobases, or semiconsecutive nucleobases if one or more nucleobase moieties are not present in the molecule, that is capable of hybridizing to at least one nucleic acid strand or duplex even if less than all nucleobases base pair with a counterpart nucleobase. In certain embodiments, a "substantially complementary" nucleic acid contains at least one sequence in which about 70%, about 71%, about 72%, about 73%, about 74%, about 75%, about 76%, about 77%, about 78%, about 79%, about 80%, about 81%, about 82%, about 83%, about 84%, about 85%, about 86%, about 87%, about 88%, about 89%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, to about 100%, and any range therein, of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization. In certain embodiments, the term "substantially complementary" refers to at least one nucleic acid that may hybridize to at least one nucleic acid strand or duplex in stringent conditions.
In certain embodiments, a "partially complementary" nucleic acid comprises at least one sequence that may hybridize in low stringency conditions to at least one single or double-stranded nucleic acid, or contains at least one sequence in which less than about 70% of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization.
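By way of illustration only, the percentage criterion in the preceding paragraph can be sketched in Python. The function names, the restriction to Watson-Crick DNA pairs, the requirement of equal-length ungapped sequences, and the hard 70% cutoff are assumptions made for this sketch; the definitions above also rely on hybridization behavior under stated stringency, which this calculation does not model.

```python
# Illustrative sketch: only Watson-Crick DNA pairs are considered (assumption).
WC_PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def percent_paired(strand: str, other: str) -> float:
    """Percent of positions in `strand` that pair with `other` aligned antiparallel.

    Both sequences are written 5'->3' and must have the same, non-zero length.
    """
    if not strand or len(strand) != len(other):
        raise ValueError("sequences must be non-empty and of equal length")
    antiparallel = other.upper()[::-1]  # read the second strand 3'->5'
    paired = sum((a, b) in WC_PAIRS for a, b in zip(strand.upper(), antiparallel))
    return 100.0 * paired / len(strand)


def classify(strand: str, other: str, cutoff: float = 70.0) -> str:
    """Label a pair using the 'about 70%' figure from the text as a hard cutoff (assumption)."""
    if percent_paired(strand, other) >= cutoff:
        return "substantially complementary"
    return "partially complementary"
```

For example, under these assumptions a 10-mer in which 7 positions pair would be labeled substantially complementary, and one in which 6 positions pair would be labeled partially complementary.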
[0061] "Incorporating," as used herein, means becoming part of a nucleic acid polymer.
[0062] "Oligonucleotide," as used herein, refers collectively and interchangeably to two terms of art, "oligonucleotide" and "polynucleotide." Note that although oligonucleotide and polynucleotide are distinct terms of art, there is no exact dividing line between them and they are used interchangeably herein. The term "adaptor" may also be used interchangeably with the terms "oligonucleotide" and "polynucleotide."
[0063] "Amplification," as used herein, refers to any in vitro process for increasing the number of copies of a nucleotide sequence or sequences. Nucleic acid amplification results in the incorporation of nucleotides into DNA or RNA. As used herein, one amplification reaction may consist of many rounds of DNA replication. For example, one PCR reaction may consist of 30-100 "cycles" of denaturation and replication.
[0064] "Polymerase chain reaction," or "PCR," means a reaction for the in vitro amplification of specific DNA sequences by the simultaneous primer extension of complementary strands of DNA. In other words, PCR is a reaction for making multiple copies or replicates of a target nucleic acid flanked by primer binding sites, such reaction comprising one or more repetitions of the following steps: (i) denaturing the target nucleic acid, (ii) annealing primers to the primer binding sites, and (iii) extending the primers by a nucleic acid polymerase in the presence of nucleoside triphosphates. Usually, the reaction is cycled through different temperatures optimized for each step in a thermal cycler instrument. Particular temperatures, durations at each step, and rates of change between steps depend on many factors well-known to those of ordinary skill in the art, e.g., exemplified by the references: McPherson et al., editors, PCR: A Practical Approach and PCR2: A Practical Approach (IRL Press, Oxford, 1991 and 1995, respectively).
[0065] "Primer" means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3' end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of between 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on, more typically in the range of between 18-40, 20-35, 21-30 nucleotides long, and any length between the stated ranges. Typical primers can be in the range of between 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges. In some embodiments, the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length.
[0066] The terms "hairpin," "stem-loop oligonucleotide," "stem-loop nucleic acid adaptor" and "stem-loop adaptor" as used herein refer to a structure formed by an oligonucleotide comprised of 5' and 3' terminal regions, which are intramolecular inverted repeats that form a double-stranded stem, and a non-self-complementary central region, which forms a single-stranded loop. In some embodiments, the stem-loop oligonucleotide further comprises a second or third single-stranded loop, such as within the 5' stem and/or the 3' stem. An "asymmetric loop" refers to a single-stranded loop on only one stem strand with a "gap region" of unpaired bases across from the asymmetric loop.
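The inverted-repeat property in the stem-loop definition can be illustrated with a short Python sketch. The function names and the example oligonucleotide used in the note below are hypothetical and are not sequences from this disclosure; the sketch assumes an unmodified DNA oligonucleotide with a perfectly paired stem of known length and does not model asymmetric loops, mismatches, or non-replicable bases.

```python
# Illustrative sketch: DNA alphabet only, perfectly paired stem (assumptions).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def split_stem_loop(oligo: str, stem_len: int):
    """Return (5' stem, loop, 3' stem) if the termini are inverted repeats, else None."""
    oligo = oligo.upper()
    if stem_len <= 0 or 2 * stem_len >= len(oligo):
        return None  # no room for a loop between the two stem strands
    five = oligo[:stem_len]
    loop = oligo[stem_len:-stem_len]
    three = oligo[-stem_len:]
    if reverse_complement(five) != three:
        return None  # terminal regions cannot pair along their full length
    return five, loop, three
```

With the hypothetical oligonucleotide GCGCATTTTTATGCGC and a stem length of 6, the sketch reports a 5' stem of GCGCAT, a loop of TTTT, and a 3' stem of ATGCGC.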
[0067] The term "non-complementary" refers to nucleic acid sequence that lacks the ability to form at least one Watson-Crick base pair through specific hydrogen bonds.
[0068] "Cleavable base," as used herein, refers to a nucleotide that is generally not found in a sequence of DNA. For most DNA samples, deoxyuridine is an example of a cleavable base. Although the triphosphate form of deoxyuridine, dUTP, is present in living organisms as a metabolic intermediate, it is rarely incorporated into DNA. When dUTP is incorporated into DNA, the resulting deoxyuridine is promptly removed in vivo by normal processes, e.g., processes involving the enzyme uracil-DNA glycosylase (UDG) (U.S. Patent No. 4,873,192; Duncan, 1981; both references incorporated herein by reference in their entirety). Thus, deoxyuridine occurs rarely or never in natural DNA. Non-limiting examples of other cleavable bases include deoxyinosine, bromodeoxyuridine, 7-methylguanine, 5,6-dihydro-5,6-dihydroxydeoxythymidine, 3-methyldeoxyadenosine, etc. (see, Duncan, 1981). Other cleavable bases will be evident to those skilled in the art.
[0069] The term "degenerate" as used herein refers to a nucleotide or series of nucleotides wherein the identity can be selected from a variety of choices of nucleotides, as opposed to a defined sequence. In specific embodiments, there can be a choice from two or more different nucleotides. In further specific embodiments, the selection of a nucleotide at one particular position comprises selection from only purines, only pyrimidines, or from non-pairing purines and pyrimidines.
[0070] A "non-replicable base" refers to a position at which polymerization ceases. For example, the non-replicable base or sequence may comprise an abasic site or sequence, hexaethylene glycol, and/or a bulky chemical moiety attached to the sugar-phosphate backbone or the base.
[0071] As understood by those in the art, an "abasic site" lacks a base at a position in the oligonucleotide, i.e., the sugar residue is present at the position in the probe, but the purine or pyrimidine (nucleobase) group has been removed or replaced. One or more abasic sites may become incorporated into one or more locations in an oligonucleotide.
[0072] The term "ligase" as used herein refers to an enzyme that is capable of joining the 3' hydroxyl terminus of one nucleic acid molecule to a 5' phosphate terminus of a second nucleic acid molecule to form a single molecule. The ligase may be a DNA ligase or RNA ligase. Examples of DNA ligases include E. coli DNA ligase, T4 DNA ligase, and mammalian DNA ligases.
[0073] The term "molecular identifier sequence(s)" (or "MIS") as used herein refers to a unique nucleotide sequence that is used to distinguish between a single cell or genome or a subpopulation of cells or genomes, and to distinguish duplicate sequences arising from amplification from those which are MIS can be linked to a target nucleic acid of interest by ligation prior to amplification, or during amplification (e.g., reverse transcription or PCR), and used to trace back the amplicon to the genome or cell from which the target nucleic acid originated. A MIS can be added to a target nucleic acid by including the sequence in the adaptor to be ligated to the target. A MIS can also be added to a target nucleic acid of interest during amplification by carrying out reverse transcription with a primer that contains a region comprising the barcode sequence and a region that is complementary to the target nucleic acid such that the barcode sequence is incorporated into the final amplified target nucleic acid product (i.e., amplicon). The MIS may be any number of nucleotides of sufficient length to distinguish the MIS from other MIS. For example, a MIS may be anywhere from 4 to 20 nucleotides long, such as 5 to 11, or 12 to 20. In particular aspects, the MIS has a length of 6 random nucleotides. The terms "molecular identifier sequence," "MIS," "unique molecular identifier," "UMI," "molecular barcode," "molecular tag sequence" and "barcode" are used interchangeably herein.
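As a non-limiting illustration of how a MIS can be used to distinguish amplification duplicates, reads can be grouped by alignment position together with MIS. The following Python sketch assumes a hypothetical input of (position, MIS, read identifier) tuples and exact MIS matching; the function name, the input layout, and the example values are assumptions for illustration, and sequencing errors within the MIS are not handled.

```python
from collections import defaultdict


def collapse_by_mis(reads):
    """Group reads into families keyed by (alignment position, MIS).

    `reads` is an iterable of (position, mis, read_id) tuples (hypothetical layout).
    Reads sharing a key are treated as amplification duplicates of one source molecule.
    """
    families = defaultdict(list)
    for position, mis, read_id in reads:
        families[(position, mis.upper())].append(read_id)
    return dict(families)


# Hypothetical example data.
example_reads = [
    (1000, "ACGTAC", "read_1"),
    (1000, "ACGTAC", "read_2"),  # same position and MIS: counted with read_1
    (1000, "TTGACA", "read_3"),  # same position, different MIS: separate molecule
    (2500, "ACGTAC", "read_4"),  # different position: separate molecule
]
example_families = collapse_by_mis(example_reads)
```

In this hypothetical example the four reads collapse into three families, which would be interpreted as three source molecules.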
[0074] "Sample" means a material obtained or isolated from a fresh or preserved biological sample or synthetically-created source that contains nucleic acids of interest. In certain embodiments, a sample is the biological material that contains the variable immune region(s) for which data or information are sought. Samples can include at least one cell, fetal cell, cell culture, tissue specimen, blood, serum, plasma, saliva, urine, tear, vaginal secretion, sweat, lymph fluid, cerebrospinal fluid, mucosa secretion, peritoneal fluid, ascites fluid, fecal matter, body exudates, umbilical cord blood, chorionic villi, amniotic fluid, embryonic tissue, multicellular embryo, lysate, extract, solution, or reaction mixture suspected of containing immune nucleic acids of interest. Samples can also include non-human sources, such as non-human primates, rodents and other mammals, other animals, plants, fungi, bacteria, and viruses.
II. Double-Stranded Nucleic Acid Adaptors
[0075] In some embodiments, the present disclosure provides synthetic oligonucleotides which form double-stranded adaptors for use in the generation of nucleic acid libraries. In certain embodiments, the double-stranded adaptors are stem-loop adaptors comprising a distal loop. The synthetic oligonucleotides which form the double-stranded adaptors can have a length of 20 to 100 nucleotides, such as 50 to 80 nucleotides, such as between 60 and 70 nucleotides. Exemplary structures of the double-stranded nucleic acid adaptors, such as a bud adaptor, are provided in FIGs. 1A-1C. Generally, the synthetic oligonucleotides which form a bud adaptor comprise a double-stranded ligateable stem region and a double-stranded non-ligateable stem region, separated by a bud (also referred to herein as an asymmetric loop region). Each double-stranded region has a 5' end stem strand and a 3' end stem strand. The 3' end and the 5' end can form a blunt end or a staggered end. In certain aspects, the double-stranded regions have blunt ends. The asymmetric loop (i.e., bud) comprises a molecular identification sequence (MIS) and can be located on either strand between the ligateable and non-ligateable stem regions.
[0076] The double-stranded nucleic acid adaptor may further comprise a gap region on the strand opposite of the bud. The gap region may only comprise the bond between adjacent nucleotides or a region of non-paired nucleotides. The gap region can be on either strand between the ligateable and non-ligateable stem regions. In particular aspects, the gap region comprises a non-replicable base, such as an abasic site or spacer.
[0077] The double-stranded nucleic acid adaptor may further comprise the non-ligateable stem region, which in a stem-loop will be between the distal loop region and the asymmetric loop region. In particular aspects, this region comprises one or more mismatched bases.
[0078] The double-stranded nucleic acid adaptor may further comprise a primer binding site with a known sequence. The primer binding site may be located in the non-ligateable stem region or the distal loop. In certain aspects, a forward primer binding site is located between the gap region and distal loop, and a reverse primer binding site is located between the asymmetric loop and the main loop. For example, the adaptor may comprise flow cell binding sequences, such as P5 and/or P7, or fragments thereof. In some aspects, a first adaptor comprises a P5 sequence and a second adaptor comprises a P7 sequence. Further, the adaptor can comprise part or all of sequencing primer sequences or their binding sites such as index sequencing primers for particular sequencing platforms (e.g., Illumina index primers).
[0079] In certain embodiments, an adaptor may include a barcode domain, e.g., a region or sequence of nucleic acids that serves as a barcode or identifier for a source of target nucleic acids to which the adaptor is ligated during use. Such a barcode domain may serve as an identifier of a sample from which the target nucleic acids are obtained, such that it may be viewed as a sample barcode. When present, the barcode domain may be positioned at any convenient location in the adaptor, such as the non-ligateable stem region of the adaptor, the asymmetric loop of the adaptor, etc. In certain embodiments where the adaptors include a barcode domain, the barcode domain may be combined with the MIS domain, e.g., such that the adaptors include a barcode/MIS domain. In such instances, a barcode/MIS domain is made up of a series of interspersed barcode and MIS bases. By interspersed is meant that the bases which are barcode bases (i.e., the bases that collectively make up the barcode component of a barcode/MIS domain) are distributed or positioned among MIS bases (i.e., the bases that collectively make up the MIS domain of a barcode/MIS domain). As such, a given barcode/MIS domain is one that includes at least one MIS base positioned adjacent to at least one barcode base, where in those instances in which the barcode/MIS domain is made up of 3 or more bases, at least two bases of a first type (e.g., MIS or barcode) may be separated by at least one base of another type (e.g., MIS or barcode). The length of a given barcode/MIS domain may vary, ranging in some instances from 4 to 50 nts, where in some instances the length ranges from 5 to 25 nts, e.g., 6 to 20 nts, where specific lengths of interest include, but are not limited to: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16 nts. Further details regarding such domains may be found in United States Application Serial No. 62/401,676; the disclosure of which is herein incorporated by reference. 
When present, the barcode/MIS domain may be positioned at any convenient location in the adaptor, such as the non- ligateable stem region of the adaptor, the asymmetric loop of the adaptor, etc.
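The interspersed arrangement of barcode and MIS bases described above can be sketched in Python as follows. The pattern notation, in which "B" marks a barcode position and "M" marks a MIS position, and the function names are assumptions introduced for this illustration; they are not taken from the disclosure or from the application incorporated by reference.

```python
def intersperse(barcode: str, mis: str, pattern: str) -> str:
    """Build a combined barcode/MIS domain by filling 'B' and 'M' positions in order."""
    if set(pattern) - {"B", "M"}:
        raise ValueError("pattern may contain only 'B' and 'M'")
    if pattern.count("B") != len(barcode) or pattern.count("M") != len(mis):
        raise ValueError("pattern does not match barcode and MIS lengths")
    barcode_bases, mis_bases = iter(barcode), iter(mis)
    return "".join(next(barcode_bases) if p == "B" else next(mis_bases) for p in pattern)


def split_domain(domain: str, pattern: str):
    """Recover (barcode, MIS) from a combined domain using the same pattern."""
    if len(domain) != len(pattern):
        raise ValueError("domain and pattern must be the same length")
    barcode = "".join(base for base, p in zip(domain, pattern) if p == "B")
    mis = "".join(base for base, p in zip(domain, pattern) if p == "M")
    return barcode, mis
```

For example, with the hypothetical barcode ACG, MIS TTT, and pattern BMBMBM, the combined domain is ATCTGT, and splitting that domain with the same pattern returns the original barcode and MIS.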
[0080] Embodiments of the methods described herein employ populations or collections of stem loop adaptors, e.g., as described above. Such populations or collections may be made up of a plurality of different, i.e., distinct, stem loop adaptors that differ from each other in terms of sequence. In some instances, the plurality of different stem loop adaptors is made up of adaptors that have regions of common sequence (e.g., the stem regions, the loop regions) and regions of differing sequence (e.g., the MIS containing regions, the dephased stem region, etc.). For example, a population of stem loop adaptors according to embodiments of the invention may be made up of stem loop adaptors in which the only region that differs among the population is the MIS containing region, e.g., the disparate distinct adaptor members of the population only differ from each other in terms of their MIS sequences. The number of distinct stem loop adaptors in a given population that is employed in embodiments of the invention may vary, where in some instances the number is 10 or more, such as 50 or more, 100 or more, 500 or more, 1,000 or more, 5,000 or more, 10,000 or more, 50,000 or more, 100,000 or more, 250,000 or more, 500,000 or more, 1,000,000 or more, 5,000,000 or more, 10,000,000 or more, 20,000,000 or more, where in some instances the number is 50,000,000 or less, such as 25,000,000 or less, including 20,000,000 or less, where in some instances the number is 10,000,000 or less, 5,000,000 or less, 1,000,000 or less, 500,000 or less, 100,000 or less, including 50,000 or less.
A. Molecular Identification Sequence
[0081] Incorporation of a molecular identification sequence (MIS) within the stem-loop adaptors, particularly in an asymmetric loop of the adaptor, allows for the tagging of individual source molecules for subsequent informatic analysis, and provides diversity and balance to analyze samples of high complexity. The barcode or molecular identifier sequence within the asymmetric loop can have a length of 4 to 15 nucleotides, such as 5 to 10 nucleotides, such as 5, 6, 7, 8, 9, or 10 nucleotides. In some aspects, the asymmetric loop has a length of 6 nucleotides resulting in 16.8 × 10^6 total possible combinations of MIS adaptors within a library. In some aspects, the random barcode sequence is generated by using a mixture of A, G, C, and/or T for incorporation of nucleotides into the MIS of the double-stranded nucleic acid adaptor.
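The arithmetic behind the number of possible MIS sequences can be illustrated with a short Python sketch, assuming each MIS position is drawn independently from A, C, G, and T. Under that assumption a single 6-nucleotide MIS has 4^6 = 4,096 possible sequences; the figure of about 16.8 × 10^6 equals 4^12, which is consistent with counting the combination of two independent 6-nucleotide MIS, one at each end of a library molecule. That pairing interpretation is an assumption of this sketch and is not stated in the paragraph above. The function names are likewise hypothetical.

```python
from itertools import product


def mis_count(length: int, alphabet: str = "ACGT") -> int:
    """Number of distinct sequences of the given length over the alphabet."""
    return len(alphabet) ** length


def enumerate_mis(length: int, alphabet: str = "ACGT"):
    """Lazily yield every possible MIS of the given length."""
    return ("".join(bases) for bases in product(alphabet, repeat=length))


single_adaptor = mis_count(6)        # one 6-nt MIS
paired_adaptors = mis_count(6) ** 2  # two independent 6-nt MIS (assumed interpretation)
```

Enumerating every MIS is practical only for short lengths; for counting, the closed-form expression is sufficient.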
[0082] Across from the asymmetric loop is a gap region which can comprise the bond between two adjacent nucleotides, unpaired bases or at least one non-replicable base or spacer to allow for correct folding of the adaptor and to prevent collapse of the asymmetric loop structure. Accordingly, the gap region across from the asymmetric loop has a length at least 1 nucleotide less than the length of the asymmetric loop. In some aspects, the gap region is at least 2, or up to at least 5, nucleotides shorter than the length of the asymmetric loop. For example, an adaptor with an asymmetric loop of 6 nucleotides would have a gap region of less than 6 nucleotides, such as 5, 4, 3, 2, 1, or 0 (e.g., nucleotide bond) nucleotides in length. In particular aspects, the gap region has a length of 1 nucleotide, such as one non-replicable base, particularly one abasic site.
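The length relationship between the asymmetric loop and the gap region can be expressed as a simple check. The Python sketch below only encodes the constraint stated above, namely that the gap region is at least 1 nucleotide shorter than the asymmetric loop and may have a length of 0 (a bond between adjacent nucleotides); the function names are hypothetical.

```python
def gap_is_valid(loop_len: int, gap_len: int) -> bool:
    """True if the gap region is at least 1 nucleotide shorter than the asymmetric loop."""
    return 0 <= gap_len < loop_len


def allowed_gap_lengths(loop_len: int) -> list:
    """List permitted gap lengths for a given loop length, longest first."""
    return list(range(loop_len - 1, -1, -1))
```

For a 6-nucleotide asymmetric loop, the permitted gap lengths under this check are 5, 4, 3, 2, 1, and 0.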
[0083] As reviewed above, barcodes may be employed. Barcoding is described, e.g., in U.S. Pat. 7,902,122. Methods of using stem loop adaptor ligation and primer extension or PCR to add additional sequences are described, e.g., in U.S. Pat. 7,803,550, which is incorporated by reference herein in its entirety. Barcode incorporation by primer extension, for example via PCR, may be performed using methods described in U.S. 5,935,793 or US 2010/0227329. In some embodiments, a barcode may be incorporated into a nucleic acid via ligation, which can then be followed by amplification; for example, methods described in U.S. Pat. 5,858,656, U.S. Pat. 6,261,782, U.S. Pat. Publn. 2011/0319290, or U.S. Pat. Publn. 2012/0028814 may be used with the present invention. In some embodiments, one or more barcodes may be used, e.g., as described in U.S. Pat. Publn. 2007/0020640, U.S. Pat. Publn. 2009/0068645, U.S. Pat. Publn. 2010/0273219, U.S. Pat. Publn. 2011/0015096, or U.S. Pat. Publn. 2011/0257031.
B. Dephased Stem
[0084] In some embodiments, the double-stranded nucleic acid adaptors further comprise a variable length stem region in the ligateable stem region between the terminal end (e.g., the 5' terminal end and/or the 3' terminal end) and the bud or gap region. The variable stem provides sufficient diversity in the library at the beginning of the read for cluster detection and intensity correction without a control nucleic acid library, such as the PhiX control nucleic acid library. The variable stems also provide more unique information for distinguishing between sequences bioinformatically. The variable stems (also referred to herein as the dephased stems) can differ by a single nucleotide (e.g., FIG. 1D), such as having a length of n, n+1, n+2, and n+3. Accordingly, a population of double-stranded nucleic acid adaptors, e.g., as described above, can comprise a mixture of double-stranded nucleic acid adaptors having a stem length of n, n+1, n+2, and n+3, such as 8, 9, 10, and 11. In some aspects, the variable stems have a length n between 3-20 nucleotides, particularly 6-15 nucleotides, such as 6, 7, 8, 9, 10, 11, or 12 nucleotides. For example, dephased stems are described in Lundberg et al., 2013, and Wu et al., 2015. Exemplary variable-length stem sequences within one subset of double-stranded nucleic acid adaptors include TGAGCTAC, TGAGCTACT, TGAGCTACTG, and TGAGCTACTGA as well as the sequences disclosed in FIG. 1D.
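The dephasing effect of variable-length stems can be illustrated by counting which bases appear at each sequencing cycle. The Python sketch below uses the four exemplary stem sequences given above; the downstream sequence, the assumption that a read begins at the first stem base, and the function name are hypothetical choices for illustration and do not describe any particular read structure of the disclosure.

```python
from collections import Counter

# Exemplary variable-length stems from the text (lengths 8, 9, 10, and 11).
STEMS = ["TGAGCTAC", "TGAGCTACT", "TGAGCTACTG", "TGAGCTACTGA"]


def bases_per_cycle(stems, downstream, cycles):
    """Count the bases observed at each cycle across reads of stem + downstream sequence.

    Assumes each read starts at the first stem base (illustrative assumption).
    """
    reads = [(stem + downstream)[:cycles] for stem in stems]
    return [Counter(read[i] for read in reads if i < len(read)) for i in range(cycles)]


DOWNSTREAM = "GGCCAATT"  # hypothetical low-diversity sequence shared by all molecules
mixed = bases_per_cycle(STEMS, DOWNSTREAM, 12)
single = bases_per_cycle([STEMS[0]] * 4, DOWNSTREAM, 12)
```

In this toy case a single stem length yields one base at every cycle, whereas the mixture of stem lengths yields several different bases at cycles beyond the shortest stem, because the shared downstream sequence is read at staggered offsets.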
[0085] In some aspects, the terminal end, such as the 5' end, comprises a ligation block. In certain aspects, the ligation block is a dephosphorylated nucleotide, a 5' hydroxyl, or an inverted base, such as inverted dT. In specific aspects, a 5' end of the double-stranded nucleic acid adaptor oligonucleotide lacks a phosphate.
[0086] In some aspects, the 5' end and/or 3' end has at least one phosphorothioate bond. The phosphorothioate bond can protect the adaptor from degradation by proofreading enzymes (e.g., 5'-3' exonuclease) and prevent unwanted ligation products or adaptor dimers. In particular embodiments, the double-stranded nucleic acid adaptor has a phosphorothioate modification on the last 2 bases of the 3' terminal end, and the 1st base of the 5' terminal end to deter adapter dimer formation and optimize the signal-to-noise ratio. Other such exonuclease resistant modifications may include phosphorodithioates, methyl phosphonates and 2'-O-methyl sugars, either separately or in combination. A number of other modifications are known to reduce the exonuclease degradation of single DNA strands, including phosphoramidites (P-NR2), phosphorofluoridates (P-F), boranophosphanes (P-BH3) or phosphoroselenoates (P-Se), and modifications to the sugar rings, such as 2'-O-alkyl groups, 2'-fluoro groups, 2'-amino groups such as 2-amino propyl.
C. Non-Replicable Bases
[0087] In some embodiments, the double-stranded nucleic acid adaptor comprises a replication stop or non-replicable base. In certain aspects, the gap region may comprise a non-replicable base or spacer, such as an abasic site or cleavable base. In certain aspects, wherein the double-stranded nucleic acid adaptor comprises a stem-loop, the distal loop of the stem-loop oligonucleotide adaptor may comprise a non-replicable base or spacer, such as an abasic site or cleavable base. The replication stop may be at the 5' end of the stem, the 3' end of the stem, or proximal to the distal loop. The non-replicable base can function as a polymerase terminator and facilitate correct adaptor folding. Correct adaptor folding, facilitated by the use of non-replicable bases, also prevents spurious priming by excess stem loop adaptors as the folded, stem-loop conformation is thermodynamically favored rather than hybridization to library molecules. The adaptor may comprise at least 2, 3, 4, 5, 6 or more non-replicable bases depending on the length of the adaptor. Non-replicable bases include, but are not limited to, 1',2'-dideoxyribose (idSp), and deoxyuridine. Cleavable bases include, but are not limited to: uracil, inosine or a ribonucleotide. The term spacer means a hydrocarbon residue with preferably one to six carbon atoms, preferably an alkdiyl group with 2 to 4 carbon atoms, most preferred linear C3 (5'-C3-spacer).
[0088] Double-stranded nucleic acid adaptors comprising cleavable bases can be cleaved by enzymes or chemical reagents. Examples of cleaving agents include DNA repair enzymes, glycosylases, DNA cleaving endonucleases, ribonucleases and silver nitrate. For example, cleavage at dU may be achieved using uracil DNA glycosylase and endonuclease VIII (USER™, NEB, Ipswich, Mass.) (U.S. Pat. No. 7,435,572). Where the modified nucleotide is a ribonucleotide, the adapter can be cleaved with an endoribonuclease.
[0089] Abasic sites can be recognized and cleaved by AP endonucleases and/or AP lyases. Class II AP endonucleases cleave at AP sites to leave a 3' OH that can be used in polynucleotide polymerization. Furthermore, AP endonucleases can remove moieties attached to the 3' OH that inhibit polynucleotide polymerization. For example a 3' phosphate can be converted to a 3' OH by E. coli endonuclease IV. AP endonucleases can work in conjunction with glycosylases.
D. Illustrative Embodiments
[0090] FIGS. 1A-1D provide depictions of illustrative embodiments of MIS containing adaptors according to certain embodiments of the invention. In FIG. 1A, a non-degradable bud adaptor is illustrated, where the adaptor comprises a molecular identification sequence (MIS) in an asymmetric loop (or bud), an abasic site (e.g., 1',2'-dideoxyribose (idSp)) opposite the MIS to facilitate correct adaptor folding, abasic sites in the distal loop to function as polymerase terminators, a variable length stem to mediate ligation to nucleic acid molecules and provide base diversity, and phosphorothioate bonds (*) to prevent degradation through exonuclease activity and decrease adaptor dimer formation. FIG. 1B provides a schematic depicting a non-degradable bubble adaptor (BB) comprising a MIS in a symmetric loop, abasic site(s) opposite the MIS, abasic sites in the main loop, and phosphorothioate bonds. FIG. 1C provides a schematic depicting a degradable bubble with a self-complementary byproduct (RB). FIG. 1D provides exemplary sequences of de-phased stem regions that may be present in adaptors of the invention.
III. Methods of Use
[0091] Double-stranded nucleic acid adaptors and stem-loop oligonucleotides can be used as adaptors for preparing libraries for whole genome or whole transcriptome amplification for PCR analysis, microarray analysis, conventional Sanger or next generation sequencing, e.g., as described in U.S. Pat. No. 7,803,550. In specific embodiments, a whole genome is amplified from a single cell.
[0092] Accordingly, in some embodiments there is provided a method of preparing a library of nucleic acid molecules. For example, libraries generated by DNA fragmentation and addition of a stem-loop adaptor to one or both DNA ends may be used to amplify (by PCR) and sequence DNA regions adjacent to a previously established DNA sequence (see, for example, U.S. Patent No. 6,777,187 and references therein, all of which are incorporated by reference herein in their entirety). The double-stranded nucleic acid adaptor can be ligated to the 5' end, the 3' end, or both strands of DNA.
[0093] In some embodiments, a plurality of nucleic acid molecules are amplified and sequenced by ligating the plurality of nucleic acid molecules to a population, e.g., as described above, of double-stranded nucleic acid adaptors. One method comprises obtaining a population of target nucleic acid molecules and attaching at least one end of a double-stranded nucleic acid adaptor to at least one end of the target nucleic acid molecule and displacing one strand of the adaptor bound oligonucleotide by strand displacement or nick translation. In some aspects, a bud stem-loop adaptor comprising a MIS is ligated to both ends of the target nucleic acid. In other aspects, a bud stem-loop adaptor comprising a MIS is ligated to one end of the target nucleic acid and a stem-loop adaptor not comprising a MIS is ligated to the other end of the target nucleic acid. The two adaptors ligated to each end of a target nucleic acid may comprise part or all of a first sequencing primer sequence or a second sequencing primer sequence, such that an adaptor on one end has part or all of a first sequencing primer sequence and an adaptor on the other end has part or all of a second sequencing primer sequence. The adaptor may be ligated to one strand (i.e., single-stranded ligation) or to both strands (i.e., double-stranded ligation) of the target nucleic acid.
[0094] In some aspects, the target nucleic acid is a double-stranded DNA molecule. The double-stranded DNA may be any type of DNA (or sub-type thereof) including, but not limited to, genomic DNA (e.g., prokaryotic genomic DNA (e.g., bacterial genomic DNA, archaea genomic DNA, etc.), eukaryotic genomic DNA (e.g., plant genomic DNA, fungi genomic DNA, animal genomic DNA (e.g., mammalian genomic DNA (e.g., human genomic DNA, rodent genomic DNA (e.g., mouse, rat, etc.), etc.), insect genomic DNA (e.g., drosophila), amphibian genomic DNA (e.g., Xenopus), etc.)), viral genomic DNA, mitochondrial DNA, cell-free DNA, such as NIPT DNA, including fetal and/or maternal cell free DNA, or any combination of DNA types thereof or subtypes thereof. Accordingly, in some aspects the method comprises attaching an adaptor to complementary single strands of the double-stranded DNA molecule. Generally, the plurality of genomic DNA molecules are enzymatically digested or randomly fragmented to produce DNA fragments, a MIS stem-loop adaptor is ligated to at least one end of a plurality of the DNA fragments to produce adaptor-linked fragments, and the adaptor-linked fragments are then amplified. In other aspects, the target nucleic acids are isolated from cell-free DNA (cfDNA), e.g., where the DNA is an NIPT DNA sample. In some aspects, the isolated cfDNA may comprise fragments (e.g., of about 50 to 200 bp, particularly about 167 bp in length) and not need a fragmentation step prior to library preparation.
[0095] A MIS double-stranded nucleic acid adaptor may be coupled to one end of a target nucleic acid molecule or to both ends of a target nucleic acid molecule. The double-stranded nucleic acid adaptor may be coupled to the nucleic acid molecule via ligation to the 5' end of the nucleic acid molecule, for example, by blunt-end ligation. Ligating the double-stranded nucleic acid adaptor to one or both ends of a target nucleic acid molecule may result in nick formation. Said one or more nicks may be removed from the ligated double-stranded nucleic acid adaptor and the nucleic acid target molecule. In some aspects, the adaptor-bound nucleic acid molecule comprises a nick having a 3' hydroxyl group and strand displacement or nick translation polymerization is performed to extend the nucleic acid molecules to the adaptor. The polymerization may cease at a non-replicable base, such as within the gap region. In certain aspects, wherein the double-stranded nucleic acid adaptor comprises a stem-loop structure, polymerization may cease in the region between loops, and/or the main loop. Thus, an extension reaction may extend the 3' end of the nucleic acid molecule through the stem-loop adaptor where the loop portion is cleaved at a cleavable replication stop.
[0096] In some embodiments, methods of the present invention utilize a strand-displacing polymerase, such as Φ29 Polymerase, Bst Polymerase, Vent Polymerase, 9°Nm Polymerase, Klenow fragment of DNA Polymerase I, MMLV Reverse Transcriptase, AMV reverse transcriptase, HIV reverse transcriptase, a mutant form of T7 phage DNA polymerase that lacks 3'-5' exonuclease activity, or a mixture thereof.
A. Target Nucleic Acid Molecules
[0097] Nucleic acids in a nucleic acid sample being analyzed (or processed) in accordance with the present invention can be from any nucleic acid source, e.g., as described above. As such, nucleic acids in a nucleic acid sample can be from virtually any nucleic acid source, including but not limited to genomic DNA, complementary DNA (cDNA), RNA (e.g., messenger RNA, ribosomal RNA, short interfering RNA, microRNA, etc.), plasmid DNA, mitochondrial DNA, cfDNA, etc. Furthermore, as any organism can be used as a source of nucleic acids to be processed in accordance with the present invention, no limitation in that regard is intended. Exemplary organisms include, but are not limited to, plants, animals (e.g., reptiles, mammals, insects, worms, fish, etc.), bacteria, fungi (e.g., yeast), viruses, etc. In certain embodiments, the nucleic acids in the nucleic acid sample are derived from a mammal, where in certain embodiments the mammal is a human. A nucleic acid molecule of interest can be a single nucleic acid molecule or a plurality of nucleic acid molecules. Also, a nucleic acid molecule of interest can be of biological or synthetic origin. Examples of nucleic acid molecules include genomic DNA, cDNA, RNA, amplified DNA, a pre-existing nucleic acid library, etc.
[0098] A nucleic acid molecule of interest may be subjected to various treatments, such as repair treatments and fragmenting treatments. Fragmenting treatments include mechanical, sonic, chemical, enzymatic, degradation over time, etc. Repair treatments include nick repair via extension and/or ligation, polishing to create blunt ends, removal of damaged bases such as deaminated, derivatized, abasic, or crosslinked nucleotides, etc. A nucleic acid molecule of interest may also be subjected to chemical modification (e.g., bisulfite conversion, methylation / demethylation), extension, amplification (e.g., PCR, isothermal, etc.), etc.
[0099] In the case of fragmented DNA (for example, cell-free DNA from blood and/or urine) the reaction may or may not use a fragmentation step.
[00100] In some aspects, the plurality of nucleic acid molecules comprises nucleic acid fragments, such as gDNA subject to fragmentation. In some aspects, the shear force may be a hydrodynamic shear force, such as those generated by acoustic or mechanical means. Hydrodynamic shearing of a nucleic acid can occur by any method known in the art, including passing the nucleic acid through a narrow capillary or orifice, referred to as "point-sink" shearing (Oefner et al, 1996; Thorstenson et al, 1998; Quail, 2010), acoustic shearing, or sonication. The commercially available focused-ultrasonicators, in conjunction with miniTUBEs or microTUBEs (Covaris, Woburn, MA; U.S. Patent Nos. 8,459,121; 8,353,619; 8,263,005; 7,981,368; 7,757,561), can randomly fragment DNA with distributions centered between 2-5 kb and 0.1-1.5 kb, respectively. Sonication subjects nucleic acid to hydrodynamic shearing forces (Grokhovsky, 2006; Sambrook et al, 2006). For example, the commercially available Bioruptor (Diagenode; Denville, NJ; U.S. Patent Publn. No. 2012/0264228) uses sonication to shear nucleic acids.
[00101] In certain aspects, a nucleic acid fragment may have a size of about 100 bp, about 200 bp, about 300 bp, about 400 bp, about 500 bp, about 1000 bp, or about 2000 bp. In certain aspects, the nucleic acid fragments may have an average size of about 100 bp, about 200 bp, about 300 bp, about 400 bp, about 500 bp, about 1000 bp, or about 2000 bp. In certain aspects, a nucleic acid molecule may have a size of about 2000 bp, 5000 bp, 7500 bp, 10,000 bp, 20,000 bp, 30,000 bp, 40,000 bp, 50,000 bp, 60,000 bp, 70,000 bp, 80,000 bp, 90,000 bp, or 100,000 bp. Nucleic acids may be, for example, RNA or DNA. Modified forms of RNA or DNA may also be used.
[00102] Where desired, a given protocol according to embodiments of the invention may include a pooling step, e.g., where a first adaptor ligated composition is combined or pooled with the one or more additional adaptor ligated compositions. As such, in certain embodiments, nucleic acid fragments tagged according to aspects of the subject invention are pooled with nucleic acid fragments derived from a plurality of sources (e.g., a plurality of organisms, tissues, cells, or subjects), where by "plurality" is meant two or more. The number of different tagged compositions produced from different sources that are combined or pooled in such embodiments may vary, where the number ranges in some instances from 2 to 50, such as 3 to 25, including 4 to 20 or 10,000, or more. Prior to or after pooling, the different tagged compositions can be amplified, e.g., by polymerase chain reaction (PCR), such as described above.
[00103] The RNA molecule may be obtained from a sample, such as a sample comprising total cellular RNA, a transcriptome, or both; the sample may be obtained from one or more viruses; from one or more bacteria; or from a mixture of animal cells, bacteria, and/or viruses, for example. The sample may comprise mRNA, such as mRNA that is obtained by affinity capture.
[00104] Obtaining nucleic acid molecules may comprise generation of the cDNA molecule by reverse transcribing the mRNA molecule with a reverse transcriptase, such as, for example Tth DNA polymerase, HIV Reverse Transcriptase, AMV Reverse Transcriptase, MMLV Reverse Transcriptase, or a mixture thereof.
B. Amplification
[00105] A number of template-dependent processes are available to amplify the nucleic acids present in a given template sample. One of the best known amplification methods is the polymerase chain reaction (referred to as PCR™) which is described in detail in U.S. Patent Nos. 4,683,195, 4,683,202, and 4,800,159 and in Innis et al, 1990, each of which is incorporated herein by reference in their entirety. Briefly, two synthetic oligonucleotide primers, which are complementary to two regions of the template DNA (one for each strand) to be amplified, are added to the template DNA (that need not be pure), in the presence of excess deoxynucleotides (dNTP's) and a thermostable polymerase, such as, for example, Taq (Thermus aquaticus) DNA polymerase. In a series (typically 30-35) of temperature cycles, the target DNA is repeatedly denatured (around 90°C), annealed to the primers (typically at 50-60°C) and a daughter strand extended from the primers (72°C). As the daughter strands are created they act as templates in subsequent cycles. Thus, the template region between the two primers is amplified exponentially, rather than linearly.
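The contrast between exponential and linear amplification can be made concrete with a short calculation. The sketch below is idealized: the function names are hypothetical, and the per-cycle efficiency parameter is an assumption introduced for illustration (real reactions plateau and rarely sustain perfect doubling).

```python
# Idealized sketch; names and the efficiency parameter are illustrative assumptions.
def exponential_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Copies after PCR when each cycle multiplies the template by (1 + efficiency)."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


def linear_copies(initial_copies: float, cycles: int) -> float:
    """Copies when only the original template is copied once per cycle."""
    return initial_copies * (1 + cycles)


for cycles in (10, 20, 30):
    print(cycles, exponential_copies(1, cycles), linear_copies(1, cycles))
```

Under perfect doubling, 30 cycles yield 2^30 (about 1.07 × 10^9) copies per starting molecule, whereas linear copying of the original template alone would yield only 31.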
[00106] A second barcode, such as a sample barcode, may be added to the target nucleic acid molecules during amplification. One method (e.g., described in PCT/US2013/068468, incorporated herein by reference) involves annealing a primer to the first barcoded nucleic acid molecule, the primer including a first portion complementary to the first barcoded nucleic acid molecule and a second portion including a second barcode; and extending the annealed primer to form a dual barcoded nucleic acid molecule, the dual barcoded nucleic acid molecule including the second barcode, the first barcode, and at least a portion of the nucleic acid molecule. Thus, the primer may include a 3' portion and a 5' portion, where the 3' portion may anneal to a portion of the first barcode and the 5' portion comprises the second barcode.
C. Sequencing
[00107] Methods are also provided for the sequencing of the library of adaptor-linked fragments. Any technique for sequencing nucleic acids known to those skilled in the art can be used in the methods of the present disclosure. DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing-by-synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing-by-synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, and SOLiD sequencing.
[00108] The nucleic acid library may be generated with an approach compatible with Illumina sequencing such as a Nextera™ DNA sample prep kit, and additional approaches for generating Illumina next-generation sequencing library preparation are described, e.g., in Oyola et al. (2012). In other embodiments, a nucleic acid library is generated with a method compatible with a SOLiD™ or Ion Torrent sequencing method (e.g., a SOLiD® Fragment Library Construction Kit, a SOLiD® Mate-Paired Library Construction Kit, SOLiD® ChIP-Seq Kit, a SOLiD® Total RNA-Seq Kit, a SOLiD® SAGE™ Kit, an Ambion® RNA-Seq Library Construction Kit, etc.). Additional methods for next-generation sequencing methods, including various methods for library construction that may be used with embodiments of the present invention are described, e.g., in Pareek (2011) and Thudi (2012).
[00109] In particular aspects, the sequencing technologies used in the methods of the present disclosure include the HiSeq™ system (e.g., HiSeq™ 2000 and HiSeq™ 1000) and the MiSeq™ system from Illumina, Inc. The HiSeq™ system is based on massively parallel sequencing of millions of fragments using attachment of randomly fragmented genomic DNA to a planar, optically transparent surface and solid phase amplification to create a high density sequencing flow cell with millions of clusters, each containing about 1,000 copies of template per sq. cm. These templates are sequenced using four-color DNA sequencing-by-synthesis technology. The MiSeq™ system uses TruSeq™, Illumina's reversible terminator-based sequencing-by-synthesis.
[00110] Another example of a DNA sequencing technique that can be used in the methods of the present disclosure is 454 sequencing (Roche) (Margulies et al., 2005). 454 sequencing involves two steps. In the first step, DNA is sheared into fragments of approximately 300-800 base pairs, and the fragments are blunt ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads using, e.g., Adaptor B, which contains 5'-biotin tag. The fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (pico-liter sized). Pyrosequencing is performed on each DNA fragment in parallel. Addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated.
[00111] Another example of a DNA sequencing technique that can be used in the methods of the present disclosure is SOLiD technology (Life Technologies, Inc.). In SOLiD sequencing, genomic DNA is sheared into fragments, and adaptors are attached to the 5' and 3' ends of the fragments to generate a fragment library. Alternatively, internal adaptors can be introduced by ligating adaptors to the 5' and 3' ends of the fragments, circularizing the fragments, digesting the circularized fragment to generate an internal adaptor, and attaching adaptors to the 5' and 3' ends of the resulting fragments to generate a mate-paired library. Next, clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates are denatured and beads are enriched to separate the beads with extended templates. Templates on the selected beads are subjected to a 3' modification that permits bonding to a glass slide.
[00112] Another example of a DNA sequencing technique that can be used in the methods of the present disclosure is the IonTorrent system (Life Technologies, Inc.). Ion Torrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way. Each well holds a different DNA template. Beneath the wells is an ion-sensitive layer and beneath that a proprietary Ion sensor. If a nucleotide, for example a C, is added to a DNA template and is then incorporated into a strand of DNA, a hydrogen ion will be released. The charge from that ion will change the pH of the solution, which can be detected by the proprietary ion sensor. The sequencer will call the base, going directly from chemical information to digital information. The Ion Personal Genome Machine (PGM™) sequencer then sequentially floods the chip with one nucleotide after another. If the next nucleotide that floods the chip is not a match, no voltage change will be recorded and no base will be called. If there are two identical bases on the DNA strand, the voltage will be double, and the chip will record two identical bases called. Because this is direct detection— no scanning, no cameras, no light— each nucleotide incorporation is recorded in seconds.
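The flow-by-flow logic described above (no signal on a mismatch, a doubled signal for two identical bases) can be sketched as a toy model. This is a minimal illustration, not the instrument's base caller: the function names and the repeating flow order "TACG" are hypothetical, signals are idealized integers with no noise, and the input is the strand being synthesized rather than the template.

```python
# Toy model only; flow order, names, and noise-free integer signals are assumptions.
def flowgram(synthesized: str, flow_order: str = "TACG", max_flows: int = 400) -> list[int]:
    """Idealized signal per flow: the number of bases incorporated in that flow."""
    signals: list[int] = []
    position = 0
    flow = 0
    while position < len(synthesized) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        run = 0
        # A homopolymer run is incorporated within a single flow.
        while position < len(synthesized) and synthesized[position] == nucleotide:
            run += 1
            position += 1
        signals.append(run)  # 0 means no incorporation, so no base is called
        flow += 1
    return signals


def call_bases(signals: list[int], flow_order: str = "TACG") -> str:
    """Rebuild the synthesized strand from idealized per-flow signals."""
    return "".join(flow_order[i % len(flow_order)] * s for i, s in enumerate(signals))


signals = flowgram("TTACG")
print(signals, call_bases(signals))
```

In this model a signal of 2 corresponds to the "double voltage" case in the text, and a signal of 0 corresponds to a flow in which the nucleotide does not match.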
[00113] Another example of a sequencing technology that can be used in the methods of the present disclosure includes the single molecule, real-time (SMRT™) technology of Pacific Biosciences. In SMRT™, each of the four DNA bases is attached to one of four different fluorescent dyes. These dyes are phospholinked. A single DNA polymerase is immobilized with a single molecule of template single stranded DNA at the bottom of a zero-mode waveguide (ZMW). A ZMW is a confinement structure which enables observation of incorporation of a single nucleotide by DNA polymerase against the background of fluorescent nucleotides that rapidly diffuse in and out of the ZMW (in microseconds). It takes several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label is excited and produces a fluorescent signal, and the fluorescent tag is cleaved off. Detection of the corresponding fluorescence of the dye indicates which base was incorporated. The process is repeated.
[00114] A further sequencing platform includes the CGA Platform (Complete Genomics). The CGA technology is based on preparation of circular DNA libraries and rolling circle amplification (RCA) to generate DNA nanoballs that are arrayed on a solid support (Drmanac et al. 2010). Complete Genomics' CGA Platform uses a novel strategy called combinatorial probe anchor ligation (cPAL) for sequencing. The process begins by hybridization between an anchor molecule and one of the unique adapters. Four degenerate 9-mer oligonucleotides are labeled with specific fluorophores that correspond to a specific nucleotide (A, C, G, or T) in the first position of the probe. Sequence determination occurs in a reaction where the correct matching probe is hybridized to a template and ligated to the anchor using T4 DNA ligase. After imaging of the ligated products, the ligated anchor-probe molecules are denatured. The process of hybridization, ligation, imaging, and denaturing is repeated five times using new sets of fluorescently labeled 9-mer probes that contain known bases at the n + 1, n + 2, n + 3, and n + 4 positions.
D. Illustrative Method
[00115] FIG. 2 provides a schematic depiction of a process for library construction according to an embodiment of the invention. As shown in FIG. 2, fragmented double-stranded genomic DNA starting material is combined with a population of non-degradable bud adaptors each comprising an MIS. The number of distinct bud adaptors in the population is 4^(6+6) = 16.8 × 10^6. Following ligation, the 3' end of the genomic DNA is extended along the stem of the bud adaptor, through the MIS containing bud, until it reaches a non-replicable base in the loop. Resultant primer binding sites initially provided in the non-ligateable stem region of the adaptors are then employed to amplify the DNA, where the amplified DNA includes sample barcode and P5/P7 domains, e.g., for Illumina NGS.
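The adaptor count quoted above can be checked with a short calculation. The sketch assumes the exponent reflects two fully random 6-nucleotide segments with four possible bases at each position; the function name is hypothetical.

```python
# Arithmetic check; the reading of "6+6" as two random 6-nt segments is an assumption.
def distinct_sequences(segment_lengths: tuple[int, ...], alphabet_size: int = 4) -> int:
    """Number of distinct sequences over all random positions combined."""
    return alphabet_size ** sum(segment_lengths)


count = distinct_sequences((6, 6))
print(count)                   # 16777216
print(round(count / 1e6, 1))   # 16.8 (million)
```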
IV. Kits of the Present Invention
[00116] The technology herein includes kits for creating libraries of target nucleic acids in a sample. A "kit" refers to a combination of physical elements. For example, a kit may include, for example, one or more components such as double-stranded nucleic acid adaptors or stem-loop adaptors, including without limitation specific primers, enzymes, reaction buffers, an instruction sheet, and other elements useful to practice the technology described herein. These physical elements can be arranged in any way suitable for carrying out the invention.
[00117] The kit may further comprise a polymerase, such as a strand displacing polymerase, including, for example, Φ29 Polymerase, Bst Polymerase, Vent Polymerase, 9°Nm Polymerase, Klenow fragment of DNA Polymerase I, MMLV Reverse Transcriptase, a mutant form of T7 phage DNA polymerase that lacks 3'-5' exonuclease activity, or a mixture thereof.
[00118] The components of the kits may be packaged either in aqueous media or in lyophilized form. The container means of the kits will generally include at least one vial, test tube, flask, bottle, syringe or other container means, into which a component may be placed, and preferably, suitably aliquoted (e.g., aliquoted into the wells of a microtiter plate). Where there is more than one component in the kit, the kit also will generally contain a second, third or other additional container into which the additional components may be separately placed. However, various combinations of components may be comprised in a single vial. The kits of the present invention also will typically include a means for containing the nucleic acids, and any other reagent containers in close confinement for commercial sale. Such containers may include injection or blow molded plastic containers into which the desired vials are retained.
[00119] A kit will also include instructions for employing the kit components as well as the use of any other reagent not included in the kit. Instructions may include variations that can be implemented. It is contemplated that such reagents are embodiments of kits of the invention. Such kits, however, are not limited to the particular items identified above and may include any reagent used for the manipulation or characterization of the methylation of a gene.
[00120] The container means of the kits will generally include at least one vial, test tube, flask, bottle, or other container means, into which a component may be placed, and preferably, suitably aliquoted. Where there is more than one component in the kit, the kit also will generally contain additional containers into which the additional components may be separately placed. However, various combinations of components may be comprised in a container. The kits of the present invention also will typically include a means for packaging the component containers in close confinement for commercial sale. Such packaging may include injection or blow-molded plastic containers into which the desired component containers are retained.
V. Examples
[00121] The following examples are included to demonstrate preferred embodiments of the invention. It should be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function well in the practice of the invention, and thus can be considered to constitute preferred modes for its practice. However, those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit and scope of the invention.
Example 1 - Production and Evaluation of MIS Library using Double Stranded Nucleic Acid Adaptors
[00122] Libraries were prepared from both individual and pooled plasma samples obtained from donors. Cell-free DNA was isolated from the pooled plasma samples using the Qiagen QIAamp Circulating Nucleic Acid kit. Libraries were prepared as in the ThruPLEX® Plasma-seq Kit (Rubicon Genomics®), including repairing the cfDNA to produce molecules with blunt ends, with the difference being ligation of the stem-loop adaptors depicted in FIG. 1A to the 5' end of the cfDNA, leaving a nick at the 3' end of the target fragment. Next, the 3' ends of the cfDNA were extended to complete library synthesis and Illumina-compatible indexes were added by amplification. The library was then processed and sequenced on the Illumina MiSeq, NextSeq500 on both mid- and high-output flow cells, as well as the HiSeq2500 and HiSeq3000. Sequencing data was generated using PicardTools.
[00123] Each of the three adaptor designs was evaluated using sequencing analysis metrics, particularly the percentage of unmapped reads. While the non-degradable bubble (BB) design (FIG. 1B) was found to lose diversity due to collapse of the structure, the non-degradable bud (YB) design (FIG. 1A) was shown to have a significant reduction in the percentage of unmapped reads (FIG. 3A). Further, the bud adaptor design with a 5' phosphorothioate bond in addition to the 3' phosphorothioate bond (5PTYB-idSp) showed a significant reduction in the percentage of unmapped reads compared to the bud adaptor design with only the 3' phosphorothioate (YBidSp). Therefore, the bud adaptor design (YB-5PT-idSp) with 2 abasic sites in the main loop, phosphorothioate modification on the last 2 of the 3' bases, and the 1st of the 5' base to deter adapter dimer formation was used for the subsequent studies.
[00124] The next sequencing runs were performed to evaluate the variable (i.e., dephased) stems versus the standard stems. The runs were performed on 0.5 ng of input DNA, no PhiX was included, and the data was untrimmed. The sequencing run with the standard stems failed on the NextSeq500 due to a lack of diversity in the stems. In contrast, a reduced Q score in the stem region showed that the sequencing run with two stem lengths resolved the issue (FIG. 3B). The increased percentage of unmapped reads reflects the elevated NTC.
[00125] Table 1: Sequence of the bud adaptors with dephased stems.
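[Table 1 appears as an image (imgf000035_0001) in the original publication; the adaptor sequences are not reproduced here.]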
idSp=internal 1',2'-Deoxyribose (dSpacer)
*=phosphorothioate bond
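By way of illustration only, the effect of stem dephasing on per-cycle base diversity can be sketched in Python. The stem sequences, the read layout (stem followed by insert), and the function names below are hypothetical assumptions and are not the sequences of Table 1. The sketch merely shows why a single common stem presents one base per cycle in the stem region, whereas a pool of differing stems presents several.

```python
import random

def simulate_reads(stems, n_reads, insert_len=20, seed=0):
    """Build toy reads as stem + random insert (assumed read layout)."""
    rng = random.Random(seed)
    reads = []
    for k in range(n_reads):
        stem = stems[k % len(stems)]  # cycle through the stem pool
        insert = "".join(rng.choice("ACGT") for _ in range(insert_len))
        reads.append(stem + insert)
    return reads

def per_cycle_diversity(reads, n_cycles):
    """Number of distinct bases observed at each of the first n_cycles."""
    return [len({read[i] for read in reads}) for i in range(n_cycles)]

# Hypothetical stems, chosen only for illustration (NOT from Table 1).
STANDARD_STEMS = ["GTACGTACGTAC"]                            # one stem, one length
DEPHASED_STEMS = ["ACGTACGT", "CGTACGTACG", "GTACGTACGTAC"]  # 8, 10, 12 nt

standard_div = per_cycle_diversity(simulate_reads(STANDARD_STEMS, 300), 8)
dephased_div = per_cycle_diversity(simulate_reads(DEPHASED_STEMS, 300), 8)

print("standard stems:", standard_div)
print("dephased stems:", dephased_div)
```

In this toy model every read from the standard pool shows the same base at each of the first eight cycles, while the dephased pool shows more than one base at each of those cycles.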
[00126] The bud stem-loop adaptors with dephased stems of 8, 10, and 12 nucleotides were ligated to 0.5 ng of pooled plasma input DNA and amplified for 14 cycles. All of the adaptors produced similar amplification results, with the 10 or 8 bp stem adaptors amplifying slightly better than the 12 bp stem adaptors (FIG. 4), possibly due to better strand displacement during PCR. All of the bud adaptors showed a clear delta Ct between the sample and NTC libraries.
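As a non-limiting illustration, the delta Ct between a sample library and a no-template control (NTC) library can be computed as sketched below. The Ct values are invented for the example and are not data from FIG. 4. The conversion to a fold difference assumes perfect doubling per cycle (efficiency of 1.0), which is an assumption and not a statement from this disclosure.

```python
def delta_ct(ct_ntc, ct_sample):
    """Delta Ct: how many cycles later the NTC crosses threshold than the sample."""
    return ct_ntc - ct_sample

def fold_difference(d_ct, efficiency=1.0):
    """Approximate template ratio implied by a delta Ct.

    efficiency=1.0 means the product doubles every cycle (assumed).
    """
    return (1.0 + efficiency) ** d_ct

# Hypothetical Ct values, for illustration only.
example_sample_ct = 10.0
example_ntc_ct = 16.0

d = delta_ct(example_ntc_ct, example_sample_ct)
print("delta Ct:", d)
print("approximate fold difference:", fold_difference(d))
```

A larger delta Ct indicates that the sample library amplifies well ahead of background product in the NTC.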
[00127] Finally, sequencing of the bud adaptors with variable stems (Table 1) was performed with the key metric being the Estimated Library Size (ELS) to measure the diversity of the library. The results showed that the bud adaptors with the dephased stems and molecular identifiers of 6 bp had an increased ELS (FIGS. 5A to 5C). Thus, the bud adaptors with the molecular identifier, 5' phosphorothioate bond, and dephased stems can be used to produce a diverse library for next generation sequencing.
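For orientation, one common way to estimate library size from duplication data is the Lander-Waterman relation C/X = 1 - exp(-N/X), where N is the number of read pairs examined, C is the number of unique read pairs, and X is the library size. The sketch below solves this relation for X by bisection. It assumes that the ELS referred to above is of this kind; the disclosure does not state the formula, and the function name and example counts are hypothetical.

```python
import math

def estimate_library_size(read_pairs, unique_read_pairs):
    """Solve C/X = 1 - exp(-N/X) for X by bisection.

    Returns None when no duplicates were observed (C >= N) or when there
    are no reads, since the size cannot be estimated in those cases.
    """
    n, c = float(read_pairs), float(unique_read_pairs)
    if not (0.0 < c < n):
        return None

    def f(x):
        # Positive when x is below the root, negative when above it.
        return c / x - 1.0 + math.exp(-n / x)

    lo, hi = c, 100.0 * c          # the library holds at least C molecules
    while f(hi) >= 0.0:            # widen the bracket until it spans the root
        hi *= 10.0
    for _ in range(100):           # bisection to floating-point precision
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical counts, for illustration only.
print(estimate_library_size(read_pairs=1_000_000, unique_read_pairs=900_000))
```

Under this model, a library that yields fewer duplicates at the same sequencing depth returns a larger estimated size, which is the sense in which ELS serves as a measure of diversity.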
* * *
[00128] Notwithstanding the appended claims, the disclosure is also defined by the following clauses:
1. A population of double-stranded nucleic acid adaptors for ligating to a population of nucleic acid target molecules, the double-stranded adaptors comprising:
(a) a ligateable stem region having a terminal 5' end strand and a terminal 3' end strand;
(b) a non-ligateable stem region having a terminal 5' end strand and a terminal 3' end strand; and
(c) an asymmetric loop region between the ligateable stem region and the non-ligateable stem region, wherein the asymmetric loop region comprises a molecular identification sequence (MIS).
2. The population of clause 1, wherein the double-stranded adaptors are further defined as a single-stranded nucleic acid molecule that under ligation conditions forms a stem-loop adaptor having a distal loop region attached to the non-ligateable stem region.
3. The population of clause 1 or 2, wherein the non-ligateable stem region comprises a primer binding site.
4. The population of clause 3, wherein the double-stranded adaptors within the population comprise a mixture of double-stranded adaptors with a first primer binding site and double-stranded adaptors with a second primer binding site.
5. The population of any of the preceding clauses, wherein the ligateable stem region further comprises a variable stem region defined as a region whose length varies among the members of the population.
6. The population of clause 5, wherein said molecular identification sequence is unique to a subset of the population.
7. The population of clause 5, wherein said molecular identification sequence is degenerate to a subset of the population.
8. The population of clause 5, wherein said molecular identification sequence is unique to a subset of the population and degenerate to another subset of the population.
9. The population of any of the preceding clauses, wherein the non-ligateable stem region comprises one or more mismatched bases.
10. The population of any of the preceding clauses, wherein the asymmetric loop region is formed between the terminal 3' end strand of the ligateable stem region and the terminal 5' end strand of the non-ligateable stem region.
11. The population of any of clauses 1 to 9, wherein the asymmetric loop region is formed between the terminal 5' end strand of the ligateable stem region and the terminal 3' end strand of the non-ligateable stem region.
12. The population of any of the preceding clauses, wherein said double-stranded nucleic acid adaptors comprise DNA.
13. The population of any of the preceding clauses, wherein said double-stranded nucleic acid adaptors comprise RNA.
14. The population of any of the preceding clauses, wherein said double-stranded nucleic acid adaptors comprise DNA and RNA.
15. The population of any of the preceding clauses, wherein the population of nucleic acid target molecules comprises genomic DNA, fragmented DNA, cDNA, amplified DNA, or a nucleic acid library.
16. The population of any of the preceding clauses, wherein the distal loop region comprises a non-replicable base.
17. The population of clause 16, wherein said non-replicable base comprises an abasic site.
18. The population of clause 17, wherein said abasic site comprises a 1',2'-dideoxyribose.
19. The population of clause 16, wherein said non-replicable base comprises a deoxyuridine or a ribonucleotide base.
20. The population of any of the preceding clauses, wherein a gap region on the strand opposite the asymmetric loop region has a length of at least one nucleotide shorter than that of the asymmetric loop region.
21. The population of clause 20, wherein the length of the gap region is at least 2 nucleotides shorter than that of the asymmetric loop region.
22. The population of clause 20, wherein the length of the gap region is less than 5 nucleotides.
23. The population of clause 20, wherein the length of the gap region is 1 nucleotide.
24. The population of clause 20, wherein the gap region has a length of a bond between two adjacent nucleotides.
25. The population of any of the preceding clauses, wherein the asymmetric loop region is 4 to 16 nucleotides in length.
26. The population of any of the preceding clauses, wherein the asymmetric loop region is 5 to 8 nucleotides in length.
27. The population of any of the preceding clauses, wherein the asymmetric loop region is 6 nucleotides in length.
28. The population of any of clauses 20 to 27, wherein the gap region comprises a spacer incapable of base-pairing.
29. The population of clause 28, wherein the spacer comprises an abasic site.
30. The population of clause 29, wherein the abasic site comprises a 1',2'-dideoxyribose.
31. The population of any of the preceding clauses, wherein said molecular identification sequence comprises 5-10 nucleotides.
32. The population of any of the preceding clauses, wherein said molecular identification sequence comprises 6, 7, or 8 nucleotides.
33. The population of any of the preceding clauses, wherein said molecular identification sequence comprises 6 nucleotides.
34. The population of any of the preceding clauses, wherein said molecular identification sequence is unique throughout the population.
35. The population of any of the preceding clauses, wherein said molecular identification sequence is partially degenerate within the population.
36. The population of any of the preceding clauses, wherein said molecular identification sequence is degenerate within the population.
37. The population of any of clauses 5 to 35, wherein said variable stem comprises 4-15 nucleotides.
38. The population of clause 37, wherein said variable stem comprises 8-11 nucleotides.
39. The population of clause 38, wherein the variable stem comprises 8, 9, 10, or 11 nucleotides.
40. The population of any of the preceding clauses, wherein a 5' terminal end and/or 3' terminal end of the ligateable stem comprise nucleotides having phosphorothioate linkages.
41. The population of any of the preceding clauses, wherein a 5' terminal end of the ligateable stem comprises a ligation block.
42. The population of clause 41, wherein the ligation block is a dephosphorylated nucleotide, a 5' hydroxyl group, a dideoxy nucleotide, or an inverted dT.
43. The population of any of the preceding clauses, wherein the ligateable stem region further comprises one or more replication blocks or cleavable bases between a 3' terminal end or the 5' terminal end and the asymmetric loop or gap region.
44. The population of any of the preceding clauses, wherein the non-ligateable stem region further comprises one or more replication blocks or cleavable bases between the asymmetric loop or gap region and the distal loop region.
45. The population of clause 43 or 44, wherein the cleavable base is inosine, uracil, or ribonucleotide.
46. A method for producing a library of adaptor-bound target nucleic acids comprising:
(a) providing a population of target nucleic acid molecules and attaching to each end a double-stranded nucleic acid adaptor according to any one of clauses 1-45, thereby generating a population of adapter-bound target nucleic acid molecules; and
(b) replacing one strand of the adaptor-bound target nucleic acid molecules by strand displacement or nick translation to make an exact copy of the asymmetric loop.
47. The method of clause 46, wherein the population of adaptor-bound target nucleic acid molecules comprise a first double-stranded nucleic acid adaptor with a first primer binding site attached on one end and a second double-stranded nucleic acid adaptor with a second primer binding site attached on the other end.
48. The method of clauses 46 or 47, wherein attaching is further defined as ligating.
49. The method of clause 48, wherein attaching is further defined as double strand ligation.
50. The method of clause 48, wherein attaching is further defined as single strand ligation.
51. The method of clause 46, wherein attaching is further defined as ligating a double-stranded nucleic acid adaptor to complementary single strands.
52. The method of clause 46, wherein attaching is further defined as blunt end ligation.
53. The method of clause 46, wherein attaching is further defined as ligation to an overhang.
54. The method of clause 46, wherein the adaptor-bound nucleic acid molecule comprises a nick having a 3' hydroxyl group.
55. The method of clause 46, wherein the strand displacement or nick translation polymerization is further defined as polymerization that ceases at a non-replicable base or region in the asymmetric loop or in a region of the non-ligateable stem region adjacent to the asymmetric loop.
56. A library of adaptor-bound target nucleic acids produced by the method of clause 46.
57. A method for producing a library of adaptor-bound target nucleic acids comprising:
(a) providing a population of target nucleic acid molecules and attaching to one end a double-stranded nucleic acid adaptor according to any one of clauses 1-45, thereby generating a population of adapter-bound target nucleic acid molecules; and
(b) displacing one strand of the adaptor-bound target nucleic acid molecules by strand displacement or nick translation, such that a complementary copy of the asymmetric loop is incorporated into the replaced strand.
58. The method of clause 57, wherein attaching is further defined as ligating.
59. The method of clause 57, wherein attaching is further defined as double strand ligation.
60. The method of clause 57, wherein attaching is further defined as single strand ligation.
61. The method of clause 56, wherein attaching is further defined as ligating a double-stranded nucleic acid adaptor to complementary single strands.
62. The method of clause 57, wherein attaching is further defined as blunt end ligation.
63. The method of clause 57, wherein attaching is further defined as ligation to an overhang.
64. The method of clause 57, wherein the adaptor-bound nucleic acid molecule comprises a nick having a 3' hydroxyl group.
65. The method of clause 57, wherein the strand displacement or nick translation polymerization is further defined as polymerization that ceases at a non-replicable base or region in the asymmetric loop or in a region of the non-ligateable stem adjacent to the asymmetric loop.
66. A library of adaptor-bound target nucleic acids produced by the method of clause 57.
67. A method for producing a library of adaptor-bound target nucleic acids comprising:
(a) providing a population of target nucleic acid molecules, attaching to one end a first double-stranded nucleic acid adaptor according to any one of clauses 1-45, and attaching to the other end a second double-stranded nucleic acid adaptor optionally comprising a MIS, thereby generating a population of adapter-bound target nucleic acid molecules; and
(b) replacing one strand of the adaptor-bound target nucleic acid molecules by strand displacement or nick translation such that a complementary copy of the second strand is incorporated into the replaced strand.
68. The method of clause 67, wherein the second double-stranded nucleic acid adaptor does not comprise a MIS.
69. The method of clause 68, wherein the second double-stranded nucleic acid adaptor does not comprise an asymmetric loop.
70. The method of clause 67, wherein the first double-stranded nucleic acid adaptor and/or second double-stranded nucleic acid adaptor are stem-loop adaptors.
71. The method of clause 70, wherein the first double-stranded nucleic acid adaptor comprises a first primer binding site in the non-ligateable stem region and the second double-stranded nucleic acid adaptor comprises a second primer binding site in the non-ligateable stem region.
72. The method of clause 67, wherein attaching is further defined as ligating.
73. The method of clause 67, wherein attaching is further defined as double strand ligation.
74. The method of clause 67, wherein attaching is further defined as single strand ligation.
75. The method of clause 66, wherein attaching is further defined as ligating a stem-loop nucleic acid adaptor to complementary single strands.
76. The method of clause 67, wherein attaching is further defined as blunt end ligation.
77. The method of clause 67, wherein attaching is further defined as ligation to an overhang.
78. The method of clause 67, wherein the adaptor-bound nucleic acid molecule comprises a nick having a 3' hydroxyl group.
79. The method of clause 67, wherein the strand displacement or nick translation polymerization is further defined as polymerization that ceases at a non-replicable base or region in the loop or in a region of the stem adjacent to the loop.
80. The method of clause 67, further comprising preventing MIS switching.
81. The method of clause 80, wherein the library of adaptor-bound target nucleic acids and excess double-stranded nucleic acid adaptors are contacted with terminal deoxyribonucleotidyl transferase (TdT).
82. The method of clause 80, wherein PCR purification is performed on the adaptor-bound target nucleic acids and excess double-stranded nucleic acid adaptors.
83. The method of clause 80, wherein the first or second double-stranded nucleic acid adaptors comprise one or more uracils within the ligateable stem region.
84. The method of clause 83, further comprising contacting the library of adaptor-bound target nucleic acids and excess double-stranded nucleic acid adaptors with USER enzyme.
85. The method of clause 83, further comprising contacting the library and excess adaptors with exonuclease I and incubating at a non-denaturing temperature prior to performing PCR.
[00129] Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it is readily apparent to those of ordinary skill in the art in light of the teachings of this invention that certain changes and modifications may be made thereto without departing from the spirit or scope of the appended claims.
[00130] Accordingly, the preceding merely illustrates the principles of the invention. It will be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the invention and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure. The scope of the present invention, therefore, is not intended to be limited to the exemplary embodiments shown and described herein. Rather, the scope and spirit of the present invention is embodied by the appended claims.
REFERENCES
The following references, to the extent that they provide exemplary procedural or other details supplementary to those set forth herein, are specifically incorporated herein by reference.
Duncan, DNA Glycosylases, In: The Enzymes, XIV:565-586, 1981.
Grokhovsky, Specificity of DNA cleavage by ultrasound, Mol. Biol., 40:276-283, 2006.
Innis et al., eds., PCR Protocols: A Guide to Methods and Applications, Academic Press Inc., San Diego, Calif., 1990.
International Publication No. PCT/US2013/068468
Lundberg et al., Nature Methods, 10:999-1002, 2013.
Margulies et al., Nature, 437:376-380, 2005.
McPherson et al., editors, PCR: A Practical Approach and PCR2: A Practical Approach (IRL Press, Oxford, 1991 and 1995).
Oefner et al., Nucleic Acids Research, 24:3879-3886, 1996.
Pareek et al., Sequencing technologies and genome sequencing, J. Appl. Genet., 52(4):413-435, 2011.
Quail, DNA: Mechanical Breakage, In: Encyclopedia of Life Sciences (ELS). John Wiley & Sons, Ltd., Chichester, pp. 1-5, 2010.
Sambrook and Russell, Fragmentation of DNA by nebulization, Cold Spring Harb. Protoc., 2006.
Sambrook et al., "Molecular Cloning," A Laboratory Manual, 2d Ed., Cold Spring Harbor Laboratory Press, New York, 13.7-13.9: 1989.
Thorstenson et al., Genome Research, 8:848-855, 1998.
Thudi et al., Current state-of-art of sequencing technologies for plant genomics research, Brief. Funct. Genomics, 11(1):3-11, 2012.
Wu et al., BMC Microbiology, 15:125, 2015.
U.S. Patent No. 4,683,195
U.S. Patent No. 4,683,202
U.S. Patent No. 4,800,159
U.S. Patent No. 4,873,192
U.S. Patent No. 5,858,656
U.S. Patent No. 5,935,793
U.S. Patent No. 6,261,782
U.S. Patent No. 6,777,187
U.S. Patent No. 7,435,572
U.S. Patent No. 7,757,561
U.S. Patent No. 7,803,550
U.S. Patent No. 7,902,122
U.S. Patent No. 7,981,368
U.S. Patent No. 8,263,005
U.S. Patent No. 8,353,619
U.S. Patent No. 8,459,121
U.S. Patent Publication No. 2007/0020640
U.S. Patent Publication No. 2009/0068645
U.S. Patent Publication No. 2010/0227329
U.S. Patent Publication No. 2010/0273219
U.S. Patent Publication No. 2011/0015096
U.S. Patent Publication No. 2011/0257031
U.S. Patent Publication No. 2011/0319290
U.S. Patent Publication No. 2012/0028814
U.S. Patent Publication No. 2012/0264228

Claims

WHAT IS CLAIMED IS:
1. A population of double-stranded nucleic acid adaptors for ligating to a population of nucleic acid target molecules, the double-stranded adaptors comprising:
(a) a ligateable stem region having a terminal 5' end strand and a terminal 3' end strand;
(b) a non-ligateable stem region having a terminal 5' end strand and a terminal 3' end strand; and
(c) an asymmetric loop region between the ligateable stem region and the non-ligateable stem region, wherein the asymmetric loop region comprises a molecular identification sequence (MIS).
2. The population of claim 1, wherein the double-stranded adaptors are further defined as a single-stranded nucleic acid molecule that under ligation conditions forms a stem-loop adaptor having a distal loop region attached to the non-ligateable stem region.
3. The population of claim 1 or 2, wherein the non-ligateable stem region comprises a primer binding site.
4. The population of claim 3, wherein the double-stranded adaptors within the population comprise a mixture of double-stranded adaptors with a first primer binding site and double-stranded adaptors with a second primer binding site.
5. The population of any of the preceding claims, wherein the ligateable stem region further comprises a variable stem region defined as a region whose length varies among the members of the population.
6. The population of any of the preceding claims, wherein the non-ligateable stem region comprises one or more mismatched bases.
7. The population of any of claims 2 to 6, wherein the distal loop region comprises a non-replicable base.
8. The population of any of the preceding claims, wherein a gap region on the strand opposite the asymmetric loop region has a length of at least one nucleotide shorter than that of the asymmetric loop region.
9. The population of any of the preceding claims, wherein the asymmetric loop region is 4 to 16 nucleotides in length.
10. The population of any of the preceding claims, wherein the molecular identification sequence comprises 5-10 nucleotides.
11. The population of any of the preceding claims, wherein one strand of the ligateable stem region comprises a terminal ligation block.
12. The population of any of the preceding claims, wherein said molecular identification sequence is unique throughout the population.
13. A method for producing a library of adaptor-bound target nucleic acids comprising:
(a) providing a population of target nucleic acid molecules and attaching to each end a double-stranded nucleic acid adaptor according to any one of claims 1-12, thereby generating a population of adapter-bound target nucleic acid molecules; and
(b) replacing one strand of the adaptor-bound target nucleic acid molecules by strand displacement or nick translation to make an exact copy of the asymmetric loop.
14. The method of claim 13, wherein attaching is further defined as ligating.
15. A library of adaptor-bound target nucleic acids produced by the method of any of claims 13 to 14.
16. A double-stranded nucleic acid adaptor according to any one of claims 1-12.
PCT/US2017/045976 2016-08-09 2017-08-08 Nucleic acid adaptors with molecular identification sequences and use thereof WO2018031588A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662372543P 2016-08-09 2016-08-09
US62/372,543 2016-08-09

Publications (1)

Publication Number Publication Date
WO2018031588A1 WO2018031588A1 (en) 2018-02-15

Family

ID=61162494

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/045976 WO2018031588A1 (en) 2016-08-09 2017-08-08 Nucleic acid adaptors with molecular identification sequences and use thereof

Country Status (1)

Country Link
WO (1) WO2018031588A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6175002B1 (en) * 1997-04-15 2001-01-16 Lynx Therapeutics, Inc. Adaptor-based sequence analysis
US20070212704A1 (en) * 2005-10-03 2007-09-13 Applera Corporation Compositions, methods, and kits for amplifying nucleic acids
US20120238738A1 (en) * 2010-07-19 2012-09-20 New England Biolabs, Inc. Oligonucleotide Adapters: Compositions and Methods of Use
US20120244525A1 (en) * 2010-07-19 2012-09-27 New England Biolabs, Inc. Oligonucleotide Adapters: Compositions and Methods of Use
WO2015134552A1 (en) * 2014-03-03 2015-09-11 Swift Biosciences, Inc. Enhanced adaptor ligation
US20170211140A1 (en) * 2015-12-08 2017-07-27 Twinstrand Biosciences, Inc. Adapters, methods, and compositions for duplex sequencing

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11332784B2 (en) 2015-12-08 2022-05-17 Twinstrand Biosciences, Inc. Adapters, methods, and compositions for duplex sequencing
US11479807B2 (en) 2017-03-23 2022-10-25 University Of Washington Methods for targeted nucleic acid sequence enrichment with applications to error corrected nucleic acid sequencing
US11739367B2 (en) 2017-11-08 2023-08-29 Twinstrand Biosciences, Inc. Reagents and adapters for nucleic acid sequencing and methods for making such reagents and adapters
US11845985B2 (en) 2018-07-12 2023-12-19 Twinstrand Biosciences, Inc. Methods and reagents for characterizing genomic editing, clonal expansion, and associated applications
CN113166742A (en) * 2018-10-24 2021-07-23 华盛顿大学 Methods and kits for depleting and enriching nucleic acid sequences
CN116676175A (en) * 2023-03-17 2023-09-01 四川大学 Multi-bar code direct RNA nanopore sequencing classifier
CN116676175B (en) * 2023-03-17 2024-04-09 四川大学 Multi-bar code direct RNA nanopore sequencing classifier

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17840167

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17840167

Country of ref document: EP

Kind code of ref document: A1