WO2018057779A1 - Compositions of synthetic transposons and methods of use thereof - Google Patents

Compositions of synthetic transposons and methods of use thereof

Info

Publication number
WO2018057779A1
Authority
WO
WIPO (PCT)
Prior art keywords
complementary region
strand
synthetic
nucleic acid
sequence
Prior art date
Application number
PCT/US2017/052776
Other languages
English (en)
Inventor
Jianbiao Zheng
Original Assignee
Jianbiao Zheng
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jianbiao Zheng
Publication of WO2018057779A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA

Definitions

  • the present invention relates to the field of genomics, in particular, sequencing and analysis of nucleic acids.
  • kits to provide phasing information of sequencing reads that facilitate assembly of whole genome sequences and other long-range sequences.
  • Commercial kits are available from e.g., Complete Genomics, Illumina, or 10x Genomics. Also see, for example, Peters B.A. et al., Nature 487: 190-195, 2012; Kaper F. et al., Proc. Natl. Acad. Sci. 110: 5552-5557, 2013; Amini S. et al., Nature Genetics 46: 1343-1349, 2014; McCoy R.C. et al., PLOS One 9: e10668, 2014; Zheng G. X. Y.
  • Transposases can be used to introduce mutations or insert sequences in nucleic acids. Previously, transposases were used for in vitro or in vivo mutagenesis (e.g., US6,159,736) or for producing protein tags (e.g., US5,652,128). Several companies including NEB, Epicentre (now part of Illumina) and Finnzymes have provided kits for these purposes. Transposases have also been used to fragment target DNA and to introduce primer binding sequences at the same time. See, for example, US6,593,113, 2003; US9,115,396; US9,145,623; and Adey A. et al., Genome Biol. 11: R119, 2010. Commercial kits are available, including, for example, NEXTERA® DNA Sample Prep kits by Illumina/Epicentre and MUSEEK Library Preparation kits by Thermo Scientific.
  • the present invention provides compositions, methods, kits and analysis tools for high-quality sequencing of nucleic acids, haplotyping and quantification of whole genome or targeted sequences.
  • the compositions comprise one or more synthetic transposons having two non-complementary regions linked to each other, and the synthetic transposons may or may not contain molecular barcodes.
  • One aspect of the present application provides a synthetic transposon comprising: a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides.
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides.
  • the cleavable nucleotide is a uracil nucleotide.
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other.
  • the first single-stranded linker and the second single-stranded linker hybridize to each other.
  • each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the cleavable nucleotide is a uracil nucleotide.
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • ME: mosaic element
  • the synthetic transposon further comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region.
  • the first molecular barcode and the second molecular barcode have the same barcode sequence.
  • the first molecular barcode and the second molecular barcode are double-stranded.
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, and wherein each synthetic transposon has a different barcode sequence.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
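As a rough illustration of why the number of random barcode positions matters when each synthetic transposon should carry a different barcode, the sketch below computes the number of possible barcodes and a birthday-problem estimate of the chance that two transposons share one. It assumes fully random positions drawn uniformly from A, C, G and T; the function names and the pool size of 1000 transposons are illustrative choices, not values from the application.

```python
import math


def barcode_space(n_random: int, alphabet_size: int = 4) -> int:
    """Number of distinct barcodes with n_random fully random positions."""
    return alphabet_size ** n_random


def collision_probability(n_transposons: int, n_random: int) -> float:
    """Birthday-problem approximation of the chance that at least two of
    n_transposons uniformly drawn barcodes are identical (assumption:
    every barcode is equally likely)."""
    space = barcode_space(n_random)
    k = n_transposons
    return 1.0 - math.exp(-k * (k - 1) / (2.0 * space))


# Compare a few barcode lengths for a hypothetical pool of 1000 transposons.
for n in (5, 10, 15):
    print(n, barcode_space(n), round(collision_probability(1000, n), 4))
```

With five random positions there are only 4^5 = 1024 possible barcodes, so longer random stretches may be preferable when many transposons are inserted into the same target molecule.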
  • One aspect of the present application provides a method of preparing a library of template nucleic acids, comprising: (a) contacting a target nucleic acid with any one of the compositions described above, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids.
  • step (c) comprises treating the repaired target nucleic acid with an endonuclease.
  • the endonuclease is uracil DNA glycosylase (UDG).
  • step (c) comprises denaturing of the repaired target nucleic acid.
  • the method further comprises treating the denatured repaired target nucleic acid with an exonuclease.
  • the method further comprises amplifying the library of template nucleic acids.
  • the amplifying is whole-genome amplification.
  • the amplifying is targeted amplification.
  • the library of template nucleic acids is amplified by a polymerase chain reaction (PCR).
  • the PCR comprises contacting the template nucleic acids with a first primer that hybridizes to the first adapter sequence or reverse complement thereof, and a second primer that hybridizes to the second adapter sequence or reverse complement thereof.
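The relationship between a primer and "an adapter sequence or reverse complement thereof" can be sketched as follows. This is a simplified illustration that treats hybridization as an exact reverse-complement match; the adapter sequences and helper names are hypothetical and are not taken from the application.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def hybridizes(primer: str, strand: str) -> bool:
    """Simplified check: the primer can anneal to the strand if the strand
    contains the primer's exact reverse complement (no mismatches allowed)."""
    return reverse_complement(primer) in strand


ADAPTER_F = "GATTACAGG"  # hypothetical first adapter sequence
ADAPTER_R = "CCTAGTTCA"  # hypothetical second adapter sequence

# A primer that hybridizes to F itself is the reverse complement of F.
primer_1 = reverse_complement(ADAPTER_F)
# A primer that hybridizes to the reverse complement of R is R itself.
primer_2 = ADAPTER_R

print(hybridizes(primer_1, "AAA" + ADAPTER_F + "TTT"))
print(hybridizes(primer_2, reverse_complement(ADAPTER_R)))
```

Real primer design would also account for melting temperature, mismatches and secondary structure, which this sketch ignores.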
  • the library of template nucleic acids is amplified by rolling circle amplification (RCA) using a primer comprising a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • the method further comprises circularizing the template nucleic acids prior to the RCA.
  • the polymerase is T4 DNA polymerase.
  • the transposase is Tn5 transposase.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • One aspect of the present application provides a method of analyzing a target nucleic acid, comprising: (a) preparing a library of template nucleic acids from the target nucleic acid using the method of any one of the methods of preparing a library of template nucleic acids described above; and (b) sequencing the library of template nucleic acids to obtain sequencing reads.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, wherein each synthetic transposon has a different barcode sequence, and wherein the method further comprises: (c) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids.
  • step (c) comprises: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same barcode sequences in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the barcode sequences in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
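The barcode-based assembly in step (c) can be pictured with a minimal sketch. It assumes that each sequenced fragment has already been reduced to a tuple of (left barcode, insert sequence, right barcode), that both halves of one inserted transposon carry the same barcode, and that fragments form a single unbranched chain. The data structure and function name are hypothetical; a real pipeline would also handle read orientation, sequencing errors and the duplicated target sequences at each insertion site.

```python
from typing import List, Optional, Tuple

# (left barcode, insert sequence, right barcode); None marks a molecule end.
Fragment = Tuple[Optional[str], str, Optional[str]]


def order_fragments(fragments: List[Fragment]) -> List[str]:
    """Chain fragments by matching the right barcode of one fragment to the
    identical left barcode of the next fragment."""
    by_left = {left: (seq, right) for left, seq, right in fragments if left is not None}
    right_barcodes = {right for _, _, right in fragments if right is not None}

    # The chain starts at the fragment whose left barcode is never seen on a right end.
    starts = [f for f in fragments if f[0] is None or f[0] not in right_barcodes]
    if len(starts) != 1:
        raise ValueError("expected exactly one chain start")

    _, seq, right = starts[0]
    ordered = [seq]
    seen = set()
    while right is not None and right in by_left and right not in seen:
        seen.add(right)  # guard against circular chains
        seq, right = by_left[right]
        ordered.append(seq)
    return ordered


reads = [
    ("BC2", "CCC", "BC3"),
    (None, "AAA", "BC1"),
    ("BC1", "GGG", "BC2"),
    ("BC3", "TTT", None),
]
print(order_fragments(reads))
```

The ordered inserts can then be joined into a contiguous sequence once the transposon-derived sequences have been trimmed.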
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, and wherein the duplicated sequences are used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • the method is used for genome assembly, haplotyping, detection of mutation, chromosomal conformation analysis, or methylation analysis.
  • the mutation is selected from the group consisting of substitution, indel, structural variation, and copy number variation.
  • kits for preparing a library of template nucleic acids comprising: (a) the composition according to any one of the compositions described above; (b) a transposase that recognizes the first transposon recognition site and the second transposon recognition site; and (c) instructions for preparing the library of template nucleic acids.
  • the kit further comprises a polymerase, such as a T4 DNA polymerase.
  • the kit further comprises a ligase.
  • the transposase is Tn5 transposase.
  • the kit further comprises an
  • the kit further comprises a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof.
  • the kit further comprises a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • Reference to "about" a value or parameter herein includes (and describes) variations that are directed to that value or parameter per se. For example, description referring to "about X" includes description of "X".
  • reference to "not" a value or parameter generally means and describes "other than" a value or parameter.
  • For example, "the method is not used to treat cancer of type X" means the method is used to treat cancer of types other than X.
  • FIG. 1 illustrates integration of a paired-barcode synthetic transposon having a regular double-stranded structure into template DNA followed by amplification using dual PCR primers.
  • F denotes a first adapter sequence
  • R denotes a second adapter sequence.
  • Primers designed to match the F and R sequences (i.e., having the same or reverse complementary sequences) are used in the amplification step.
  • FIG. 2A depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205).
  • FIG. 2B depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205), and a single-stranded linker (206) disposed between F and R.
  • FIG. 2C depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a molecular barcode (202/202rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205).
  • FIG. 2D depicts an exemplary synthetic transposon fragment comprising a transposase recognition site (201/201rc), a molecular barcode (202/202rc), a stuff sequence (203/203rc), and a non-complementary region comprising adapter sequences F (204) and R (205), and a single-stranded linker (206) disposed between F and R.
  • FIG. 2E depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and no molecular barcodes, wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *).
  • FIG. 2F depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein only one strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein one molecular barcode is single-stranded.
  • FIG. 2G depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein only one strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein both molecular barcodes are double-stranded.
  • FIG. 2H depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein one molecular barcode is single-stranded.
  • FIG. 2I depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein both molecular barcodes are double-stranded.
  • FIG. 2J depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and a single-stranded linker (206, 206rc) disposed between F and R, wherein the first non-complementary region is connected to the second non-complementary region via hybridization of the single-stranded linkers, wherein each single-stranded linker has a cleavable nucleotide (denoted by *), and wherein one molecular barcode is single-stranded.
  • FIG. 2K depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and a single-stranded linker (206, 206rc) disposed between F and R, wherein the first non-complementary region is connected to the second non-complementary region via hybridization of the single-stranded linkers, wherein each single-stranded linker has a cleavable nucleotide (denoted by *), and wherein both molecular barcodes are double-stranded.
  • FIG. 2L depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), and a single-stranded linker (208, 209) disposed between F and R, and a bridge nucleic acid (207), wherein the first non-complementary region is connected to the second non-complementary region via hybridization of the single-stranded linkers to the bridge nucleic acid, wherein each single-stranded linker has a cleavable nucleotide (denoted by *), and wherein one molecular barcode is single-stranded.
  • FIG. 2M depicts an exemplary synthetic transposon comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and
  • FIG. 2N depicts an exemplary synthetic transposon having one blunt end and one end with a hairpin structure (210), comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein one molecular barcode is single-stranded.
  • FIG. 2O depicts an exemplary synthetic transposon having one blunt end and one end with a hairpin structure (210), comprising two transposase recognition sites (201/201rc, and 201'/201'rc), two molecular barcodes (202/202rc, and 202'/202'rc), stuff sequences (203/203rc, and 203'/203'rc), a first non-complementary region and a second non-complementary region each comprising adapter sequences F (204) and R (205), wherein each strand of the first non-complementary region is fused to one strand of the second non-complementary region via one or more cleavable nucleotides (denoted by *), and wherein both molecular barcodes are double-stranded.
  • FIG. 3 shows an exemplary method of preparing the synthetic transposons of FIG. 2F and FIG. 2G having two identical molecular barcode sequences.
  • 301/301rc: sequences containing transposon binding sites
  • 302/302rc: sequences containing molecular barcodes
  • 303/303rc: stuff sequences for first priming
  • 304/304rc: stuff sequences for second priming
  • 305: fixed sequences that may contain PCR primer 1 or F if needed
  • 306: fixed sequences that may contain PCR primer 2 or R if needed.
  • the designation "rc" after a number indicates a reverse complementary sequence.
  • "U" is a uracil nucleotide and is used as an example for the cleavable nucleotide.
  • FIG. 4 shows an exemplary method of preparing the synthetic transposons of FIG. 2H and FIG. 2I having two identical molecular barcode sequences.
  • 401/401rc: sequences containing transposon binding sites
  • 402/402rc: sequences containing molecular barcodes
  • 403/403rc: stuff sequences for first priming
  • 404/404rc: stuff sequences for second priming
  • 405: fixed sequences that may contain PCR primer 1 or F
  • 406: fixed sequences that may contain PCR primer 2 or R.
  • the designation "rc" after a number indicates a reverse complementary sequence.
  • "U" is a uracil nucleotide and is used as an example for the cleavable nucleotide.
  • FIG. 5 shows an exemplary method of preparing the synthetic transposons of FIG. 2J and FIG. 2K having two identical molecular barcode sequences.
  • 501/501rc: sequences containing transposon binding sites
  • 502/502rc: sequences containing molecular barcodes
  • 503/503rc: stuff sequences
  • 504/504rc: fixed sequences for 1st priming
  • 505/505rc: fixed sequences with modified blocked 3'-end for 2nd priming
  • 506: fixed sequence that may contain PCR primer 1 or F
  • 507/507rc: linker sequences connecting the two non-complementary regions
  • 508: fixed sequence that may contain PCR primer 2 or R
  • 509: additional sequence for flexibility of the structures.
  • "*" indicates a cleavable nucleotide or a cleavage site.
  • the 3'-end in 505rc may contain a phosphate group (P) or a reversible dideoxynucleotide.
  • the 3'-phosphoryl group can be removed by T4 polynucleotide kinase (T4 PNK) available commercially (e.g., NEB T4 PNK, catalogue # M0201L).
  • FIG. 6 shows an exemplary method of preparing the synthetic transposons of FIG. 2L and FIG. 2M.
  • 601/601rc: sequence containing transposon recognition sites
  • 602/602rc: sequence containing molecular barcodes
  • 603/603rc: stuff sequences
  • 604/604rc: fixed sequences for 1st priming
  • 605/605rc: fixed sequences with blocked 3'-end for 2nd priming after being deblocked
  • 606: fixed sequence that may contain PCR primer 1 or F
  • 607: linker sequence connecting the first non-complementary region to bridge oligo (607rc+611+610rc)
  • 608: fixed sequence that may contain PCR primer 2 or R
  • 609: additional sequence to provide flexibility of the structure
  • 610: linker sequence connecting the second non-complementary region to bridge oligo
  • FIG. 7 shows an exemplary method of preparing the synthetic transposons of FIG. 2N and FIG. 2O.
  • 701/701rc: sequences containing transposon binding sites
  • 702/702rc: sequences containing molecular barcodes
  • 703/703rc: stuff sequences for 1st priming
  • 704/704rc: stuff sequences for 2nd priming
  • 705: fixed sequences that may contain PCR primer 1 or F
  • 706: fixed sequences that may contain PCR primer 2 or R.
  • the designation "rc" after a number indicates a reverse complementary sequence.
  • "U" is a uracil nucleotide and is used as an example for the cleavable nucleotide.
  • "*" denotes another cleavable nucleotide such as an RNA nucleotide.
  • FIG. 8A shows an exemplary synthetic transposon of FIG. 2I having two different molecular barcodes (801 and 802), and an exemplary method of preparing the synthetic transposon.
  • the synthetic transposon can be prepared using two DNA oligos having adapter sequences F and R that correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers. On each strand of the synthetic transposon, the F and R sequences are fused to each other via a uracil nucleotide.
  • FIG. 8B shows an exemplary synthetic transposon of FIG. 2E having no molecular barcodes, and an exemplary method of preparing the synthetic transposon.
  • the synthetic transposon can be prepared using two DNA oligos having adapter sequences F and R that correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers. On each strand of the synthetic transposon, the F and R sequences are fused to each other via a UU dinucleotide.
  • FIG. 8C shows an exemplary synthetic transposon fragment of FIG. 2C having a molecular barcode (803), and an exemplary method of preparing the synthetic transposon fragment.
  • the adapter sequences F and R correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers.
  • FIG. 8D shows an exemplary synthetic transposon fragment of FIG. 2D comprising a single oligonucleotide forming a hairpin structure that does not contain a molecular barcode, and an exemplary method of preparing the synthetic transposon fragment.
  • the adapter sequences F and R correspond to partial sequences of Read1rc and Read2 of the Illumina sequencing primers.
  • FIG. 8E shows exemplary primers that can be used for multiplexed paired-end sequencing of libraries prepared using the synthetic transposons of FIGS. 8A-8D.
  • FIG. 9 shows an exemplary method of preparing a library of template nucleic acids for sequencing using a plurality of synthetic transposons of FIG. 2I.
  • the synthetic transposons are integrated into a target DNA, followed by repair and UDG treatment, PCR amplification with dual primers, which may contain a sequence matching F or R, and any additional adapter sequences (shown as dotted lines) needed for sequencing or analysis.
  • FIG. 10 shows an exemplary method of preparing a library of template nucleic acids using a plurality of synthetic transposons of FIG. 2G.
  • the synthetic transposons are integrated into a target DNA, followed by repair, UDG treatment, and nick ligation to circularize the nucleic acid fragments, thereby allowing downstream analysis, such as rolling circle amplification (RCA) or single molecule sequencing.
  • RCA: rolling circle amplification
  • FIG. 11 shows an exemplary method of preparing a library of template nucleic acids using a plurality of synthetic transposons of FIG. 2M.
  • the synthetic transposons are integrated into a target DNA, followed by repair, denaturation, and exonuclease treatment to provide circular nucleic acid fragments, which can be amplified by rolling circle amplification (RCA), or analyzed by RCA sequencing or single molecule sequencing methods.
  • 1001: molecular barcode from the synthetic transposon shown on the left
  • 1002: molecular barcode from the synthetic transposon shown on the right.
  • FIG. 12 shows an exemplary pipeline for analyzing sequencing data from Illumina reads of a sequencing library prepared using the synthetic transposons of the present application.
  • the sequencing library may be a PCR-amplified library prepared as in FIG. 9.
  • the present application discloses synthetic transposons, methods and kits for preparing sequencing libraries from a target nucleic acid, which can be analyzed using next-generation sequencing methods.
  • the synthetic transposons of the present application comprise two non-complementary regions that comprise adapter sequences, wherein the two non-complementary regions are connected to each other and are located between two stem fragments each containing a transposase recognition site. Transposition of the synthetic transposons into the target nucleic acid and subsequent steps that separate the non-complementary regions result in fragmentation of the target nucleic acid and introduction of the adapter sequences at the same time.
  • the resulting product may be sequenced directly, or amplified in a subsequent step prior to sequencing using primers that match the adapter sequences.
  • the synthetic transposons are designed to comprise a molecular barcode disposed between the transposase recognition site and the non-complementary region.
  • a plurality of synthetic transposons each having a different molecular barcode may be used to prepare a library preserving the contiguity information in the target nucleic acid through the molecular barcodes.
  • the compositions, methods, kits and analysis tools described herein are useful for many applications, including haplotyping, de novo assembly of whole genomes or long contiguous sequences, sequencing of repetitive regions, detection of structural variations and copy number variations, and methylation analysis.
  • FIG. 1 illustrates a method for preparing a sequencing library using a regular double-stranded synthetic transposon having paired barcodes and a pair of adapter sequences disposed in between the paired barcodes, such as the synthetic transposons described in US Patent No. 8,829,171.
  • the synthetic transposons are integrated into the target DNA, the product of which is repaired, and subsequently PCR amplified using primers that match the adapter sequences (i.e., having the same sequences or reverse complementary sequences as the adapter sequences).
  • the synthetic transposons can be inserted in two opposite orientations, yielding three different potential configurations (Config. 1, 2, and 3 in FIG. 1) for a fragment of the target nucleic acid surrounded by a pair of synthetic transposons with respect to the orientation of the adapter sequences.
  • Config. 1 yields template 1 that can be amplified with high efficiency.
  • Templates 2 and 3 have either F primer binding sites at both ends or R primer binding sites at both ends, which leads to self-hairpin structures during renaturation after the denaturation step.
  • target sequence fragments having configurations 2 and 3 may become missing or under-represented in the amplified library prepared using such a method, leading to difficulty in linking the fragment sequences together for haplotyping purposes, or errors in quantification of the fragments.
  • the sequencing cost could also be increased due to missing or biased amplification using the library preparation method of FIG. 1.
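  • The orientation argument for the method of FIG. 1 can be sketched in Python as follows. This is a simplified, hypothetical model rather than the patent's own analysis: each conventional transposon is assumed to expose adapter F on one side and adapter R on the other, the sides swap when the transposon inserts in the opposite orientation, and the two orientations are treated as equally likely.

```python
from itertools import product

# Hypothetical model: (left side, right side) adapters for each orientation.
SIDES = {"+": ("F", "R"), "-": ("R", "F")}


def fragment_ends(upstream, downstream):
    """Adapters flanking the fragment between two inserted transposons."""
    left_end = SIDES[upstream][1]     # right side of the upstream transposon
    right_end = SIDES[downstream][0]  # left side of the downstream transposon
    return left_end, right_end


def amplifiable(ends):
    """An F/R primer pair amplifies only fragments with one F end and one R end."""
    return set(ends) == {"F", "R"}


ends_by_orientation = {
    pair: fragment_ends(*pair) for pair in product("+-", repeat=2)
}
# Ignoring which end is which, count the distinct fragment configurations.
distinct_configs = {frozenset(ends) for ends in ends_by_orientation.values()}
amplifiable_fraction = (
    sum(amplifiable(ends) for ends in ends_by_orientation.values())
    / len(ends_by_orientation)
)

for pair, ends in ends_by_orientation.items():
    label = "amplifiable" if amplifiable(ends) else "hairpin-prone"
    print(pair, ends, label)
```

  • Under these assumptions the four orientation pairs collapse into three configurations (F/R, F/F, and R/R), and only the F/R configuration is efficiently amplified.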
  • the synthetic transposons and methods described herein solve this problem by incorporating the adapter sequences in the non-complementary regions, and introducing two adapter sequence pairs for each insertion site, thereby yielding only one fragment configuration with respect to the adapter orientations, which is amenable to PCR amplification.
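  • One way to see why a single fragment configuration results is to check that the adapter layout is unchanged when the transposon is inserted in the opposite orientation. The Python sketch below uses a deliberately simplified, hypothetical representation: a transposon is a pair of regions, and each region is a pair of adapter labels (strand 1, strand 2). The labels `A1`, `A2`, `F`, `R` and the `_rc` suffix (reverse complement) are illustrative assumptions.

```python
def flip(transposon):
    """Rotate the duplex 180 degrees: region order and strand order both swap."""
    return tuple(tuple(reversed(region)) for region in reversed(transposon))


# Layout described in the text: the second non-complementary region carries
# the same two adapters as the first, but on opposite strands.
symmetric = (("A1", "A2"), ("A2", "A1"))

# Contrast case (assumption): fully double-stranded adapters, where strand 2
# carries the reverse complement of the adapter on strand 1.
conventional = (("F", "F_rc"), ("R", "R_rc"))

orientation_independent = flip(symmetric) == symmetric
conventional_changes = flip(conventional) != conventional

print("symmetric design unchanged by flipping:", orientation_independent)
print("conventional design changed by flipping:", conventional_changes)
```

  • In this toy model the mirrored design looks the same from either orientation, whereas the conventional design does not, which is consistent with the three configurations of FIG. 1.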
  • some embodiments of the synthetic transposons described herein are used to insert the adapter sequences into target nucleic acid, which is subsequently fragmented using simple denaturation or enzymatic cleavage steps that separate the non-complementary regions. The resulting fragments can be directly sequenced without further ligation to sequencing adapters.
  • Y-shaped adapters comprising sequencing adapters are ligated to fragmented nucleic acids.
  • Such methods require end-processing steps prior to the ligation, such as blunt-end polishing, or addition of T or A to the ends of the fragments.
  • end-processing steps may have varying efficiency for different end sequences, which results in biased coverage of fragments in the target nucleic acids.
  • the synthetic transposons and methods described herein overcome such challenges by introducing adapter sequences and fragmenting the target nucleic acid in a single process.
  • one aspect of the present application provides a synthetic transposon comprising: a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the synthetic transposon further comprises a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region.
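  • The recited structure can be written down as a small data model, which may help when comparing the embodiments that follow. The Python sketch below is illustrative only: all class, field, and method names are hypothetical, and the strings "ME", "BC1", "A1", and "A2" are placeholder labels rather than real sequences.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stem:
    recognition_site: str           # transposase recognition site
    barcode: Optional[str] = None   # optional molecular barcode next to the region


@dataclass(frozen=True)
class Region:
    first_strand_adapter: str
    second_strand_adapter: str


@dataclass(frozen=True)
class Fragment:
    stem: Stem
    region: Region                  # non-complementary region attached to the stem


@dataclass(frozen=True)
class SyntheticTransposon:
    first: Fragment
    second: Fragment

    def adapters_mirrored(self) -> bool:
        """True if the second region carries the same adapters on opposite strands."""
        a, b = self.first.region, self.second.region
        return (a.first_strand_adapter == b.second_strand_adapter
                and a.second_strand_adapter == b.first_strand_adapter)

    def barcoded(self) -> bool:
        """True if both stems carry a molecular barcode."""
        return (self.first.stem.barcode is not None
                and self.second.stem.barcode is not None)


example = SyntheticTransposon(
    first=Fragment(Stem("ME", "BC1"), Region("A1", "A2")),
    second=Fragment(Stem("ME", "BC1"), Region("A2", "A1")),
)
print(example.adapters_mirrored(), example.barcoded())
```

  • The embodiments with four independent adapters correspond to instances for which `adapters_mirrored()` may be false.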
  • compositions comprising a plurality of synthetic transposons, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complement
  • Another aspect of the present application provides a method of preparing a library of template nucleic acids, comprising: (a) contacting a target nucleic acid with any of the synthetic transposons or compositions comprising a plurality of the synthetic transposons described herein, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids.
  • One aspect of the present application provides a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter; wherein the second non-complementary region comprises a first strand comprising a third adapter and a second strand comprising a fourth adapter; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first adapter and the fourth adapter have the same sequence. In some embodiments, the second adapter and the third adapter have the same sequence. In some embodiments, the first non-complementary region and the second non-complementary region comprise different adapters.
  • the present application provides a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter, wherein each of the first strand and the second strand of the first non-complementary region is connected to one strand of the first stem; wherein the second non-complementary region comprises a first strand comprising a third adapter and a second strand comprising a fourth adapter, wherein each of the first strand and the second strand of the second non-complementary region is connected to one strand of the second stem; and wherein the first non-complementary region and the second non-complementary region are connected to each
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter; wherein the second non-complementary region comprises a first strand comprising a third adapter and a second strand comprising a fourth adapter; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first molecular barcode and the second molecular barcode have the same sequence. In some embodiments, the first molecular barcode and the second molecular barcode have different sequences. In some embodiments, the first adapter and the fourth adapter have the same sequence. In some embodiments, the second adapter and the third adapter have the same sequence. In some embodiments, the first non-complementary region and the second non-complementary region comprise different adapters.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the present application provides a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence, wherein each of the first strand and the second strand of the first non-complementary region is connected to one strand of the first stem; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence, wherein each of the first strand and the second strand of the second non-complementary region is connected to one strand of the second stem; and wherein the first non-complementary region and the second non-complementary region are connected to each
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide); and wherein the second strand of the first non-complementary region is fused to the second cleavable nucleotides (such
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; and wherein the first single-strand
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; and wherein the first single-strand
  • each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide (such as a uracil nucleotide).
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence.
  • the first transposase recognition site and the second transposase recognition site have different sequences.
  • the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker; wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker; wherein the synthetic transposon further
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first strand of the first non-complementary region is fused to the first strand
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the second strand of the first non-complementary region is fused to the second
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused to the first strand of
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fuse
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double- stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fuse
  • each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide (such as a uracil nucleotide).
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences.
  • the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence.
  • the first molecular barcode and the second molecular barcode are double-stranded.
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • a synthetic transposon comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complementary region are fuse
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • synthetic transposon fragments corresponding to the first fragment or second fragment of any one of the synthetic transposons described herein. Synthetic transposon fragments that are not connected to each other may be used to fragment a target nucleic acid and to allow amplification of the resulting fragments by PCR using primers corresponding to the first and second adapters.
  • In some embodiments, there is provided a synthetic transposon fragment comprising a stem and a non-complementary region comprising a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a blunt end.
  • a synthetic transposon fragment comprising a stem and a non-complementary region comprising a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a hairpin structure on the end distal from the non-complementary region.
  • a synthetic transposon fragment comprising a stem, a non-complementary region, and a molecular barcode disposed between the stem and the non-complementary region, wherein the non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a blunt end.
  • a synthetic transposon fragment comprising a stem, a non-complementary region, and a molecular barcode disposed between the stem and the non-complementary region, wherein the non-complementary region comprises a first strand comprising a first adapter and a second strand comprising a second adapter, wherein the stem has a hairpin structure on the end distal from the non-complementary region.
  • compositions comprising any one of the synthetic transposons described herein.
  • composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
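The linkage through cleavable nucleotides described in the embodiments above can be illustrated with a short in silico sketch. It assumes that cleavage removes each uracil and breaks the strand at that position; the function name and both example sequences are hypothetical and are not taken from the specification.

```python
def cleave_at_uracil(strand: str) -> list[str]:
    """Split a strand at every 'U', dropping the cleavable nucleotide itself.

    Assumption: cleavage excises the uracil and leaves the flanking
    pieces as separate strands. Empty pieces are discarded.
    """
    return [piece for piece in strand.upper().split("U") if piece]


# Hypothetical fused strand: one non-complementary-region strand, a single
# uracil, then the corresponding strand of the other fragment.
fused = "ACGTACGT" + "U" + "TTGCAAGG"
fragments = cleave_at_uracil(fused)
print(fragments)
```

In this toy model the two fragments stay connected as one string until the cleavage step, which mirrors the idea that the fragments of a synthetic transposon remain linked until the cleavable nucleotides are removed.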
  • composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complement
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode have the same barcode sequence. In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first non-complementary region and the second non-complementary
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiment
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the second strand of the first non-complementary region is fused
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiment
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand of the first non-complementary region is fused
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complement
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME). In some embodiments, the first molecular barcode and the second molecular barcode are double-stranded. In some embodiment
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complement
  • each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide (such as a uracil nucleotide).
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences.
  • the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode are double-stranded.
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • composition comprising a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; wherein the first strand and the second strand of the first non-complement
  • the first stem or the second stem comprises a terminal hairpin structure. In some embodiments, the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon. In some embodiments, the synthetic transposon comprises one or more modified nucleotides. In some embodiments, the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences. In some embodiments, the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first molecular barcode and the second molecular barcode are double-stranded. In some embodiments, the first molecular barcode or the second molecular barcode comprises a single-stranded region. In some embodiments, the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
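The barcode embodiments above call for at least about 5 randomly or degenerately designed nucleotides per synthetic transposon. As a rough illustration, the sketch below counts how many distinct barcodes a given design can encode and draws one barcode at random. It assumes that degenerate positions are written with standard IUPAC nucleotide codes; the helper names `barcode_space` and `draw_barcode` and the example designs are hypothetical.

```python
import random

# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def barcode_space(design: str) -> int:
    """Number of distinct sequences a degenerate design can produce."""
    size = 1
    for code in design.upper():
        size *= len(IUPAC[code])
    return size


def draw_barcode(design: str, rng: random.Random) -> str:
    """Draw one concrete barcode by picking an allowed base per position."""
    return "".join(rng.choice(IUPAC[code]) for code in design.upper())


fully_random = "NNNNN"           # five fully random positions
partly_degenerate = "NNRYN"      # hypothetical mixed design
print(barcode_space(fully_random))       # 4**5 distinct barcodes
print(barcode_space(partly_degenerate))  # 4*4*2*2*4 distinct barcodes
print(draw_barcode(fully_random, random.Random(0)))
```

Five fully random positions give 4^5 = 1024 possible barcodes, so longer or less constrained designs are needed when the number of synthetic transposons in a composition approaches that figure and each is expected to carry a different barcode.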
  • transposase can form a functional complex (i.e., a transposome) with one or more transposase recognition sites, and is capable of catalyzing a transposition reaction.
  • a complex comprising a synthetic transposon and a transposase, wherein the synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • the transposase is a dimeric transposase.
  • the transposase is Tn5 transposase, such as a hyperactive Tn5 transposase, for example, EZ-Tn5™.
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence.
  • the first transposase recognition site and the second transposase recognition site have different sequences.
  • the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • composition comprising a plurality of complexes each comprising a synthetic transposon and a transposase, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first copy of a molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and the second stem comprises a second transposase recognition site and a second copy of the molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence
  • the transposase is Tn5 transposase, such as a hyperactive Tn5 transposase, for example, EZ-Tn5™.
  • the first stem or the second stem comprises a terminal hairpin structure.
  • the first stem and the second stem comprise blunt ends.
  • the synthetic transposon is a DNA transposon.
  • the synthetic transposon comprises one or more modified nucleotides.
  • the first transposase recognition site and the second transposase recognition site have the same sequence. In some embodiments, the first transposase recognition site and the second transposase recognition site have different sequences.
  • the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the first molecular barcode and the second molecular barcode have the same barcode sequence.
  • each synthetic transposon has a different barcode sequence.
  • the first molecular barcode and the second molecular barcode are double-stranded.
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
  • the complexes can be prepared by mixing the plurality of synthetic transposons and the transposase.
  • the synthetic transposons and the transposase are incubated for at least about any one of 1 minute, 5 minutes, 10 minutes, 30 minutes, 1 hour or more to form the complexes.
  • the synthetic transposons described herein are nucleic acids containing two synthetic transposon fragments. Unless described otherwise, all elements of the synthetic transposons, including fragments, stems, non-complementary regions, transposase recognition sites, molecular barcodes, adaptors, strands, stuff sequences, bridge nucleic acids, etc., are nucleic acids.
  • the two synthetic transposon fragments are arranged in the same orientation with respect to each other, i.e., the fragments are connected to each other via direct or indirect interaction between the 3' end of one fragment and the 5' end of the other fragment on one strand or both strands. Each fragment may be fully double-stranded, partially single-stranded, or a hairpin. Each fragment has two ends. The two fragments are connected to each other via the non-complementary regions disposed at one end of each fragment.
  • each fragment contains a stem comprising a transposase recognition site.
  • "Stem" as used herein refers to a nucleic acid fragment having extensive fully complementary regions. Each stem typically has two strands that can be separate from each other, or connected to each other on one end via a loop to form a hairpin structure. With the exception of stems having hairpin structures on the ends, the ends of the stems are fully complementary and double-stranded. Stems with hairpin structures on the ends have the hairpin structures connected to fully complementary and double-stranded regions.
  • the stems may have a small single-stranded region no more than about any of 20, 15, 10, or 5 nucleotides long, or have an internal non-complementary region of no more than about any of 15, 10, 8, 5, or 2 nucleotides long.
  • each stem has two nucleic acid strands that are fully complementary to each other.
  • one strand contains a single-stranded gap, for example, in the molecular barcode region.
  • One end of the stem (referred to herein as the "proximal end") is fused to the non-complementary region.
  • the other end of the stem (referred to herein as the "distal end") can be a blunt end, or a hairpin.
  • the distal end(s) of one or both stems comprise nucleotides flanking the transposase recognition sites.
  • the stem further comprises a molecular barcode placed between the transposase recognition site and the non-complementary region.
  • the stem further comprises one or more stuff sequences, which are nucleic acids having pre-determined (also referred to as "fixed") sequences. The stuff sequences may be placed between the end of the stem and the transposase recognition site, between the transposase recognition site and the molecular barcode, between the transposase recognition site and the non-complementary region, and/or between the molecular barcode and the non-complementary region.
  • the stuff sequences may provide priming sites, balance G/C contents, and/or minimize secondary structures that facilitate preparation of the synthetic transposons. Additionally, stuff sequences may be chosen to complement the molecular barcodes and the non-complementary regions to allow enough space and flexibility in the synthetic transposon to facilitate binding of the transposase to the transposase recognition sites. The stuff sequences can also facilitate data analysis steps (such as for easy alignment and clustering of sequencing reads).
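As a rough illustration of what "balancing G/C contents" could mean in practice, the following sketch computes the G/C fraction of candidate stuff sequences and picks the one closest to a target. The function names, the 0.5 target, and the candidate sequences are assumptions made for illustration only; none of them come from this disclosure.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def pick_most_balanced(candidates, target=0.5):
    """Return the candidate whose G/C fraction is closest to the target.

    The default target of 0.5 is an illustrative assumption, not a value
    stated in the text.
    """
    return min(candidates, key=lambda s: abs(gc_fraction(s) - target))


# Hypothetical candidate stuff sequences.
candidates = ["AAAATTTT", "ATGCATGC", "GGGGCCCC"]
best = pick_most_balanced(candidates)
```

A real design would also have to screen for secondary structure and priming-site requirements, which this sketch does not attempt.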
  • one or more of the 5' ends (also referred to herein as 5' termini) of the polynucleotide strands in the synthetic transposons are phosphorylated, or the 5' terminal nucleotide has a 5' phosphate group.
  • Phosphorylated 5' ends facilitate ligation to other nucleic acids, such as adapters, extended, or gap-filled nucleic acid strands (e.g., for nick-sealing).
  • the 5' terminus of the distal end of the first stem and/or the second stem is phosphorylated.
  • the first stem or the second stem comprises a single-stranded region
  • the first molecular barcode or the second molecular barcode comprises a single-stranded region or is single-stranded
  • the 5' terminus adjacent to the single-stranded region is phosphorylated.
  • one or more of the 5' ends of the polynucleotide strands in the synthetic transposons are unphosphorylated, for example, the 5' terminal nucleotide has a 5' free hydroxyl group. Synthetic transposons having 5' hydroxyl ends may be phosphorylated in the library construction steps to enable ligation to other nucleic acids or nick-sealing.
  • the non-complementary regions of the synthetic transposons allow processing, efficient amplification, and haplotyping of a target nucleic acid inserted with the synthetic transposon.
  • Each non-complementary region comprises two non-complementary strands of nucleic acids.
  • Each of the non-complementary strands in the non-complementary region is connected to one strand of the corresponding stem region.
  • the two strands of a non-complementary region do not hybridize to each other at normal pH and ionic conditions (such as pH 7 and 150 mM salt).
  • the two strands of a non-complementary region have no more than about any of 60%, 50%, 40%, 30%, 20%, 10%, 5%, or less sequence homology.
  • the two strands of a non-complementary region have no more than about any of 5, 4, 3 or 2 consecutive nucleotides that are complementary to each other. In some embodiments, each strand of a non-complementary region does not form any significant secondary structure.
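One way to check a candidate pair of strands against the "consecutive complementary nucleotides" limit is sketched below. It assumes both strands are written 5' to 3' in the DNA alphabet and that only antiparallel Watson-Crick pairing matters, so the longest stretch that could pair equals the longest common substring between one strand and the reverse complement of the other. The function names and example sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_complementary_run(strand_a: str, strand_b: str) -> int:
    """Length of the longest stretch of strand_a that can base-pair,
    in antiparallel orientation, with a stretch of strand_b.

    Implemented as the longest common substring between strand_a and
    the reverse complement of strand_b (row-by-row dynamic programming).
    """
    a = strand_a.upper()
    rc_b = reverse_complement(strand_b)
    best = 0
    prev = [0] * (len(rc_b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(rc_b) + 1)
        for j in range(1, len(rc_b) + 1):
            if a[i - 1] == rc_b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def is_non_complementary(strand_a: str, strand_b: str, max_run: int = 5) -> bool:
    """True if no more than max_run consecutive nucleotides can pair."""
    return longest_complementary_run(strand_a, strand_b) <= max_run
```

This sketch ignores mismatched or bulged duplexes and intramolecular secondary structure, both of which a thermodynamic tool would be needed to assess.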
  • Each strand of a non-complementary region comprises an adapter sequence (also referred to herein as an "adapter").
  • the adapter sequences serve as priming sites to allow amplification of a nucleic acid fragment inserted with the synthetic transposon.
  • the two non-complementary regions in a synthetic transposon are identical, but are placed in opposite orientations.
  • each non-complementary region comprises a first strand comprising an adapter sequence F, and a second strand comprising an adapter sequence R, and F of the first non-complementary region is connected to R of the second non-complementary region, and/or R of the first non-complementary region is connected to F of the second non-complementary region.
  • a pair of primers may be designed to comprise the sequence of F or R, or to comprise the complementary sequence of F or R for use in amplification of a nucleic acid fragment comprising the non-complementary regions inserted at both ends.
  • the adapter sequences may be of any suitable length, for example, at least about any of 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, or more nucleotides long.
  • the two non-complementary regions have different sets of adapter sequences.
  • the two non-complementary regions have the same set of adapter sequences, but comprise different stuff sequences on one or both strands.
  • Each non-complementary region may comprise two separate strands, or a single fused strand comprising the first strand and the second strand.
  • each non-complementary region is V-shaped, comprising a first strand and a second strand.
  • the first strand and the second strand of each non-complementary region are fused to each other via a single-stranded linker.
  • the first non-complementary region comprises a first strand comprising a first adapter sequence, a second strand comprising a second adapter sequence, and a first single-stranded linker disposed between the first strand and the second strand; and the second non-complementary region comprises a first strand comprising the second adapter sequence, a second strand comprising the first adapter sequence, and a second single-stranded linker disposed between the first strand and the second strand.
  • the first single-stranded linker can hybridize to the second single-stranded linker.
  • the first single-stranded linker is fully complementary to the second single-stranded linker.
  • the first single-stranded linker is complementary to the second single-stranded linker except for one or more cleavable nucleotides.
  • the clustering primer and sequencing primer sequences can be included in the non-complementary strands to allow PCR-free direct next generation sequencing.
  • the non-complementary regions are connected to each other either covalently or non-covalently (such as via hybridization of two sequences).
  • the first strand of the first non-complementary region can be fused to the first strand of the second non-complementary region.
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region.
  • the single-stranded linker of the first non-complementary region is hybridized to the single-stranded linker of the second non-complementary region.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first sequence that hybridizes to the first single-stranded linker of the first non-complementary region, and a second sequence that hybridizes to the second single-stranded linker of the second non-complementary region, thereby, the two non-complementary regions are connected to each other via the hybridization of the bridge nucleic acid to the first single-stranded linker and the second single-stranded linker.
  • the synthetic transposon comprises one or more cleavable nucleotides at the junction(s) between the first non-complementary region and the second non-complementary region, in the bridge nucleic acid, or in the single-stranded linker of the first non-complementary region and/or the second non-complementary region. Cleavage of the one or more cleavable nucleotides results in separation of the first non-complementary region from the second non-complementary region.
  • the first strand of the first non-complementary region and the first strand of the second non-complementary region are fused to each other via one or more cleavable nucleotides
  • the second strand of the first non-complementary region and the second strand of the second non-complementary region are fused to each other via one or more cleavable nucleotides.
  • the single-stranded linker comprises one or more cleavable nucleotides.
  • the bridge nucleic acid comprises one or more cleavable nucleotides in the sequences that are complementary to the single-stranded linkers.
  • the one or more cleavable nucleotides may be one or more uracil nucleotides, other modified nucleobases with specific nucleases that recognize such nucleobases (such as 8-oxoguanine), a restriction site, or RNA nucleotides wherein the synthetic transposon is a DNA transposon.
  • Uracil DNA glycosylase combined with a DNA glycosylase lyase can be used to cleave a uracil deoxyribonucleotide; and RNA nucleotides can be cleaved by an RNA endonuclease.
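The effect of cleaving at uracil positions can be pictured with a toy string model. The sketch below assumes a single strand written 5' to 3' in which deoxyuridine is denoted by the letter U, and it treats cleavage as removing each run of U and splitting the strand there. The function name and the example sequence are hypothetical, and the model says nothing about the actual enzymology or the resulting end chemistry.

```python
import re


def cleave_at_uracil(strand: str) -> list:
    """Split a strand at every run of U, dropping the U residues.

    Toy model only: each excised uracil site is treated as a clean
    strand break.
    """
    return [frag for frag in re.split("U+", strand.upper()) if frag]


# Hypothetical fused strand: one adapter, a UU junction, then the other adapter.
fused = "ACGTACGT" + "UU" + "TTGACCAA"
pieces = cleave_at_uracil(fused)
```

Under this model a UU junction between two non-complementary regions yields two separate strands, one for each region.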
  • the synthetic transposon further comprises a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region, and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region.
  • the first molecular barcode and the second molecular barcode may have the same sequence, or different sequences.
  • the first molecular barcode has the same sequence as the second molecular barcode, which allows matching of the molecular barcode sequences from sequencing reads to extract contiguity information in a target nucleic acid inserted with the synthetic transposon.
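The barcode-matching idea can be sketched as follows, assuming each sequenced fragment has already been reduced to the pair of barcodes read at its two ends. Because both barcodes of one synthetic transposon are identical, two fragments that share a barcode are inferred to have flanked the same insertion site. The data layout, function name, and example barcodes are illustrative assumptions, not part of this disclosure.

```python
from collections import defaultdict


def link_fragments_by_barcode(fragments: dict) -> list:
    """Pair up fragments that share an end barcode.

    fragments maps a fragment identifier to a tuple of the barcodes
    observed at its two ends. Returns a sorted list of fragment pairs
    inferred to be neighbors in the original target nucleic acid.
    """
    by_barcode = defaultdict(set)
    for frag_id, end_barcodes in fragments.items():
        for barcode in end_barcodes:
            by_barcode[barcode].add(frag_id)
    # A barcode seen on exactly two distinct fragments links them.
    return sorted(
        tuple(sorted(ids)) for ids in by_barcode.values() if len(ids) == 2
    )


# Hypothetical fragments with hypothetical end barcodes.
fragments = {
    "f1": ("AAAA", "CCCC"),
    "f2": ("CCCC", "GGGG"),
    "f3": ("GGGG", "TTTT"),
}
neighbors = link_fragments_by_barcode(fragments)
```

Real data would need tolerance for sequencing errors in the barcodes and handling of barcodes observed on more than two fragments, which this sketch simply ignores.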
  • Synthetic transposons having no molecular barcodes or two different molecular barcodes can be used for preparing libraries of template nucleic acids useful for a variety of sequencing applications (except for haplotyping) in the same way as synthetic transposons having molecular barcodes.
  • the molecular barcode comprises a plurality of nucleotides that are randomly or degenerately designed, thereby yielding a highly diverse sequence that can be used to identify each individual synthetic transposon, and the target nucleic acid or fragment thereof that the synthetic transposon inserts into.
  • the molecular barcode is double-stranded.
  • the molecular barcode comprises a single-stranded region, or is single-stranded.
  • the composition may comprise any number of synthetic transposons having different molecular barcodes.
  • the composition comprises a single copy of each synthetic transposon having a different molecular barcode.
  • the composition comprises more than one copy of each synthetic transposon having a different molecular barcode.
  • the plurality of synthetic transposons have at least about any one of 10^4, 10^5, 10^6, 10^7, 10^8, 10^9, 10^10, 10^11, 10^12, 10^13, 10^14, 10^15, 10^16, 10^17, or more different molecular barcodes.
  • the plurality of synthetic transposons have at least about any one of 10^4, 10^5, 10^6, 10^7, 10^8, 10^9, 10^10, 10^11, 10^12, 10^13, 10^14, 10^15, 10^16, 10^17, or more sources of clonal molecular barcodes.
  • the nucleotide can be a ribonucleotide, or a deoxyribonucleotide.
  • the molecular barcode can thus be used to identify a particular fragment of a target nucleic acid that the synthetic transposon carrying the molecular barcode inserts into.
  • the molecular barcode may further comprise nucleotides having the same identity for all synthetic transposons (i.e., "fixed" or specifically designed nucleotides).
  • the additional fixed nucleotides or sequences can be placed on either side of the randomly or degenerately designed sequence or interspersed among the randomly or degenerately designed nucleotides.
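A barcode that mixes fixed and degenerate positions can be written as a pattern in the standard IUPAC nucleotide codes, where N stands for any base and letters such as R or Y stand for two-base choices. The sketch below counts how many distinct barcodes a pattern allows and draws one at random. The pattern strings and function names are hypothetical examples and do not correspond to any sequence in this disclosure.

```python
import random

# Standard IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_barcodes(pattern: str) -> int:
    """Number of distinct sequences matching an IUPAC pattern."""
    total = 1
    for symbol in pattern.upper():
        total *= len(IUPAC[symbol])
    return total


def sample_barcode(pattern: str, rng: random.Random) -> str:
    """Draw one concrete sequence matching the pattern."""
    return "".join(rng.choice(IUPAC[symbol]) for symbol in pattern.upper())


# Hypothetical design: fixed ACGT core flanked by random positions.
pattern = "NNNNACGTNNNN"
diversity = count_barcodes(pattern)
example = sample_barcode(pattern, random.Random(0))
```

Interspersing fixed bases does not change the count, which depends only on the number and type of degenerate positions.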
  • the molecular barcode comprises double-stranded regions. In some embodiments, the molecular barcode is double-stranded. In some embodiments, the molecular barcode is single-stranded. In some embodiments, the molecular barcode is partially single-stranded (i.e., partially double-stranded). In some embodiments, the molecular barcode has a single-stranded region having at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 50 or more nucleotides.
  • the randomly and/or degenerately designed nucleotides in the molecular barcode are in a single-stranded region of the molecular barcode.
  • the double-stranded region of the at least partially single-stranded molecular barcode comprises fixed nucleotides.
  • the double-stranded region of the at least partially single-stranded molecular barcode consists essentially of fixed nucleotides.
  • the molecular barcode comprises at least about any one of 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100 or more consecutive nucleotides. In some embodiments, the molecular barcode comprises at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40 or more randomly designed nucleotides. In some embodiments, the molecular barcode comprises at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40 or more degenerately designed nucleotides.
  • the molecular barcode comprises at least about any one of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40 or more fixed (i.e., specifically designed) nucleotides.
  • the molecular barcode is a mixture of randomly designed, degenerately designed or fixed nucleotides. The number of randomly and/or degenerately designed nucleotides in the molecular barcode depends on the actual need.
  • a long target nucleic acid (such as a chromosome) may need a plurality of synthetic transposons with higher diversity, i.e., a large number of randomly and/or degenerately designed nucleotides, to provide enough distinct molecular barcodes to tag the large number of segments of the target nucleic acid in order to extract contiguity information.
  • a short target nucleic acid such as a plasmid of a few kilobases long, may only need a small number of randomly and/or degenerately designed nucleotides to provide enough distinct molecular barcodes for tagging.
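The relation between target size and barcode length follows from simple counting: n fully random positions give 4^n possible barcodes. The sketch below finds the smallest n that reaches a required number of distinct barcodes. The function name is hypothetical, and the calculation is only a lower bound, since it assumes every position is a fully random A/C/G/T choice and ignores the extra margin needed to keep two insertions from receiving the same barcode by chance.

```python
def min_random_positions(required_barcodes: int) -> int:
    """Smallest n such that 4**n >= required_barcodes.

    Uses integer arithmetic only, so large requirements are handled
    without floating-point rounding.
    """
    if required_barcodes < 1:
        raise ValueError("required_barcodes must be at least 1")
    n = 0
    capacity = 1
    while capacity < required_barcodes:
        capacity *= 4
        n += 1
    return n


# Illustrative requirements, chosen arbitrarily.
small_need = min_random_positions(10 ** 4)
large_need = min_random_positions(10 ** 9)
```

In practice a design would use more random positions than this minimum so that the pool of barcodes is much larger than the number of insertion events.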
  • duplicated sequences endogenous to the target nucleic acid flanking the insertion sites of the synthetic transposons may be used in combination with the molecular barcodes in the synthetic transposons to provide contiguity information for the target nucleic acids. Having both randomly designed and specific nucleotides may minimize potential undesired non-specific interactions during the process of synthesizing the synthetic transposons.
  • Exemplary synthetic transposons and fragments are shown in FIGs. 2A-2O.
  • FIGs. 2A-2D show exemplary synthetic transposon fragments each comprising a single transposase recognition site.
  • FIG. 2E shows an exemplary synthetic transposon with no molecular barcodes.
  • FIGs. 2F-2O show exemplary synthetic transposons having two molecular barcodes, and various structures for the non-complementary regions and distal ends of the stems.
  • any of the exemplary synthetic transposons of FIGs. 2F-2O can be modified by replacing the molecular barcodes with stuff sequences or other sequences needed to make corresponding exemplary synthetic transposons that do not have molecular barcodes.
  • a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R, wherein the distal end of the stem is a blunt end.
  • Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having no molecular barcodes.
  • a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a single-stranded linker disposed between the first strand and the second strand, wherein the distal end of the stem is a blunt end.
  • Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having no molecular barcodes.
  • a first synthetic transposon fragment of FIG. 2B having a first single-stranded linker and a second synthetic transposon fragment of FIG. 2B having a second single-stranded linker that can hybridize to the first single-stranded linker can be mixed together to provide a synthetic transposon.
  • a first synthetic transposon fragment having a first single-stranded linker and a second synthetic transposon fragment having a second single-stranded linker can be mixed together with a bridge nucleic acid that can hybridize to both the first single-stranded linker and the second single-stranded linker to provide a synthetic transposon. If the transposon fragments in Fig.
  • stem-loop common sequences e.g., containing sequencing primer
  • stem-loop common sequences e.g., containing sequencing primer
  • repairing e.g., extension and ligation
  • a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site, a molecular barcode, and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R, wherein the distal end of the stem is a blunt end.
  • Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having two molecular barcodes.
  • FIG. 8C shows an exemplary synthetic transposon fragment of FIG. 2C.
  • a synthetic transposon fragment comprising: (a) a stem comprising from the distal end to the proximal end: a first transposase recognition site, a molecular barcode, and an optional stuff sequence, and (b) a non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a single-stranded linker disposed between the first strand and the second strand, wherein the distal end of the stem is a blunt end.
  • Two of the synthetic transposon fragments having the same sequences or different sequences can be linked together via the non-complementary regions to provide a synthetic transposon having two molecular barcodes.
  • a first synthetic transposon fragment of FIG. 2B having a first single-stranded linker and a second synthetic transposon fragment of FIG. 2B having a second single-stranded linker that can hybridize to the first single-stranded linker can be mixed together to provide a synthetic transposon of FIG. 2K.
  • a first synthetic transposon fragment having a first single-stranded linker and a second synthetic transposon fragment having a second single-stranded linker can be mixed together with a bridge nucleic acid that can hybridize to both the first single-stranded linker and the second single-stranded linker to provide a synthetic transposon of FIG. 2M.
  • FIG. 8D shows an exemplary synthetic transposon fragment of FIG. 2D.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleav
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • FIG. 8B shows an exemplary synthetic transposon of FIG. 2E.
  • FIG. 8E shows primers that can be used to amplify nucleic acid fragments obtained from insertion of a plurality of the synthetic transposon in a target nucleic acid followed by enzymatic cleavage of the UU dinucleotide that separates the two non-complementary regions in each synthetic transposon.
  • the primers in FIG. 8E contain sequences from sequencing primers of the Illumina sequencing platform that allow direct sequencing of the amplified nucleic acid fragments on an Illumina instrument. Randomly designed index tag sequences can be included in one primer to serve as a sample barcode, which allows multiple samples to be sequenced at the same time and subsequently de-multiplexed during data analysis.
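De-multiplexing by sample barcode can be sketched as a simple lookup, assuming each read arrives as a pair of index sequence and insert sequence and that index sequences must match a known sample index exactly. The function name, the "undetermined" bucket, and the index and sample names are all hypothetical and are not taken from this disclosure or from any sequencing vendor's software.

```python
def demultiplex(reads, index_to_sample):
    """Assign reads to samples by exact match on the index sequence.

    reads: iterable of (index_sequence, read_sequence) pairs.
    index_to_sample: dict mapping index sequence to sample name.
    Reads whose index is not recognized go to an "undetermined" bucket.
    """
    buckets = {sample: [] for sample in index_to_sample.values()}
    buckets["undetermined"] = []
    for index_seq, read_seq in reads:
        sample = index_to_sample.get(index_seq, "undetermined")
        buckets[sample].append(read_seq)
    return buckets


# Hypothetical sample indexes and reads.
index_to_sample = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}
reads = [
    ("ACGTAC", "GATTACA"),
    ("TGCATG", "CCGGAAT"),
    ("ACGTAC", "TTAGGCA"),
    ("NNNNNN", "AAAAAAA"),
]
by_sample = demultiplex(reads, index_to_sample)
```

Production pipelines typically allow one or more mismatches in the index read, which would replace the exact dictionary lookup used here.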
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; and where
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • the 5' end neighboring the gap of missing strand in the single-stranded molecular barcode is phosphorylated.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; and
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; wherein
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • the 5' end neighboring the gap of missing strand in the single-stranded molecular barcode is phosphorylated.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal ends of the first stem and the second stem are both blunt ends; wherein the
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • FIG. 8A shows an exemplary synthetic transposon of FIG. 2I having two non-identical short molecular barcodes.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • the 5' end neighboring the gap of missing strand in the single-stranded molecular barcode is phosphorylated.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter sequence
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • the 5' end neighboring the gap of missing strand in the single-stranded molecular barcode is phosphorylated.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F, a second strand comprising an adapter sequence R, and a first single-stranded linker disposed between the first strand and the second strand; (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R, a second strand comprising the adapter sequence
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second single-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal end of the first stem is a blunt end, and the distal end
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • the 5' end neighboring the gap of missing strand in the single-stranded molecular barcode is phosphorylated.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • a synthetic transposon comprising: (1) a first fragment comprising: (a) a first stem comprising from the distal end to the proximal end: a first transposase recognition site, a first double-stranded molecular barcode, and an optional stuff sequence, and (b) a first non-complementary region comprising a first strand comprising an adapter sequence F and a second strand comprising an adapter sequence R; and (2) a second fragment comprising: (a) a second stem comprising from the distal end to the proximal end: a second transposase recognition site, a second double-stranded molecular barcode, and an optional stuff sequence, and (b) a second non-complementary region comprising a first strand comprising the adapter sequence R and a second strand comprising the adapter sequence F; wherein the distal end of the first stem is a blunt end, and the distal end of
  • the two transposase recognition sites can have the same or different sequences.
  • the two molecular barcodes may have the same or different sequences.
  • Primers matching the F and R sequences can be used to amplify target nucleic acid fragments after insertion of the synthetic transposon into a target nucleic acid.
  • the one or more cleavable nucleotides can be cleaved to separate the two non-complementary regions to allow downstream manipulation of a target nucleic acid inserted with the synthetic transposon or a plurality of the synthetic transposons.
  • the synthetic transposons provided herein can be prepared by a variety of methods.
  • the synthetic transposons are prepared by direct synthesis, including chemical synthesis. Such methods are well known in the art, e.g., solid phase synthesis using phosphoramidite precursors such as those derived from protected 2'-deoxynucleosides, ribonucleosides, or nucleoside analogues.
  • Synthetic transposons comprising modified nucleotides may also be chemically synthesized by including modified nucleotide building blocks in the oligo synthesis steps.
  • an unmodified synthetic transposon may first be synthesized, and the 5-methyl group may be added to the target dC nucleobase using a CpG methyltransferase.
  • Long oligos of up to 180-250 nucleotides (nt), as required in this application, can be obtained commercially from multiple sources such as IDT (Ultramers of up to 200 nt), Sigma-Aldrich (up to 180 nt), or Biosynthesis (Ubermers regularly up to 250 nt, and possibly as long as 400 nt).
  • Incorporation of modified bases such as LNA or PNA in some common sequences allows the use of shorter sequences with the same binding stability.
  • Modified bases such as uracil can be incorporated easily to allow cleavage of the strand before library amplification. Incorporation of phosphorothioate bonds, for example, can help to minimize degradation of transposons by exonucleases or endonucleases during storage.
  • the synthetic transposons are prepared by annealing two oligos, which are then subjected to extension by polymerases to provide the full product.
  • Synthetic transposons having no molecular barcodes or having two different molecular barcodes can be prepared by such methods.
  • FIG. 8A shows a method of preparing an exemplary synthetic transposon of FIG. 2I having two different molecular barcodes.
  • FIG. 8B shows a method of preparing an exemplary synthetic transposon of FIG. 2E having no molecular barcodes.
  • Synthetic transposons with one or two hairpin structures can be conveniently prepared using a single long strand of oligonucleotide with complementary regions that hybridize to provide the synthetic transposons.
  • the synthetic transposons are PCR amplified with common primers, such as primers that hybridize to the stuff sequences to prepare the synthetic transposons.
  • the synthetic transposons are prepared by linking the non-complementary regions of two synthetic transposon fragments.
  • the synthetic transposon fragment is prepared by chemical synthesis.
  • the synthetic transposon fragment is prepared by extending chemically synthesized
  • FIG. 8C shows a method of preparing an exemplary synthetic transposon fragment of FIG. 2C having a molecular barcode.
  • FIG. 8D shows a method of preparing an exemplary synthetic transposon fragment of FIG. 2B having no molecular barcode.
  • Synthetic transposons having two molecular barcodes with the same sequences comprising randomly or degenerately designed nucleotides are prepared using a combination of chemical synthesis and extension by polymerase (also referred to as "primer extension") to obtain double-stranded molecular barcodes, and to ensure that the two molecular barcodes have the same sequences.
  • the synthetic transposons having identical paired molecular barcodes are prepared using starting oligos containing only one molecular barcode, followed by a first intramolecular or intermolecular priming to replicate the molecular barcode.
  • a second intramolecular or intermolecular priming is used to displace the replicated molecular barcode sequence.
  • FIGs. 3-7 illustrate exemplary methods for preparing various synthetic transposons having two identical molecular barcodes that contain randomly or degenerately designed nucleotides.
  • a first synthesized oligo (5'-301+302+303+304+305+U+306+303rc-3') is provided, which comprises a single-stranded molecular barcode region (302) having randomly or degenerately designed nucleotides.
  • the first synthesized oligo is extended by a DNA polymerase, which is then denatured and hybridized to a second synthesized oligo comprising the adapter sequence 306 and the complementary sequence of stuff sequence 304 (i.e. 304rc).
  • the hybridized oligos are then extended by a DNA polymerase to make the first fragment of the synthetic transposon, which is subsequently annealed to a second synthesized oligo comprising the adapter sequence 305 and the stuff sequence 303, and a third synthesized oligo comprising the transposase recognition sequence 301, to provide a synthetic transposon of FIG. 2F.
  • Further extension and ligation steps to fill in the gap with a complementary sequence of the molecular barcode (302rc) provide a synthetic transposon of FIG. 2G.
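The segment layout of the first synthesized oligo described above (5'-301+302+303+304+305+U+306+303rc-3') can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the segment sequences in `SEGMENTS`, the barcode length of 12, and all function names are hypothetical placeholders, not sequences or parameters taken from the application.

```python
import random

# Watson-Crick complements; "U" pairs with "A" like "T" does.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "U": "A"}

# Hypothetical placeholder sequences for the numbered segments (assumptions).
SEGMENTS = {
    "301": "ACGTACGTACGTACGTACG",  # transposase recognition sequence (placeholder)
    "303": "GGATCCAAGC",           # stuff sequence (placeholder)
    "304": "TTGACCGTAA",           # stuff sequence (placeholder)
    "305": "CAGTCAGTCAGT",         # adapter sequence (placeholder)
    "306": "GTCAAGTCAAGT",         # adapter sequence (placeholder)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def random_barcode(length, rng):
    """Draw a barcode of randomly designed nucleotides (segment 302)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_first_oligo(segments, barcode):
    """Concatenate 5'-301+302+303+304+305+U+306+303rc-3'."""
    return (
        segments["301"]
        + barcode
        + segments["303"]
        + segments["304"]
        + segments["305"]
        + "U"  # single cleavable uracil between the two adapter sequences
        + segments["306"]
        + reverse_complement(segments["303"])
    )


rng = random.Random(0)
barcode = random_barcode(12, rng)
first_oligo = build_first_oligo(SEGMENTS, barcode)
print(len(first_oligo), first_oligo)
```

The trailing 303rc segment is what lets the 3' end fold back onto segment 303 and prime the extension that copies the barcode, which is why the sketch builds it from the same 303 placeholder rather than as an independent sequence.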
  • a first synthesized oligo (5'-401+402+403+404+405+U+406+403rc-3') is provided, which comprises a single-stranded molecular barcode region (402) having randomly or degenerately designed nucleotides.
  • the first synthesized oligo is extended by a DNA polymerase, which is then denatured and hybridized to a second synthesized oligo comprising the complementary sequence of stuff sequence 404 (i.e., 404rc), adapter sequence 406, a uracil nucleotide, adapter sequence 405, and stuff sequence 403.
  • the hybridized oligos are then extended by a DNA polymerase to make the first fragment of the synthetic transposon connected to the second non-complementary region, which is subsequently annealed to a second synthesized oligo comprising the transposase recognition sequence 401, to provide a synthetic transposon of FIG. 2H. Further extension and ligation steps to fill in the gap with a
  • a first synthesized oligo (5'-501+502+503+504+505+506+507+508+505rc-3') and a second synthesized oligo (5'-503+506+507+508+509+504rc-3') are provided, which are hybridized and extended by a DNA polymerase.
  • the 3' end of the first synthesized oligo is a reversibly blocked nucleotide.
  • the 3' end of the first synthesized oligo may first comprise a 3' phosphate group, which blocks extension of this 3' end during the first extension step.
  • Prior to the second extension step, the 3' phosphate group can be removed by T4 polynucleotide kinase (T4 PNK) treatment, thereby allowing extension from this 3' end.
  • a further round of extension followed by hybridization to a third synthesized oligo comprising the transposase recognition sequence 501 provides a synthetic transposon of FIG. 2J.
  • Single-stranded linker sequences 507 and 507rc each have one or more cleavable nucleotides.
  • a first synthesized oligo (5'-601+602+603+604+605+606+607+608+605rc-3'), a second synthesized oligo (5'-603+606+609+608+610+604rc-3'), and a third synthesized oligo (5'-607rc+611+609rc, i.e., bridge nucleic acid) are provided, which are denatured and hybridized, and then extended by a DNA polymerase.
  • the 3' end of the first synthesized oligo is a reversibly blocked nucleotide.
  • the 3' end of the first synthesized oligo may first comprise a 3' phosphate group, which blocks extension of this 3' end during the first extension step.
  • Prior to the second extension step, the 3' phosphate group can be removed by T4 polynucleotide kinase (T4 PNK) treatment, thereby allowing extension from this 3' end.
  • a further round of extension followed by hybridization to a fourth synthesized oligo comprising the transposase recognition sequence 601 provides a synthetic transposon of FIG. 2L.
  • the bridge nucleic acid may contain one or more cleavable nucleotides in the 607rc and 609rc fragments.
  • the hairpin fragment 707 and the transposase recognition site 701 each have one or more cleavable nucleotides.
  • the oligo is denatured, hybridized, and extended by DNA polymerase.
  • the one or more cleavable nucleotides in 707 and 701 are then cleaved, and the product is denatured and hybridized to a second synthesized oligo (5'-705-703-U-706-704rc-3').
  • the duplex is then extended by DNA polymerase to provide a synthetic transposon of FIG. 2N. Further extension and ligation steps to fill in the gap with a complementary sequence of the molecular barcode (702rc) provide a synthetic transposon of FIG. 2O.
  • One aspect of the present application provides a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with any one of the compositions comprising a plurality of synthetic transposons described herein, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
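As a quick check on the stated insertion frequency of at least once per about 500 bases, the arithmetic below estimates the minimum number of insertions for a target of a given length and compares it with the number of distinct barcodes available. This is a rough sketch: the barcode length of 12 and the 50 kb target in the example are assumptions chosen for illustration, and the function names are hypothetical.

```python
def min_insertions(target_length_bases, bases_per_insertion=500):
    """Lower bound on insertions at one insertion per `bases_per_insertion` bases."""
    if target_length_bases < 0 or bases_per_insertion <= 0:
        raise ValueError("lengths must be non-negative and spacing positive")
    return target_length_bases // bases_per_insertion


def barcode_space(barcode_length):
    """Number of distinct barcodes of fully random A/C/G/T nucleotides."""
    return 4 ** barcode_length


target_length = 50_000  # assumed 50 kb target, for illustration only
insertions = min_insertions(target_length)
print(insertions)                      # insertions at one per 500 bases
print(barcode_space(12) > insertions)  # assumed 12-nt barcodes outnumber insertions
```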
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with any one of the compositions comprising a plurality of synthetic transposons described herein, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) amplifying the repaired target nucleic acid to provide the library of template nucleic acids.
  • the amplifying is Whole Genome Amplification (WGA). In some embodiments, the amplifying is targeted amplification of loci of interest.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a
  • the first single-stranded linker and the second single-stranded linker each comprises one or more cleavable nucleotides.
  • the method further comprises contacting the repaired target nucleic acid with an endonuclease, such as uracil DNA glycosylase (UDG), or a mixture of UDG and DNA lyase (e.g., USER™), to cleave the one or more cleavable nucleotides.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a
  • the method further comprises treating the denatured repaired target nucleic acid with an exonuclease.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non
  • the first single-stranded linker and the second single-stranded linker each comprises one or more cleavable nucleotides.
  • the method further comprises contacting the repaired target nucleic acid with an endonuclease, such as uracil DNA glycosylase (UDG), or a mixture of UDG and DNA lyase (e.g., USER™), to cleave the one or more cleavable nucleotides.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of preparing a library of template nucleic acids comprising: (a) contacting a target nucleic acid with a composition comprising a plurality of synthetic transposons, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a barcoded target nucleic acid comprising a plurality of synthetic transposons inserted randomly or substantially randomly among the endogenous sequence of the barcoded target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand compris
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • each synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • FIGs. 9-11 show exemplary methods of preparing libraries of template nucleic acids using the synthetic transposons described herein. Additionally, synthetic transposons that do not have molecular barcodes (e.g., FIG. 2E and FIG. 8B) can be used for fragmentation and library construction. The intramolecular or intermolecular binding between the two transposase recognition sites and transposase in a transposed target nucleic acid allows the stable
  • a composition comprising a plurality of synthetic transposons of FIG. 2I each having a different barcode sequence is contacted with a target DNA and a transposase.
  • the plurality of synthetic transposons are inserted into the target DNA, resulting in single-stranded gaps surrounding the insertion sites.
  • the inserted target DNA is then treated with DNA polymerase to fill in the single-stranded gaps, and simultaneously or subsequently treated with a ligase to seal the nicks.
  • the repaired target DNA is treated with UDG (such as USER™) to cleave the uracil nucleotide, yielding fragmented template nucleic acids, which are PCR amplified with primers having adapter sequences F and R, or their reverse complements.
  • PCR amplification leads to 2 products that have different read orientations during sequencing.
  • Additional adapter sequences such as sequencing primer sequences, and/or index tags, may be introduced to each amplified nucleic acid by including the adapter sequences and index tags in the PCR primers.
  • the amplified nucleic acid library can then be sequenced using any suitable massively parallel shotgun sequencing method (such as next generation sequencing, or NGS method).
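The uracil cleavage step in the workflow above can be pictured on a single strand written as a string, where each "U" marks a cleavable nucleotide. The sketch below is a deliberately simplified model: it treats cleavage as removing the uracil and splitting the strand at that position, and it ignores the end chemistry left behind. The function name and the example strand are hypothetical.

```python
def cleave_at_uracil(strand):
    """Split a strand at every uracil, discarding the uracil itself.

    Simplified model: each 'U' is removed and the backbone is broken there;
    empty pieces (from adjacent or terminal uracils) are dropped.
    """
    return [piece for piece in strand.upper().split("U") if piece]


# Hypothetical strand with two cleavable uracils.
pieces = cleave_at_uracil("ACGTUGGCCUAT")
print(pieces)
```

A strand without uracil is returned as a single piece, which mirrors the point that only transposons carrying cleavable nucleotides are fragmented by this treatment.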
  • WGA can be performed using either random hexamers or sequences complementary to F and/or R.
  • When WGA is used in the library preparation method, separation of the non-complementary regions is not a required step, and thus, synthetic transposons that do not have modified nucleotide(s) linking the two non-complementary regions can be used.
  • a composition comprising a plurality of synthetic transposons of FIG. 2G each having a different barcode sequence is contacted with a target DNA and a transposase.
  • the plurality of synthetic transposons are inserted into the target DNA, resulting in single-stranded gaps surrounding the insertion sites.
  • the inserted target DNA is then treated with DNA polymerase to fill in the single-stranded gaps, and simultaneously or subsequently treated with a ligase to seal the nicks.
  • the repaired target DNA is treated with UDG (such as USER™) to cleave the uracil nucleotide, yielding fragmented template nucleic acids.
  • the fragmented template nucleic acids are then hybridized to an oligonucleotide comprising a first sequence that is complementary to the first adapter sequence F and a second sequence that is complementary to the second adapter sequence R.
  • the hybridized fragments are then treated with ligase to circularize the fragmented template nucleic acids.
  • the circularized template nucleic acids can be further analyzed by RCA, or by single-molecule sequencing.
  • a composition comprising a plurality of synthetic transposons of FIG. 2M each having a different barcode sequence is contacted with a target DNA and a transposase, resulting in single-stranded gaps surrounding the insertion sites.
  • the inserted target DNA is then treated with DNA polymerase to fill in the single-stranded gaps, and simultaneously or subsequently treated with a ligase to seal the nicks.
  • the repaired target DNA is then denatured, and subsequently treated with exonuclease to remove the bridge nucleic acids, thereby yielding a library of circularized template nucleic acids.
  • the library of circularized template nucleic acids can then be analyzed by RCA or single-molecule sequencing.
  • the plurality of synthetic transposons can be inserted into target nucleic acids by the transposase that binds to the transposase recognition sites of the synthetic transposons.
  • the plurality of synthetic transposons and the transposase may be pre-mixed to form a complex composition comprising a plurality of complexes each comprising a transposase bound to a synthetic transposon prior to contacting the complex composition with the target nucleic acid.
  • the plurality of synthetic transposons and the transposase are contacted with the target nucleic acids simultaneously, but as separate compositions.
  • synthetic transposons with molecular barcodes having high diversity comprising more than about any one of 5, 10, 15, 20, 25, or more randomly and/or degenerately designed nucleotides are used to ensure that each insertion site in the target nucleic acid has a different molecular barcode.
  • an excess amount of synthetic transposons is contacted with the target nucleic acid to ensure unique labeling of the sites in the target nucleic acid.
  • no more than about any one of 50%, 40%, 30%, 20%, 10%, 5%, 2%, 1%, 0.1%, 0.01%, 0.001%, 0.0001% or less of possible synthetic transposons with distinct molecular barcodes are inserted into the target nucleic acid.
  • 100 cells of human genomic DNA (about 0.6 ng) have a total of 300 × 10⁹ base pairs. When synthetic transposons each having a molecular barcode comprising 25 randomly designed nucleotides are inserted at an average distance of 150 bp, 2 × 10⁹ synthetic transposons are inserted out of about 10¹⁵ possible distinct synthetic transposons available.
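  • The arithmetic in this example can be checked with a short calculation. The sketch below is illustrative only: the function name `barcode_usage` and its parameters are assumptions introduced here, and it assumes that each barcode position can independently be any of the four bases.

```python
def barcode_usage(total_bp, spacing_bp, barcode_len):
    """Return (insertions, possible barcodes, fraction of barcode space used).

    Hypothetical helper: assumes one insertion per `spacing_bp` on average
    and fully random barcodes over the alphabet A, C, G, T.
    """
    insertions = total_bp // spacing_bp
    possible = 4 ** barcode_len
    return insertions, possible, insertions / possible


# Values taken from the example: 300 x 10^9 bp, 150-bp spacing, 25-nt barcode.
insertions, possible, fraction = barcode_usage(300 * 10**9, 150, 25)
print(insertions)         # 2000000000
print(possible)           # 1125899906842624
print(f"{fraction:.2e}")  # 1.78e-06
```

  • Under these assumptions, fewer than 0.001% of the possible barcodes are consumed, so two insertion sites are unlikely to share a barcode.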
  • Using the transposase-duplicated sequences (e.g., the 9-nt duplicated sequence of Tn5 transposase) together with the molecular barcode sequences, it would be easy to differentiate and align sequencing reads derived from neighboring fragments in a single target molecule.
  • the term "at least a portion" or grammatical equivalents thereof can refer to any fraction of a whole amount.
  • "at least a portion" can refer to at least about any one of 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, 99.9% or 100% of a whole amount.
  • at least about any one of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 99% or more of the plurality of synthetic transposons is inserted in the target nucleic acid.
  • the frequency (i.e., density) of the synthetic transposons inserted in the target nucleic acid can be controlled in various ways, including adjusting the contacting time and temperature, the amount of synthetic transposons, the type and amount of the transposase, and the composition of the buffer.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about any one of 10 kb, 5 kb, 4 kb, 3 kb, 2 kb, 1 kb, 900 bases, 800 bases, 700 bases, 600 bases, 500 bases, 400 bases, 300 bases, 250 bases, 200 bases, 150 bases, 100 bases, or fewer.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of once per any one of about 100 bases to about 200 bases, about 150 bases to about 250 bases, about 250 bases to about 500 bases, about 500 bases to about 750 bases, about 750 bases to about 1 kb, about 1 kb to about 5 kb, about 5 kb to about 10 kb, about 100 bases to about 1 kb, or about 100 bases to about 10 kb.
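  • To illustrate how an insertion frequency translates into fragment sizes, the following sketch places insertion sites uniformly at random on a linear target. It is a simplified model, not part of the methods described herein: it assumes independent, unbiased insertion, and the function name and parameters are hypothetical.

```python
import random


def simulate_fragment_lengths(target_len, bases_per_insertion, seed=0):
    """Return fragment lengths after random insertions on a linear target.

    Assumption: sites are drawn uniformly without replacement, and each
    insertion later separates the target into two pieces at that site.
    """
    rng = random.Random(seed)
    n_insertions = target_len // bases_per_insertion
    sites = sorted(rng.sample(range(1, target_len), n_insertions))
    bounds = [0] + sites + [target_len]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# A 100-kb target with, on average, one insertion per 500 bases.
lengths = simulate_fragment_lengths(100_000, 500)
print(len(lengths))                 # 201 fragments from 200 insertions
print(sum(lengths) / len(lengths))  # mean length, about 497.5 bases
```

  • Individual fragment lengths vary widely around the mean in this model, which is one reason the contacting conditions listed above may need to be tuned empirically.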
  • synthetic transposons described herein may be particularly useful and effective for preparing sequencing libraries for whole genome sequencing requiring high quality (for example, an error rate lower than about 1 in 10⁶ bases), targeted capture sequencing, or microbiome sequencing in a clinical setting.
  • the target nucleic acid can include any nucleic acid of interest.
  • Target nucleic acids can include DNA, RNA, peptide nucleic acid, morpholino nucleic acid, locked nucleic acid, glycol nucleic acid, threose nucleic acid, mixtures thereof, and hybrids thereof.
  • the target nucleic acid is genomic DNA, such as a whole genome, part of a genome (e.g., individual chromosomes or fragments thereof), or mixed genomes (e.g., a microbiome). Intact chromosomes in live cells or isolated intact chromosomes can be used to achieve the longest possible contigs for any given species.
  • the target nucleic acid is mitochondrial DNA.
  • the target nucleic acid is chloroplast DNA.
  • the target nucleic acid is cDNA, synthetic or modified DNA after certain chemical or enzymatic treatments, including bisulfite treatment (e.g., for CpG methylation detection).
  • the target nucleic acid can be of any length.
  • the synthetic transposons and the methods described herein are particularly useful for preparing barcoded libraries to be sequenced and assembled to analyze long, contiguous target nucleic acids having a length of at least about any one of 10 kb, 20 kb, 50 kb, 100 kb, 200 kb, 500 kb, 1 Mb, 2 Mb, 5 Mb, 10 Mb, 20 Mb, 50 Mb, 100 Mb, 200 Mb, or more.
  • the target nucleic acid can comprise any nucleotide sequences. In some embodiments, the target nucleic acid comprises homopolymer sequences.
  • the target nucleic acid can also include repeat sequences.
  • Repeat sequences can be any of a variety of lengths including, for example, at least about any one of 2, 5, 10, 20, 30, 40, 50, 100, 250, 500, 1000 nucleotides or more. Repeat sequences can be repeated, either contiguously or non-contiguously, any of a variety of times including, for example, at least about any one of 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20 times or more.
  • the plurality of synthetic transposons is inserted in a single target nucleic acid.
  • the plurality of synthetic transposons is inserted in a plurality of target nucleic acids.
  • a plurality of target nucleic acids can include a plurality of the same target nucleic acids, a plurality of different target nucleic acids wherein some target nucleic acids are the same, or a plurality of target nucleic acids wherein all target nucleic acids are different.
  • Embodiments that involve a plurality of target nucleic acids can be carried out in multiplex formats such that reagents can be delivered simultaneously to the target nucleic acids, for example, in one or more compartments or on an array surface.
  • the plurality of target nucleic acids can include substantially all of a particular organism's genome.
  • the plurality of target nucleic acids can include at least a portion of a particular organism's genome, including, for example, at least about any one of 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome.
  • the portion can have an upper limit that is at most about any one of 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome.
  • Target nucleic acids can be obtained from any source.
  • target nucleic acids may be prepared from nucleic acid molecules obtained from a single organism or from populations of nucleic acid molecules obtained from natural sources that include one or more organisms.
  • Sources of nucleic acid molecules include, but are not limited to, organelles, cells, tissues, organs, or organisms.
  • Cells that may be used as sources of target nucleic acid molecules may be prokaryotic (bacterial cells, for example, Escherichia, Bacillus, Serratia, Salmonella, Staphylococcus, Streptococcus, Clostridium, Chlamydia, Neisseria, Treponema, Mycoplasma, Borrelia, Legionella, Pseudomonas, Mycobacterium, Helicobacter, Erwinia, Agrobacterium, Rhizobium, and Streptomyces genera); archaeal, such as Crenarchaeota, Nanoarchaeota or Euryarchaeota; or eukaryotic, such as fungi (for example, yeasts), plants, protozoans and other parasites, and animals (including insects (for example, Drosophila spp.), nematodes (for example, Caenorhabditis elegans), and mammals (for example, rat, mouse, monkey, non
  • the target nucleic acid has damaged or modified bases before, during and after preparation due to aging, exposure to acid, heat or radiation.
  • modifications include nicks, abasic sites, thymidine dimers, oxidized guanine and pyrimidines, and deaminated cytosines. If left untreated, these modifications could prevent further amplification and sequencing, and affect the accurate counting of the target nucleic acids and sequence quality.
  • Commercial repair kits are available, including NEB's PreCR Repair Mix (Cat# M0309S) and Sigma's Restorase (with DNA polymerase, Cat# R1028).
  • reagents often include DNA repair enzymes (such as uracil DNA glycosylase, Fpg, T4 Endonuclease V, Endonuclease IV and Endonuclease VIII), DNA polymerases, and ligases (such as Taq DNA ligase). Enzymes such as ligase can ligate nicks in double-stranded DNA.
  • a transposase (such as Tn5 transposase) binds the transposase recognition sites, makes staggered cuts at random sites in a target nucleic acid, and inserts synthetic transposons at the cut sites, resulting in a pair of single-stranded gaps of a fixed length flanking the inserted synthetic transposon sequence in the target nucleic acid.
  • the single-stranded gaps have duplicated sequences derived from the target nucleic acid.
  • the duplicated sequences are characteristic for each transposase; for example, the duplicated sequences are 9 nt long for Tn5 transposase, 5 nt long for Tn7 and Mu transposases, 4 nt long for murine leukemia virus, and 2 nt long for the Tc1/mariner family.
  • Transposition events are random or substantially random. For example, some studies show certain transposition biases (see, e.g., Green B et al, "Insertion site preference of Mu, Tn5, and Tn7 transposons" Mobile DNA 3:3, 2012).
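  • The relationship between the staggered cut and the duplicated sequences can be sketched on a single strand. The example below is a simplified illustration rather than a model of the enzyme: the function name, the `site` convention (index of the first base inside the staggered-cut window) and the placeholder transposon string are assumptions introduced here. The duplication lengths in the table are the ones stated above.

```python
# Duplication lengths as stated in the text above.
DUPLICATION_LENGTH = {"Tn5": 9, "Tn7": 5, "Mu": 5, "MLV": 4, "Tc1/mariner": 2}


def insert_with_duplication(target, site, transposon, dup_len=9):
    """Return the top strand after insertion and gap filling.

    The `dup_len` bases starting at `site` lie between the two staggered
    cuts, so after gap repair they appear on both sides of the insert.
    """
    if not 0 <= site <= len(target) - dup_len:
        raise ValueError("staggered-cut window must lie within the target")
    left = target[: site + dup_len]  # ends with the duplicated bases
    right = target[site:]            # starts with the duplicated bases
    return left + transposon + right


target = "AAAACCCGGGTTTACGTACGT"
product = insert_with_duplication(target, 4, "[TN]", DUPLICATION_LENGTH["Tn5"])
print(product)  # AAAACCCGGGTTT[TN]CCCGGGTTTACGTACGT
```

  • The repaired product is longer than the target by the length of the insert plus the duplication length, and the same 9 bases flank the insert on both sides.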
  • the target nucleic acids inserted with the synthetic transposons can be repaired with a polymerase without strand displacement activity and a ligase in vitro to provide repaired target nucleic acids.
  • the polymerase without strand displacement activity allows gap filling of any single-stranded nucleic acid created surrounding the insertion sites (such as single-stranded gaps having duplicated sequences endogenous to the target nucleic acid).
  • the ligase allows nick sealing for nicks having a 5' phosphate.
  • the gap filling reaction catalyzed by the polymerase without strand displacement, and the ligation reaction catalyzed by the ligase can be carried out in a single step, or in separate steps comprising first contacting the target nucleic acid inserted with the synthetic transposons with the polymerase without strand displacement activity and nucleotides, followed by contacting the resulting product with the ligase.
  • Many polymerases and ligases may be suitable for this step.
  • the polymerase is T4 DNA polymerase.
  • the repaired target nucleic acid is then fragmented by separating the first non-complementary region from the second non-complementary region in each inserted synthetic transposon.
  • a suitable separation step may be chosen based on the nature of the connection between the first non-complementary region and the second non-complementary region.
  • an endonuclease, or a combination of endonuclease with lyase may be used to cleave one or more cleavable nucleotides that are used to fuse the first strands or the second strands of the non-complementary regions, to cleave one or more cleavable nucleotides in the single-stranded linkers of the non-complementary regions, or to cleave one or more cleavable nucleotides in the bridge nucleic acid.
  • the repaired target nucleic acid may be treated with a combination of UDG and DNA lyase, such as USER™, to separate the non-complementary regions.
  • the endonuclease treatment step may occur simultaneously with the repair step, or after the repair step.
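  • As a rough illustration of cleavage at a cleavable nucleotide, the sketch below treats one strand as a string in which an engineered uracil is written as `U`. It assumes that the glycosylase removes the uracil base and the lyase then breaks the backbone at the resulting abasic site, so the strand separates into the pieces on either side. The function name is hypothetical, and end chemistry (for example, terminal phosphates) is ignored.

```python
def cleave_at_uracil(strand):
    """Split a strand at every uracil, dropping the excised nucleotide.

    Simplified model: ordinary DNA bases (A, C, G, T) are untouched, so
    only engineered uracil positions produce breaks.
    """
    return [piece for piece in strand.upper().split("U") if piece]


# One uracil linking two regions of a strand yields two fragments.
print(cleave_at_uracil("ACGTACGTUGGCCTTAA"))  # ['ACGTACGT', 'GGCCTTAA']
# A strand without uracil is returned intact.
print(cleave_at_uracil("ACGTACGT"))           # ['ACGTACGT']
```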
  • the repaired target nucleic acid may be denatured, for example by heating and/or contacting with a denaturing buffer (e.g., formamide), to separate the non-complementary regions.
  • the fragmented target nucleic acids are further treated with an exonuclease, such as a single-strand DNA exonuclease, to remove the bridge nucleic acid and/or other undesired single-stranded nucleic acid.
  • the repaired target nucleic acid is both contacted with an endonuclease to cleave the one or more cleavable nucleotides and subjected to denaturing conditions to separate the non-complementary regions.
  • the nucleic acid fragments obtained after the step of separating the non-complementary regions may be used directly for single molecule sequencing, or amplified by PCR or Rolling Circle Amplification (RCA).
  • PCR can be used to amplify a small number of copies of template DNA to generate thousands to millions of copies of a particular DNA sequence. It usually requires two short oligonucleotides as primers (e.g., 18-36mers) and a heat-stable DNA polymerase (e.g., Taq DNA polymerase) in the presence of dNTPs and buffer. Generally, it starts with an initial heating step (e.g.
  • PCR has many applications, including disease diagnosis and forensic identification, and many variations are available, including multiplex PCR, digital PCR, and allele-specific PCR.
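  • The "thousands to millions of copies" figure follows from the exponential nature of PCR. The short calculation below uses the standard idealized model in which each cycle multiplies the copy number by (1 + efficiency); the function name and the default efficiency of 1.0 (perfect doubling) are assumptions for illustration, and real reactions plateau well before this model does.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a number of PCR cycles.

    efficiency = 1.0 means every template is copied in every cycle.
    """
    return initial_copies * (1 + efficiency) ** cycles


print(pcr_copies(1, 10))  # 1024.0    (about a thousand copies)
print(pcr_copies(1, 20))  # 1048576.0 (about a million copies)
```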
  • RCA is an isothermal enzymatic process where long strand nucleic acid sequences containing multiple copies are synthesized from circular molecules of DNA or RNA, such as plasmids, bacteriophages or circular RNA genome of viroids.
  • Kits are available to use RCA technology to amplify circular nucleic acids from small or limited amount of samples in hours at a constant temperature without thermal cycling, for example, TempliPhi from GE Healthcare.
  • Some NGS platforms, such as Complete Genomics, can directly sequence RCA products.
  • the template nucleic acids that are not circular may be circularized first.
  • an oligonucleotide comprising sequences that are complementary to the first adapter sequence and the second adapter sequence may be used to anneal to the non-complementary regions after the separation step, followed by treatment with ligase to circularize the nucleic acid fragments.
  • the template nucleic acids are amplified by RCA using a primer comprising a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • the template nucleic acids are amplified by PCR using a first primer that hybridizes to the first adapter sequence or reverse complement thereof, and a second primer that hybridizes to the second adapter sequence or reverse complement thereof.
  • the template nucleic acids are amplified by PCR using a first primer having the same sequence as the first adapter sequence, and a second primer having the same sequence as the second adapter sequence.
  • the template nucleic acids are amplified by PCR using a first primer having a sequence complementary to the first adapter sequence, and a second primer having a sequence complementary to the second adapter sequence.
  • the whole-genome sequence is amplified.
  • primers that selectively hybridize to sequences of interest may be used for amplification of targeted sequences.
  • additional adapters and/or sample tags (also referred to herein as "index tags") may be included in the primers for amplification.
  • the amplification step may need a long annealing/extension time to obtain products of appropriate size.
  • the method may further comprise purification step(s) to remove short, unwanted products with only the transposon sequences.
  • the method does not comprise a step of separating the non-complementary regions, and the method comprises subjecting the repaired target nucleic acid to whole genome amplification to provide the library of template nucleic acids.
  • WGA is a method for robust amplification of an entire genome that starts with a small amount of DNA and can yield thousands- to millions-fold amplification. WGA may be especially useful for preparing a library of template nucleic acids for sequencing from a limited or precious sample, such as a single cell.
  • Exemplary techniques used for WGA include, but are not limited to, Multiple Displacement Amplification (MDA), Degenerate Oligonucleotide-Primed PCR (DOP-PCR) and Primer Extension Preamplification (PEP).
  • Exemplary commercial kits for WGA include the ILLUSTRA™ Single Cell GenomiPhi DNA Amplification kit from GE Healthcare,
  • the method may comprise a dilution step to separate the nucleic acid sample, such as the target nucleic acid, the inserted target nucleic acid, the repaired target nucleic acid, or the template nucleic acids into a plurality of compartments (such as wells in a multi-well plate).
  • the nucleic acid sample is diluted into at least about any of 5, 10, 20, 50, 100, 200, 300, 500 or more compartments to allow subsequent steps of the methods, such as amplification, to be carried out within the individual compartments.
  • each compartment comprises no more than about any of 5000, 1000, 500, 200, 100, 50, 20, 10, 5, or fewer molecules.
  • Compartment tags may be introduced to the template nucleic acids in the amplification step. Samples from the compartment can be pooled together during sequencing, and the sequencing reads may be de-multiplexed using the compartment tags. The dilution may facilitate mapping of sequencing reads to individual target nucleic acids or segments thereof.
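  • De-multiplexing by compartment tag can be sketched as a simple grouping step. The example below is illustrative only: it assumes that each read begins with a fixed-length compartment tag followed by the template sequence, which is a hypothetical read layout chosen for the sketch, and the function name is not taken from any existing tool.

```python
from collections import defaultdict


def demultiplex(reads, tag_len):
    """Group pooled reads by their leading compartment tag.

    Assumption: the first `tag_len` bases of every read are the tag.
    Returns a dict mapping tag -> list of reads with the tag removed.
    """
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        bins[tag].append(insert)
    return dict(bins)


pooled = ["AACCGGTTAC", "TTGGACGTAA", "AACCTTTTGG"]
print(demultiplex(pooled, 4))
# {'AACC': ['GGTTAC', 'TTTTGG'], 'TTGG': ['ACGTAA']}
```

  • Because each compartment holds few molecules, reads that share a tag can be mapped back to a small set of candidate target nucleic acids.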
  • the present application further provides methods of analyzing a target nucleic acid by sequencing libraries of template nucleic acids prepared using any of the methods described above.
  • a method of analyzing a target nucleic acid, or sequencing a target nucleic acid comprising: (a) preparing a library of template nucleic acids from the target nucleic acid using any one of the methods described in the "Methods of library preparation" section; and (b) sequencing the library of template nucleic acids to obtain sequencing reads.
  • each synthetic transposon comprises a different barcode sequence
  • the method further comprises assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the molecular barcodes of the synthetic transposons in the template nucleic acids.
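  • One way to picture barcode-guided assembly is as chaining fragments that share a barcode at their junction. The sketch below is a toy model under stated assumptions, not the assembly procedure of any particular software: each template fragment is represented as a hypothetical `(left_barcode, sequence, right_barcode)` tuple, the two fragments flanking one insertion are assumed to carry the same unique barcode, molecule ends carry `None`, and `dup_len` is the length of the transposase-duplicated sequence shared by neighboring fragments.

```python
def order_fragments(fragments):
    """Order (left_barcode, seq, right_barcode) tuples along the molecule."""
    by_left = {left: (left, seq, right) for left, seq, right in fragments}
    right_barcodes = {right for _, _, right in fragments if right is not None}
    starts = [f for f in fragments if f[0] is None or f[0] not in right_barcodes]
    if len(starts) != 1:
        raise ValueError("expected exactly one fragment at the start")
    current = tuple(starts[0])
    ordered = [current]
    # Follow the shared barcode from each fragment to its right neighbor.
    while (current[2] is not None and current[2] in by_left
           and len(ordered) < len(fragments)):
        current = by_left[current[2]]
        ordered.append(current)
    return ordered


def assemble(fragments, dup_len=0):
    """Join ordered fragments, collapsing the duplicated junction bases."""
    contig = ""
    for _, seq, _ in order_fragments(fragments):
        if contig and contig.endswith(seq[:dup_len]):
            contig += seq[dup_len:]
        else:
            contig += seq
    return contig


# Toy fragments in arbitrary order, with a 3-nt duplication at each junction.
frags = [("BC2", "TCCAGT", None), (None, "ACGTACGG", "BC1"),
         ("BC1", "CGGATTCC", "BC2")]
print(assemble(frags, dup_len=3))  # ACGTACGGATTCCAGT
```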
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • a method of analyzing a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other; (b) contacting the inserted target nucleic acid with a polymerase, nu
  • the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • a method of analyzing a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapt
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • a method of analyzing a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapt
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, and wherein the duplicated sequences are used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
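  • The counting step can be pictured as collapsing all reads that assemble into one contiguous sequence into a single count. The sketch below assumes that an upstream assembly step has already labeled each read with a target name and a contig identifier; both labels, and the function name, are hypothetical and introduced only for illustration.

```python
from collections import defaultdict


def count_target_copies(read_assignments):
    """Count one copy per assembled contig, regardless of read depth.

    `read_assignments` is an iterable of (target_name, contig_id) pairs,
    one pair per sequencing read.
    """
    contigs_per_target = defaultdict(set)
    for target_name, contig_id in read_assignments:
        contigs_per_target[target_name].add(contig_id)
    return {name: len(ids) for name, ids in contigs_per_target.items()}


reads = [("geneA", "contig1"), ("geneA", "contig1"), ("geneA", "contig2"),
         ("geneB", "contig3"), ("geneB", "contig3"), ("geneB", "contig3")]
print(count_target_copies(reads))  # {'geneA': 2, 'geneB': 1}
```

  • Counting contigs rather than reads keeps the estimate independent of how unevenly individual molecules were amplified.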
  • the library of template nucleic acids prepared using the methods described in the "Methods of library preparation" section can be sequenced directly or subject to any one or more of library construction steps known in the art, including, but not limited to, end repair, ligation to adapters, amplification, and sample tag addition.
  • the library construction method comprises an exome capture step.
  • the processes described herein can be used in conjunction with a variety of sequencing techniques and platforms.
  • the process to determine the nucleotide sequence of a target nucleic acid can be an automated process.
  • the sequencing is next generation sequencing.
  • the sequencing method is a massively parallel shotgun sequencing method.
  • the sequencing method yields short sequencing reads, such as sequencing reads of no more than about any one of 500 bases, 400 bases, 300 bases, 250 bases, 200 bases, 150 bases, 100 bases, or fewer.
  • Exemplary sequencing platforms include, but are not limited to, Roche 454 platforms, Illumina HISEQ™, MISEQ™, and NEXTSEQ™ platforms, Life Technologies SOLID™ platforms, ION
  • Some embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release.” Analytical
  • cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in U.S. Pat. No. 7,427,673, U.S. Pat. No. 7,414,116 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference.
  • Solexa now Illumina Inc.
  • WO 07/123,744 filed in the United States Patent and Trademark Office as U.S. Ser. No.
  • Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate short oligonucleotides and identify the incorporation of such short oligonucleotides.
  • Example SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. No. 6,969,488, U.S. Pat. No. 6,172,218, and U.S. Pat. No. 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.
  • Some embodiments can include techniques such as next-next technologies.
  • One example can include nanopore sequencing techniques (Deamer, D. W. & Akeson, M.
  • the nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin.
  • each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore.
  • nanopore sequencing techniques can be useful to confirm sequence information generated by the methods described herein.
  • Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity.
  • Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference in its entirety) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No.
  • the illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations." Science 299, 682-686 (2003); Lundquist, P.
  • SMRT real-time
  • a SMRT chip comprises a plurality of zero-mode waveguides (ZMW).
  • Each ZMW comprises a cylindrical hole tens of nanometers in diameter perforating a thin metal film supported by a transparent substrate.
  • attenuated light may penetrate the lower 20-30 nm of each ZMW creating a detection volume of about 1×10⁻²¹ L. Smaller detection volumes increase the sensitivity of detecting fluorescent signals by reducing the amount of background that can be observed.
  • SMRT chips and similar technology can be used in association with nucleotide monomers fluorescently labeled on the terminal phosphate of the nucleotide (Korlach J. et al., "Long, processive enzymatic DNA synthesis using 100% dye-labeled terminal phosphate-linked nucleotides." Nucleosides, Nucleotides and Nucleic Acids, 27:1072-1083, 2008; incorporated by reference in its entirety).
  • the label is cleaved from the nucleotide monomer on incorporation of the nucleotide into the polynucleotide. Accordingly, the label is not incorporated into the polynucleotide, increasing the signal-to-background ratio. Moreover, the need for conditions to cleave a label from a labeled nucleotide monomer is reduced.
  • a sequencing platform that may be used in association with some of the embodiments described herein is provided by Helicos Biosciences Corp.
  • true single molecule sequencing can be utilized (Harris T. D. et al., "Single Molecule DNA Sequencing of a Viral Genome." Science 320: 106-109 (2008), incorporated by reference in its entirety).
  • a library of target nucleic acids can be prepared by the addition of a 3' poly(A) tail to each target nucleic acid. The poly(A) tail hybridizes to poly(T) oligonucleotides anchored on a glass cover slip.
  • the poly(T) oligonucleotide can be used as a primer for the extension of a polynucleotide complementary to the target nucleic acid.
  • fluorescently-labeled nucleotide monomer namely, A, C, G, or T
  • Incorporation of a labeled nucleotide into the polynucleotide complementary to the target nucleic acid is detected, and the position of the fluorescent signal on the glass cover slip indicates the molecule that has been extended.
  • the fluorescent label is removed before the next nucleotide is added to continue the sequencing cycle. Tracking nucleotide incorporation in each polynucleotide strand can provide sequence information for each individual target nucleic acid.

Analysis

  • Sequencing reads can be analyzed with various methods.
  • an automated process, such as computer software, is used to analyze the sequencing reads to provide a contiguous sequence of the target nucleic acid.
  • Analysis software can be developed from scratch or based on current bioinformatics tools to include molecular barcode identification and clustering algorithms described herein for sequence assembly (de novo or using a reference).
  • Data analysis of the sequencing reads includes at least the following three steps: 1) find identical (or near-identical, for example, differing by 1 base to accommodate sequencing error) molecular barcodes, optionally together with the surrounding transposase recognition sites and other stuff sequences, and combine reads with the same barcode into one molecule (error correction); 2) use the molecular barcodes, with the help of the sequences duplicated during transposition, to link molecules together with haplotype information (mBC-assisted contig assembly); and 3) use the actual target sequences, especially variants, to help confirm the assembly of the molecules (validation).
  • Such a process can remove polymerase extension errors or recombination introduced during amplification and sequencing. Additionally, the process allows absolute molecule counting. Furthermore, cross-contamination from one sample to another sample in the lab can be removed or reduced by using the molecular barcodes.
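The first analysis step (combining reads whose molecular barcodes are identical or differ by one base) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline of FIG. 12: the function names `hamming` and `cluster_by_barcode`, the greedy most-abundant-first strategy, and the toy reads are all hypothetical.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_by_barcode(reads, max_mismatches=1):
    """Greedily group (barcode, sequence) pairs with near-identical barcodes.

    Barcodes are visited from most to least abundant, so a rare barcode that
    differs by one base from an abundant one is absorbed into that cluster.
    """
    by_barcode = defaultdict(list)
    for barcode, sequence in reads:
        by_barcode[barcode].append(sequence)

    clusters = {}  # representative barcode -> list of sequences
    for barcode in sorted(by_barcode, key=lambda b: (-len(by_barcode[b]), b)):
        for representative in clusters:
            if (len(representative) == len(barcode)
                    and hamming(representative, barcode) <= max_mismatches):
                clusters[representative].extend(by_barcode[barcode])
                break
        else:
            # No existing cluster is close enough: start a new molecule.
            clusters[barcode] = list(by_barcode[barcode])
    return clusters


# Hypothetical reads: (molecular barcode, read identifier).
reads = [
    ("ACGTACGT", "read1"),
    ("ACGTACGT", "read2"),
    ("ACGTACGA", "read3"),  # one base off: treated as a sequencing error
    ("TTTTCCCC", "read4"),
]
clusters = cluster_by_barcode(reads)
print({barcode: len(seqs) for barcode, seqs in clusters.items()})
```

Under these assumptions the number of clusters doubles as the count of original molecules, which is what makes absolute molecule counting possible.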
  • FIG. 12 shows an exemplary data analysis pipeline.
  • high quality paired-end sequencing reads are used.
  • the sequencing data are first de-multiplexed into separate sample folders.
  • reads with near identical molecular barcodes and target sequence similarity in a sample are clustered into individual, original target nucleic acid molecules.
  • Two or more paired-end reads are required to form a cluster representing a single molecule; reads that fail this requirement are mostly singletons. The number of reads per molecule can also be calculated.
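The two-read requirement and the reads-per-molecule tally can be illustrated with a short sketch. The function name `reads_per_molecule` and the cluster labels are hypothetical, and the sketch assumes that each read has already been assigned a cluster label.

```python
from collections import Counter


def reads_per_molecule(cluster_labels, min_reads=2):
    """Count reads per cluster and separate supported molecules from singletons."""
    counts = Counter(cluster_labels)
    kept = {label: n for label, n in counts.items() if n >= min_reads}
    failed = {label: n for label, n in counts.items() if n < min_reads}
    return kept, failed


# Hypothetical cluster labels, one per paired-end read.
labels = ["mol_A", "mol_A", "mol_A", "mol_B", "mol_C", "mol_C"]
kept, failed = reads_per_molecule(labels)
print(kept)    # molecules supported by two or more reads
print(failed)  # singletons that fail the requirement
```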
  • molecular barcodes and duplicated sequences generated during transposition e.g.
  • Gaps or outliers in the sequences may be present and may limit the contig size. Gaps are due to several factors. First, with bias and randomness in transposition, there may be long sequences between two transposition sites.
  • the middle regions of some long sequences may not be sequenced, or the whole long fragments may be missed, especially on sequencing platforms producing short read lengths.
  • some fragments may be missed if not all are sampled.
  • the quality of the starting nucleic acids, including fragmentation, base modifications, and nicks that are not repaired during the library construction process, can lead to gaps in sequences. For example, even with high efficiency, the gap-filling extension or nick ligation may not be 100% efficient during library preparation, and fragments with gaps may be missed in sequencing.
  • the factors above contribute to incomplete sequences. However, with multiple cells or genome input molecules (for example, 50 genome equivalents), such problems are significantly reduced, because a long sequence gap in one genome can be covered by another molecule. With more sequencing coverage, fewer gaps will be present. Longer sequencing reads will also help.
  • the frequency of transposition can also be increased to reduce large gaps. If multiple cells are used for sequencing, there may be some cell-to-cell differences in the sequences, which allows analysis of sequence variation at the single-cell level, although this is limited by the contig size that can be achieved.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence.
  • step (ii) comprises aligning sequencing reads having the same molecular barcodes in the synthetic transposons and the same duplicated sequences of the single-stranded gaps to provide aligned sequencing reads, and/or step (iii) comprises clustering the sequencing reads based on the molecular barcodes in the synthetic transposons and the duplicated sequences of the single-stranded gaps.
  • step (iii) comprises deriving a contig from the clustered sequencing reads and removing the sequences of the synthetic transposons (and if applicable, one copy of the duplicated sequences of the single-stranded gaps) from the contig to provide the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
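Removing a synthetic transposon and one copy of the duplicated gap sequence from a contig can be sketched as below. The layout assumed here (target, duplicate, transposon, duplicate, target on a single strand), the function name `excise_transposon`, and the placeholder transposon sequence are illustrative assumptions; the sketch handles a single insertion only.

```python
def excise_transposon(contig: str, transposon: str, dup_len: int = 9) -> str:
    """Remove one transposon and, if present, one copy of its flanking duplicate.

    Assumed layout: left target + dup + transposon + dup + right target.
    """
    start = contig.find(transposon)
    if start == -1:
        return contig  # nothing to remove
    end = start + len(transposon)
    left_dup = contig[max(0, start - dup_len):start]
    right_dup = contig[end:end + dup_len]
    if len(left_dup) == dup_len and left_dup == right_dup:
        # Keep the left copy; drop the transposon and the right copy.
        return contig[:start] + contig[end + dup_len:]
    # No matching duplicate found: remove the transposon only.
    return contig[:start] + contig[end:]


# Hypothetical sequences for illustration.
left, dup, right = "GGGAAA", "ACGTTGCAA", "CCCTTT"
transposon = "TTGACCATGGTCAA"  # placeholder, not a real transposon sequence
contig = left + dup + transposon + dup + right
restored = excise_transposon(contig, transposon)
print(restored)
```

The restored string equals `left + dup + right`, that is, the target sequence as it stood before insertion.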
  • the sequencing reads are assembled to provide a contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same first molecular barcode and the same second molecular barcode; (iii) determining a consensus sequence for each group of aligned sequencing reads; (iv) linking the consensus sequences together based on the molecular barcodes in the synthetic transposons to provide a contig; and (v) removing the sequences of the synthetic transposons (and if applicable, one copy of the duplicated sequences of the single-stranded gaps) from the contig.
  • step (ii) comprises aligning sequencing reads having the same first molecular barcodes, the same second molecular barcodes, and the same duplicated sequences of the single-stranded gaps; and/or step (iv) comprises linking the consensus sequences together based on the molecular barcodes in the synthetic transposons and the duplicated sequences of the single-stranded gaps to provide the contig.
  • a consensus sequence is determined for each group having at least three aligned sequencing reads.
  • a mismatch nucleotide in a group of aligned sequencing reads is considered to be an amplification or sequencing error if no more than 1/3 of the aligned sequencing reads in the group have the mismatch nucleotide.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
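The consensus rules stated above (at least three aligned reads per group; a mismatch carried by no more than 1/3 of the reads is treated as an error) can be sketched as follows. The function name `consensus`, the majority-vote base call, and the handling of positions that exceed the threshold (flagged rather than corrected) are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def consensus(aligned_reads, min_reads=3, error_fraction=Fraction(1, 3)):
    """Return (consensus, ambiguous_positions), or None if the group is too small.

    A minority base seen in no more than `error_fraction` of the reads is
    treated as an amplification or sequencing error and is outvoted.
    Positions where a minority base exceeds that fraction are reported.
    """
    n = len(aligned_reads)
    if n < min_reads:
        return None
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    bases, ambiguous = [], []
    for pos in range(length):
        counts = Counter(read[pos] for read in aligned_reads)
        (top, _), *rest = counts.most_common()
        if any(Fraction(count, n) > error_fraction for _, count in rest):
            ambiguous.append(pos)  # too frequent to dismiss as an error
        bases.append(top)
    return "".join(bases), ambiguous


# One of three reads carries a mismatch at position 2 (exactly 1/3).
result = consensus(["ACGTAC", "ACGTAC", "ACCTAC"])
print(result)

# Two of four reads disagree at position 1 (1/2, above the threshold).
split = consensus(["ACGT", "ACGT", "AGGT", "AGGT"])
print(split[1])
```

Using `Fraction` keeps the 1/3 comparison exact, so a mismatch in one of three reads is treated as an error rather than being lost to floating-point rounding.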
  • the sequencing data with the base calls and sample tag information are analyzed through a special pipeline to allow de-multiplexing of samples followed by clustering, error correction and assembly. Sequences of the transposase recognition sites can be used to identify the location of the synthetic transposons in the sequencing reads. In the case of Tn5 synthetic transposons, a total of 38-bp Tn5 recognition sequences (2×19-bp,
  • the stuff sequences in the synthetic transposons, or the fixed nucleotides in the molecular barcode sequences can also serve as additional known bases for identification of the synthetic transposons among the sequencing reads.
  • the distinct molecular barcode sequence between the transposase recognition sequences in a synthetic transposon can serve as exogenous tags.
  • the duplicate gap sequences can serve as endogenous tags.
  • Tn5 generates 9-bp duplicated sequences (4⁹, or about 2.6×10⁵, combinations) flanking the insertion sites, which provide information on the distinct positions of insertion.
  • the duplicated gap sequence can provide additional insertion-specific information for mapping sequencing reads comprising the synthetic transposons to the original location in the target nucleic acid molecule.
  • With Tn5 synthetic transposons having 20 randomly designed nucleotides in the molecular barcodes, a total of greater than 2×10¹⁷ combinations of different sequences can theoretically be used for tagging and extracting contiguity information in a target nucleic acid. This large diversity of molecular barcodes allows the inserted sequences to be different in all positions.
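The tag-diversity figures follow from counting sequences over a four-letter alphabet. The short calculation below is illustrative; the helper name `n_sequences` is hypothetical. Reading the "greater than 2×10¹⁷" figure as 20 random barcode positions plus the 9-bp duplicated sequence (29 positions in total) is an assumption, although the arithmetic is consistent with it.

```python
def n_sequences(length: int) -> int:
    """Number of distinct DNA sequences of a given length (alphabet A, C, G, T)."""
    return 4 ** length


dup_only = n_sequences(9)        # 9-bp duplicated gap sequence
barcode_only = n_sequences(20)   # 20 random barcode nucleotides
combined = n_sequences(20 + 9)   # assumed: barcode plus duplicated sequence

print(dup_only)                # 262144
print(f"{barcode_only:.2e}")   # 1.10e+12
print(f"{combined:.2e}")       # 2.88e+17
```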
  • each combination of exogenous and optionally endogenous tag sequences uniquely identifies the surrounding sequences from the target nucleic acid.
  • the distinct molecular barcodes and the duplicate gap sequences from target nucleic acids on one or both ends of the synthetic transposon can serve as unique identifiers to cluster sequencing reads with the same molecular barcode and duplicated gap sequence.
  • Amplification or sequencing errors are corrected and amplification bias is eliminated in the clustering process.
  • Such methods can be particularly useful for assembling repetitive sequence regions, such as Alu repeats, so that the contiguity of the repetitive sequences can be resolved. Insertion of the synthetic transposons can break the repetitiveness of many sequences and therefore allows better amplification and sequencing of sequences that are otherwise difficult to amplify or sequence. Consensus sequences derived from the clustered reads are then assembled together to obtain a phased uninterrupted sequence for the target nucleic acid.
  • the synthetic transposons can be identified using the two transposase recognition sequences (2×19-bp for Tn5 transposase recognition sites).
  • the randomly designed sequences in the molecular barcodes (exogenous tags) and/or the duplicate gap sequences flanking the synthetic transposon insertion position (endogenous tags; e.g., 9-nt for Tn5 transposase, which yields 4⁹ possible sequences) can be used to trace back the original position of the insertion site in the target nucleic acid and count the original target nucleic acid once for each cluster of reads mapping to the same original target nucleic acid.
  • the overlapped sequences among different clustered reads should be the same except for errors from amplification, and/or sequencing, and/or analysis steps. Therefore, a contig representing the error-corrected consensus sequence can be obtained from the sequencing reads clustered based on the sequences of the synthetic transposons and/or the duplicated gap sequences.
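Identifying a synthetic transposon by its two recognition sequences and reading off the exogenous and endogenous tags can be sketched as follows. Several assumptions apply: the 19-bp Tn5 mosaic end `AGATGTGTATAAGAGACAG` stands in for the recognition site (the sequences actually used are not reproduced in this passage), a read is assumed to span the entire insertion in the orientation shown, and the function names and the example barcode region are hypothetical.

```python
# Assumed recognition site: the 19-bp Tn5 mosaic end.
ME = "AGATGTGTATAAGAGACAG"


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_transposon(read: str, site: str = ME, dup_len: int = 9):
    """Locate reverse_complement(site) ... site and return the tags around it.

    Returns None when both recognition sites are not found in the read.
    """
    left_site = reverse_complement(site)
    start = read.find(left_site)
    if start == -1:
        return None
    inner_start = start + len(left_site)
    end = read.find(site, inner_start)
    if end == -1:
        return None
    stop = end + len(site)
    return {
        "inner": read[inner_start:end],                       # barcode region
        "left_flank": read[max(0, start - dup_len):start],    # endogenous tag
        "right_flank": read[stop:stop + dup_len],             # endogenous tag
    }


# Hypothetical read spanning one insertion.
dup = "ACGTTGCAA"
barcodes = "ACCGGTTAACCGGTTAACCG"
read = "GGCC" + dup + reverse_complement(ME) + barcodes + ME + dup + "TTGG"
hit = find_transposon(read)
print(hit)
```

In this toy read the two 9-nt flanks are identical, which is the signature that lets the duplicated gap sequence serve as an endogenous tag alongside the barcode.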
  • the library preparation, sequencing, and/or analysis methods described herein may further be supplemented by additional steps and measures in order to obtain high quality, complete sequences in a cost-effective way.
  • the target nucleic acids can be repaired before, during, and/or after transposition; transposition frequency may be increased to minimize the length of sequences between two inserted transposons; loss of nucleic acids may be minimized during processing, for example, by using single-tube processing methods, avoiding purification steps, and/or directly lysing cells to provide target nucleic acids; cluster generation for Illumina sequencing can be optimized to allow paired-end sequencing of long templates; the number of cells for each experiment can be optimized; high quality reference sequences can be used; and internal standards may be used for sequencing.
  • the methods of analyzing or sequencing a target nucleic acid as described above can be used in a variety of applications, including, but not limited to, high quality sequencing, haplotyping, de novo sequencing, resequencing (such as mutation and cancer sequencing, disease diagnosis, forensic applications, and aging analysis), single-cell sequencing, sequencing of genetically engineered species (such as plants), sequencing of highly repetitive regions, pseudogenes and structurally difficult sequences, metagenomics sequencing, structural variation detection, copy number measurement, methylation analysis, and genetic linkage analysis for identification of genes involved in disease etiology.
  • the methods have reduced amplification and sequencing errors, and reduced contamination, such as from products of previous experiments.
  • a method of haplotyping a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g., hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • a method of assembly (such as de novo assembly, resequencing, or metagenomic sequencing) of a target nucleic acid (such as genomic DNA, mitochondrial DNA, or microbial DNA), comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g., hyperactive Tn5 transposase) under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid, wherein each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases. In some embodiments, the method determines sequences of the target nucleic acids at single cell level.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • the methods of assembly disclosed herein may be used to generate reference genome sequences for human or other species of interest using multiple platforms or replicates with extremely low error rates (e.g., with lower than about 1/10, 1/100, 1/1000, or 1/10,000 the error rate of current reference genome sequences).
  • the reference genomes can then be used to speed up the assembly process for new sequences from individuals in a species.
  • a 370-bp segment in the 5' untranslated region of murine gene Foxd3 is resistant to amplification, sequencing and cloning (Nelms BL and Labosky PA, A predicted hairpin cluster correlates with barriers to PCR, sequencing and possibly BAC recombineering. Scientific Reports 1:106, 2011).
  • the random insertion of the synthetic transposons can help to reduce difficulty in sequencing due to repetitive or hairpin cluster sequences.
  • a method of sequencing repetitive regions in a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • a method of detecting a mutation comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the
  • the molecular barcode is double-stranded.
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • a method of detecting a structural variation in a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
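Assembly steps (i)-(iii) above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the transposon sequences have already been identified and trimmed from the reads, that each resulting fragment carries the barcode of the insertion at each of its ends, and that the two barcodes of one inserted transposon are identical, so that neighbouring fragments share a barcode. The `Fragment` type, the `chain_fragments` function, and the `dup_len` parameter (length of the duplicated endogenous sequence flanking each insertion) are hypothetical names introduced here, not part of the described method.

```python
from typing import List, NamedTuple, Optional


class Fragment(NamedTuple):
    """One barcoded fragment (hypothetical representation)."""
    left_barcode: Optional[str]   # barcode at the left junction; None at a molecule end
    sequence: str                 # target-derived sequence between two insertions
    right_barcode: Optional[str]  # barcode at the right junction; None at a molecule end


def _join(parts: List[str], dup_len: int) -> str:
    """Concatenate fragment sequences, dropping the duplicated flank once."""
    contig = parts[0]
    for nxt in parts[1:]:
        # Adjacent fragments both carry the duplicated endogenous sequence;
        # keep only one copy when the overlap is actually present.
        if dup_len and contig[-dup_len:] == nxt[:dup_len]:
            nxt = nxt[dup_len:]
        contig += nxt
    return contig


def chain_fragments(fragments: List[Fragment], dup_len: int = 0) -> List[str]:
    """Order fragments into contiguous sequences by matching shared barcodes."""
    by_left = {f.left_barcode: f for f in fragments if f.left_barcode is not None}
    right_barcodes = {f.right_barcode for f in fragments if f.right_barcode is not None}
    # A chain starts at a fragment whose left barcode matches no right barcode.
    starts = [f for f in fragments
              if f.left_barcode is None or f.left_barcode not in right_barcodes]
    contigs = []
    for start in starts:
        parts, current, visited = [start.sequence], start, set()
        # Follow shared barcodes; the visited set guards against barcode collisions.
        while current.right_barcode in by_left and current.right_barcode not in visited:
            visited.add(current.right_barcode)
            current = by_left[current.right_barcode]
            parts.append(current.sequence)
        contigs.append(_join(parts, dup_len))
    return contigs


# Three fragments from one molecule, supplied out of order, with a 3-base duplication.
f1 = Fragment(None, "AAACCCGGG", "BC1")
f2 = Fragment("BC1", "GGGTTTACG", "BC2")
f3 = Fragment("BC2", "ACGCAT", None)
contigs = chain_fragments([f3, f1, f2], dup_len=3)  # ["AAACCCGGGTTTACGCAT"]
copy_count = len(contigs)                           # 1
```

In this sketch each returned chain stands for one original molecule, so the number of chains gives the copy count regardless of how many reads support each fragment.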
  • a method of detecting a copy number variation in a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the method further comprises capturing or enhancing the target nucleic acid or barcoded target nucleic acid, such as by using probes that hybridize to the target nucleic acid or barcoded target nucleic acid.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • DNA methylation is a widespread epigenetic modification that plays a pivotal role in the regulation of the genomes of diverse organisms.
  • the most prevalent and widely studied form of DNA methylation in mammalian genomes occurs at the 5 carbon position of cytosine residues, usually in the context of the CpG dinucleotide.
  • Microarrays, and more recently massively parallel sequencing, have enabled the interrogation of cytosine methylation (5mC) on a genome-wide scale (Zilberman and Henikoff 2007).
  • Methods of whole genome bisulfite sequencing that can be used to detect 5mC have been described (e.g., Cokus et al. 2008; Lister et al. 2009; Harris et al. 2010).
  • Treatment of genomic DNA with sodium bisulfite chemically deaminates cytosines much more rapidly than 5mC, preferentially converting them to uracils (Clark et al. 1994).
  • With massively parallel sequencing, these can be detected on a genome-wide scale at single base-pair resolution.
  • Any of the known whole genome bisulfite sequencing workflows can be applied to genomic DNA samples barcoded with the synthetic transposons of the present application to provide methods of methylation analysis with high accuracy and efficiency.
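The bisulfite readout described above can be sketched in a few lines of Python. The sketch assumes a single top-strand read already aligned base-for-base to its reference, and that converted uracils are read as thymine after amplification; `bisulfite_convert` and `call_methylation` are hypothetical helper names, and real workflows must also handle the opposite strand, sequencing errors, and incomplete conversion.

```python
from typing import Dict, Set


def bisulfite_convert(seq: str, methylated: Set[int]) -> str:
    """Simulate bisulfite treatment: unmethylated C is read as T, 5mC stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference: str, converted_read: str) -> Dict[int, str]:
    """Call the methylation status of each reference cytosine from a converted read."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, converted_read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = "methylated"      # protected from deamination
        elif read_base == "T":
            calls[i] = "unmethylated"    # deaminated to uracil, read as T
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated={1})  # "ACGTTTGA"
calls = call_methylation(reference, read)
# {1: "methylated", 4: "unmethylated", 5: "unmethylated"}
```

Because the molecular barcodes sit in the transposon rather than in the target, a practical pipeline would need to decide how barcode cytosines are protected or decoded after conversion; the sketch leaves that question open.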
  • a method of analyzing methylation status of a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • chromosome conformation capture techniques (see, for example, Barutcus AR et al, J. Cell Physiol, 231:31-35, 2016), such as 3C, circularized 3C (i.e., 4C), carbon-copy 3C (i.e., 5C), or chromatin immunoprecipitation-based methods (such as ChIP-loop), and genome conformation capture techniques may be combined with any one of the methods of inserting synthetic transposons described herein to assess chromosome interactions.
  • Chromatin immunoprecipitation methods can be used to isolate protein-DNA complexes (such as chromatin-DNA complexes), which can then be barcoded with the synthetic transposons of the present application, and sequenced to determine the locations in the genome with which the proteins (such as histones) are associated.
  • a method of analyzing conformation of a chromosome comprising: (a) crosslinking the chromosome in vivo (such as within a cell); (b) isolating the crosslinked chromosome; (c) fragmenting (such as mechanically or enzymatically) the crosslinked chromosome to provide crosslinked chromosomal fragments; (d) ligating the ends of the crosslinked chromosomal fragments to provide ligated fragments; (e) reversing the ligated fragments to provide target nucleic acids; (f) contacting the target nucleic acids with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • the method further comprises counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • any of the methods and applications described above can be used for diagnosing a disease or a condition in an individual based on the sequence, contiguity information (such as haplotype or 3-dimensional chromosome conformation), and/or quantity of a target nucleic acid in the individual.
  • the target nucleic acid may be present in a sample obtained from the individual, including, but not limited to, a biopsy sample, buccal swab, blood sample, or sample of other bodily fluid.
  • the target nucleic acid of the individual is compared to a reference from a healthy individual to provide the diagnosis.
  • a method of diagnosing a disease or a condition of an individual based on status of a target nucleic acid comprising: (a) contacting the target nucleic acid with a composition comprising a plurality of synthetic transposons and a transposase (such as Tn5 transposase, e.g.
  • each synthetic transposon comprises a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising
  • the diagnosis comprises mutations, such as structural variations or copy number variations in a diseased tissue (such as tumor).
  • the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides (such as a uracil nucleotide).
  • the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are hybridized to each other.
  • the synthetic transposon further comprises a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • the sequencing is next generation sequencing.
  • the sequencing is massively parallel shotgun sequencing.
  • the sequencing is single molecule sequencing.
  • the method further comprises amplifying the library of template nucleic acids, such as by PCR or by RCA.
  • the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid. In some embodiments, the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • the sequencing reads are assembled to provide the contiguous sequence of the target nucleic acid by steps comprising: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same molecular barcodes in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the molecular barcodes in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid
  • the duplicated sequences are further used to assemble the contiguous sequence.
  • Some embodiments described herein comprise comparing the contiguous sequence of the target nucleic acid in a sample to a reference sequence, the copy number of the target nucleic acid in a sample to a reference value, and/or comparing the contiguous sequence and/or copy number of the target nucleic acid of one sample to that of a reference sample.
  • the reference sequence and reference values may be obtained from a database.
  • the reference sample may be a sample from a healthy or wildtype individual, tissue, or cell. For example, in some
  • the target nucleic acid from a tumor cell of an individual is analyzed and compared to the nucleic acid from a healthy cell of the same individual to provide a diagnosis.
  • kits and articles of manufacture comprising a plurality of any of the synthetic transposons described herein, and for methods of library preparation, analyzing target nucleic acids, or various applications described herein.
  • kits for preparing a library of template nucleic acids comprising: (a) a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other; (b) a transposase that recognizes the first transposon recognition site and the second transposon recognition site; and (c)
  • the kit further comprises a polymerase without strand displacement activity, such as T4 DNA polymerase. In some embodiments, the kit further comprises a ligase. In some embodiments, the kit further comprises nucleotides (such as dNTPs). In some embodiments, the transposase is Tn5 transposase, including a modified Tn5 transposase with enhanced activity, such as EZ-Tn5™. In some embodiments, wherein the first non-complementary region and/or the second non-complementary region comprises one or more cleavable nucleotides, the kit further comprises an endonuclease, such as UDG, for example, USER™.
  • each synthetic transposon further comprises a bridge nucleic acid
  • the kit further comprises a single-strand exonuclease.
  • the kit further comprises a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof.
  • the kit further comprises a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • a kit for preparing a library of template nucleic acids comprising: (a) a plurality of synthetic transposons each comprising a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and a first molecular barcode disposed between the first transposase recognition site and the first non-complementary region; wherein the second stem comprises a second transposase recognition site and a second molecular barcode disposed between the second transposase recognition site and the second non-complementary region; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the
  • the kit further comprises a polymerase without strand displacement activity, such as T4 DNA polymerase. In some embodiments, the kit further comprises a ligase. In some embodiments, the kit further comprises nucleotides (such as dNTPs). In some embodiments, the transposase is Tn5 transposase, including a modified Tn5 transposase with enhanced activity, such as EZ-Tn5™. In some embodiments, wherein the first non-complementary region and/or the second non-complementary region comprises one or more cleavable nucleotides, the kit further comprises an endonuclease, such as UDG, for example, USER™.
  • each synthetic transposon further comprises a bridge nucleic acid
  • the kit further comprises a single-strand exonuclease.
  • the kit further comprises a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof.
  • the kit further comprises a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • kits may contain one or more additional components, such as containers, buffers, reagents, cofactors, or additional agents, such as denaturing agent.
  • the kit components may be packaged together and the package may contain or be accompanied by instructions for using the kit.
  • Embodiment 1 A synthetic transposon comprising: a first fragment comprising a first stem and a first non-complementary region, and a second fragment comprising a second stem and a second non-complementary region; wherein the first stem comprises a first transposase recognition site and the second stem comprises a second transposase recognition site; wherein the first non-complementary region comprises a first strand comprising a first adapter sequence, and a second strand comprising a second adapter sequence; wherein the second non-complementary region comprises a first strand comprising the second adapter sequence, and a second strand comprising the first adapter sequence; and wherein the first non-complementary region and the second non-complementary region are connected to each other.
  • Embodiment 2 The synthetic transposon of embodiment 1, wherein the first strand of the first non-complementary region is fused to the first strand of the second non-complementary region via one or more cleavable nucleotides.
  • Embodiment 3 The synthetic transposon of embodiment 1 or embodiment 2, wherein the second strand of the first non-complementary region is fused to the second strand of the second non-complementary region via one or more cleavable nucleotides.
  • Embodiment 4 The synthetic transposon of embodiment 1, wherein the first strand and the second strand of the first non-complementary region are fused to each other via a first single-stranded linker, wherein the first strand and the second strand of the second non-complementary region are fused to each other via a second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other.
  • Embodiment 5 The synthetic transposon of embodiment 4, wherein the first single-stranded linker and the second single-stranded linker hybridize to each other.
  • Embodiment 6 The synthetic transposon of embodiment 5, wherein each of the first single-stranded linker and the second single-stranded linker comprises a cleavable nucleotide.
  • Embodiment 7 The synthetic transposon of embodiment 4, further comprising a bridge nucleic acid comprising a first single-stranded sequence that hybridizes to the first single-stranded linker, and a second single-stranded sequence that hybridizes to the second single-stranded linker, and wherein the first single-stranded linker and the second single-stranded linker are connected to each other via hybridization to the bridge nucleic acid.
  • Embodiment 8 The synthetic transposon of any one of embodiments 2, 3, and 6, wherein the cleavable nucleotide is a uracil nucleotide.
  • Embodiment 9 The synthetic transposon of any one of embodiments 1-8, wherein the first stem or the second stem comprises a terminal hairpin structure.
  • Embodiment 10 The synthetic transposon of any one of embodiments 1-8, wherein the first stem and the second stem comprise blunt ends.
  • Embodiment 11 The synthetic transposon of any one of embodiments 1-10, wherein the synthetic transposon is a DNA transposon.
  • Embodiment 12 The synthetic transposon of any one of embodiments 1-11, wherein the synthetic transposon comprises one or more modified nucleotides.
  • Embodiment 13 The synthetic transposon of any one of embodiments 1-12, wherein the first transposase recognition site and the second transposase recognition site have the same sequence.
  • Embodiment 14 The synthetic transposon of any one of embodiments 1-12, wherein the first transposase recognition site and the second transposase recognition site have different sequences.
  • Embodiment 15 The synthetic transposon of any one of embodiments 1-14, wherein the first transposase recognition site and/or the second transposase recognition site is a mosaic element (ME).
  • Embodiment 16 The synthetic transposon of any one of embodiments 1-15, further comprising a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region.
  • Embodiment 17 The synthetic transposon of embodiment 16, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence.
  • Embodiment 18 The synthetic transposon of embodiment 16 or embodiment 17, wherein the first molecular barcode and the second molecular barcode are double-stranded.
  • Embodiment 19 The synthetic transposon of embodiment 16 or embodiment 17, wherein the first molecular barcode or the second molecular barcode comprises a single-stranded region.
  • Embodiment 20 The synthetic transposon of embodiment 19, wherein the 5' terminus adjacent to the single-stranded region is phosphorylated.
  • Embodiment 21 A composition comprising a plurality of synthetic transposons of any one of embodiments 1-20.
  • Embodiment 22 The composition of embodiment 21, wherein each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, and wherein each synthetic transposon has a different barcode sequence.
  • Embodiment 23 The composition of embodiment 22, wherein the barcode sequence of each synthetic transposon comprises at least about 5 randomly or degenerately designed nucleotides.
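The barcode length and insertion frequency mentioned in these embodiments can be related by some simple arithmetic, sketched below in Python. The sketch assumes barcodes drawn uniformly and independently from all 4^n sequences of n random nucleotides; the function names and the 100 kb molecule length are hypothetical choices made only for illustration.

```python
import math


def barcode_space(n_random: int) -> int:
    """Number of distinct barcodes with n fully random nucleotides (A, C, G, T)."""
    return 4 ** n_random


def expected_insertions(molecule_len: int, bases_per_insertion: int) -> float:
    """Expected insertions at an average of one insertion per given number of bases."""
    return molecule_len / bases_per_insertion


def prob_all_distinct(k: int, n_random: int) -> float:
    """Probability that k uniformly random barcodes are all different (birthday bound)."""
    space = barcode_space(n_random)
    if k > space:
        return 0.0  # pigeonhole: more insertions than barcodes forces a repeat
    # Sum logs of (1 - i/space) for numerical stability with large k.
    log_p = sum(math.log1p(-i / space) for i in range(k))
    return math.exp(log_p)


insertions = expected_insertions(100_000, 500)   # 200.0 for a hypothetical 100 kb molecule
five_nt = barcode_space(5)                       # 1024 distinct 5-nt barcodes
p_5 = prob_all_distinct(int(insertions), 5)      # chance of no repeat with 5 random bases
p_20 = prob_all_distinct(int(insertions), 20)    # same question with 20 random bases
```

Under these assumptions, comparing `p_5` with `p_20` shows how quickly the chance of repeated barcodes falls as random positions are added, which is one reason a practitioner might choose barcodes considerably longer than the stated minimum.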
  • Embodiment 24 A method of preparing a library of template nucleic acids, comprising: (a) contacting a target nucleic acid with the composition of any one of embodiments 21-23, and a transposase under a condition that allows insertion of at least a portion of the plurality of synthetic transposons into the target nucleic acid to provide an inserted target nucleic acid; (b) contacting the inserted target nucleic acid with a polymerase, nucleotides, and a ligase to provide a repaired target nucleic acid; and (c) separating the first non-complementary region and the second non-complementary region of each synthetic transposon inserted in the repaired target nucleic acid to provide the library of template nucleic acids.
  • Embodiment 25 The method of embodiment 24, wherein step (c) comprises treating the repaired target nucleic acid with an endonuclease.
  • Embodiment 26 The method of embodiment 25, wherein the endonuclease is uracil DNA glycosylase (UDG).
  • Embodiment 27 The method of embodiment 24, wherein step (c) comprises denaturing of the repaired target nucleic acid.
  • Embodiment 28 The method of embodiment 27, further comprising treating the denatured repaired target nucleic acid with an exonuclease.
  • Embodiment 29 The method of any one of embodiments 24-28, further comprising amplifying the library of template nucleic acids.
  • Embodiment 30 The method of embodiment 29, wherein the amplifying is whole-genome amplification.
  • Embodiment 31 The method of embodiment 29, wherein the amplifying is targeted amplification.
  • Embodiment 32 The method of any one of embodiments 29-31, wherein the library of template nucleic acids is amplified by a polymerase chain reaction (PCR).
  • Embodiment 33 The method of embodiment 32, wherein the PCR comprises contacting the template nucleic acids with a first primer that hybridizes to the first adapter sequence or reverse complement thereof, and a second primer that hybridizes to the second adapter sequence or reverse complement thereof.
  • Embodiment 34 The method of any one of embodiments 29-31, wherein the library of template nucleic acids is amplified by rolling circle amplification (RCA) using a primer comprising a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
  • Embodiment 35 The method of embodiment 34, further comprising circularizing the template nucleic acids prior to the RCA.
  • Embodiment 36 The method of any one of embodiments 24-35, wherein the polymerase is T4 DNA polymerase.
  • Embodiment 37 The method of any one of embodiments 24-36, wherein the transposase is Tn5 transposase.
  • Embodiment 38 The method of any one of embodiments 24-37, wherein the plurality of synthetic transposons and the transposase are pre-mixed prior to contacting the target nucleic acid.
  • Embodiment 39 The method of any one of embodiments 24-38, wherein the target nucleic acid is selected from the group consisting of cDNA, genomic DNA, bisulfite-treated DNA, and crosslinked DNA.
  • Embodiment 40 The method of any one of embodiments 24-39, wherein the plurality of synthetic transposons is inserted into the target nucleic acid at a frequency of at least once per about 500 bases.
  • Embodiment 41 A method of analyzing a target nucleic acid, comprising: (a) preparing a library of template nucleic acids from the target nucleic acid using the method of any one of embodiments 24-40; and (b) sequencing the library of template nucleic acids to obtain sequencing reads.
  • Embodiment 42 The method of embodiment 41, wherein the sequencing is massively parallel shotgun sequencing.
  • Embodiment 43 The method of embodiment 41, wherein the sequencing is single molecule sequencing.
  • Embodiment 44 The method of any one of embodiments 41-43, wherein each synthetic transposon comprises a first molecular barcode and a second molecular barcode, wherein the first molecular barcode is disposed between the first transposon recognition site and the first non-complementary region, and the second molecular barcode is disposed between the second transposon recognition site and the second non-complementary region, wherein the first molecular barcode and the second molecular barcode have the same barcode sequence, wherein each synthetic transposon has a different barcode sequence, and wherein the method further comprises: (c) assembling a contiguous sequence of the target nucleic acid from the sequencing reads based on the barcode sequences of the synthetic transposons in the template nucleic acids.
  • Embodiment 45 The method of embodiment 44, wherein step (c) comprises: (i) identifying sequences of the synthetic transposons in the sequencing reads; (ii) aligning sequencing reads having the same barcode sequences in the synthetic transposons to provide aligned sequencing reads; and (iii) clustering the aligned sequencing reads based on the barcode sequences in the synthetic transposons to provide the contiguous sequence of the target nucleic acid.
  • Embodiment 46 The method of embodiment 44 or embodiment 45, wherein each synthetic transposon inserted in the target nucleic acid is flanked by a pair of single-stranded gaps having duplicated sequences endogenous to the target nucleic acid, and wherein the duplicated sequences are used to assemble the contiguous sequence.
  • Embodiment 47 The method of any one of embodiments 44-46, further comprising counting one copy of the target nucleic acid for all sequencing reads assembled to the contiguous sequence.
  • Embodiment 48 The method of any one of embodiments 41-47, wherein the method is used for genome assembly, haplotyping, detection of mutation, chromosomal conformation analysis, or methylation analysis.
  • Embodiment 49 The method of embodiment 48, wherein the mutation is selected from the group consisting of substitution, indel, structural variation, and copy number variation.
  • Embodiment 50 A kit for preparing a library of template nucleic acids, comprising: (a) the composition of any one of embodiments 21-23; (b) a transposase that recognizes the first transposon recognition site and the second transposon recognition site; and (c) instructions for preparing the library of template nucleic acids.
  • Embodiment 51 The kit of embodiment 50, further comprising a polymerase.
  • Embodiment 52 The kit of embodiment 51, wherein the polymerase is a T4 DNA polymerase.
  • Embodiment 53 The kit of any one of embodiments 50-52, further comprising a ligase.
  • Embodiment 54 The kit of any one of embodiments 50-53, wherein the transposase is Tn5 transposase.
  • Embodiment 55 The kit of any one of embodiments 50-54, further comprising an endonuclease.
  • Embodiment 56 The kit of embodiment 55, wherein the endonuclease is UDG.
  • Embodiment 57 The kit of any one of embodiments 50-56, further comprising a first primer and a second primer for PCR amplification of the library of template nucleic acids, wherein the first primer hybridizes to the first adapter sequence or reverse complement thereof, and the second primer hybridizes to the second adapter sequence or reverse complement thereof.
  • Embodiment 58 The kit of any one of embodiments 50-56, further comprising a primer for rolling circle amplification of the library of template nucleic acids, wherein the primer comprises a first sequence that hybridizes to the first adapter sequence and a second sequence that hybridizes to the second adapter sequence.
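As a rough illustration of embodiments 22 and 23, the number of distinct barcodes available from n random nucleotides is 4^n, and the chance that at least two transposons in a pool share a barcode can be estimated with the birthday approximation. The sketch below assumes barcodes are drawn uniformly and independently, which the embodiments do not state; the function names and the pool size of 1,000 transposons are hypothetical and chosen only for illustration.

```python
import math

def barcode_space(n_random: int) -> int:
    """Number of distinct sequences of n_random nucleotides (A, C, G, T)."""
    return 4 ** n_random

def collision_probability(n_random: int, n_transposons: int) -> float:
    """Birthday approximation of P(at least two transposons share a barcode).

    Assumes uniform, independent barcode draws (an assumption of this sketch).
    """
    space = barcode_space(n_random)
    pairs = n_transposons * (n_transposons - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny
    return -math.expm1(-pairs / space)

print(barcode_space(5))    # 1024
print(barcode_space(20))   # 1099511627776
print(collision_probability(5, 1000))
print(collision_probability(20, 1000))
```

Under these assumptions, 5 random nucleotides are too few to give every transposon in a pool of this size its own barcode, whereas 20 random nucleotides make a shared barcode very unlikely.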
  • Identical twins have identical genomic sequences except for only a few mutations.
  • The specific mutations can be determined by NGS methods and confirmed by Sanger sequencing. Therefore, data from whole genome sequencing of identical twins can be used to check for sequencing errors when the library preparation methods described in the present application are used.
  • An exemplary method of whole genome sequencing of identical human twins is described below.
  • Human gDNA is extracted from a buccal swab or a drop of blood, and the purity and yield of the gDNA are measured. Alternatively, about 10-20 human cells from each person are lysed without purification to minimize the loss of DNA.
  • A composition comprising a plurality of synthetic transposons as shown in FIG. 21 is prepared. Illumina sequencing primers Read1 and Read2 are incorporated as the first adapter sequence (e.g., F) and second adapter sequence (e.g., R), respectively, in the non-complementary regions.
  • The molecular barcodes contain a total of 20 randomly or degenerately designed nucleotides intermixed with fixed nucleotides.
  • Duplicate samples of the gDNA inserted with the plurality of synthetic transposons are prepared.
  • About 0.3 ng of gDNA is contacted with the composition comprising the plurality of synthetic transposons and Tn5 transposase under a condition that allows insertion at an average spacing of about 150 bp between adjacent transposition sites.
  • The single-stranded gaps are filled in with dNTPs and a DNA polymerase without strand displacement activity, such as T4 DNA polymerase.
  • Nicks in the DNA are ligated with E. coli ligase, which can be done separately or simultaneously with the gap-filling step.
  • The product is treated with USER™ enzyme (NEB) to cleave at the uracil nucleotide joining the two non-complementary regions in the inserted synthetic transposons to provide a library of templates.
  • The USER™ enzyme treatment step can be done separately or simultaneously with the gap-filling step and the ligation step.
  • The library of templates is subsequently PCR amplified with two corresponding primers: a first primer having Illumina sequence i5 and sequence Read1, and a second primer having Illumina sequence i7 and sequence Read2.
  • The PCR products are then purified, quantified, and sequenced with 2×300-base paired-end reads using an Illumina NGS instrument.
  • The sequencing reads are subsequently analyzed.
  • The sequencing reads contain stuff sequences, a unique molecular barcode sequence, the 19-base Tn5 recognition site, the 9-base duplicate sequence, and the target sequence in both sequencing directions.
  • The sequencing reads may additionally contain a second copy of the 9-base duplicate, the 19-base Tn5 recognition site, a unique molecular barcode sequence, and stuff sequences.
  • The sequencing reads in both directions are matched with each other and combined to yield a single sequence.
  • Sequences having identical molecular barcodes are aligned and merged into a single consensus sequence to yield the error-corrected target sequence.
  • Target sequences are assembled to provide whole genome sequence with high quality, which contains haplotype information and any structural variation or mutations.
  • The genomic sequences from the twins are compared to each other to identify mutations, which are verified by Sanger sequencing. Unverified mutations are attributed to sequencing errors and used to calculate an error rate for the sequencing method described herein. This error rate is compared to the error rates of other sequencing methods that use conventional approaches (such as commercial kits) to prepare sequencing libraries.
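The barcode-based error correction in this example, in which reads with identical molecular barcodes are merged into one consensus, can be sketched as a per-position majority vote. This is a minimal illustration and not the applicant's analysis pipeline: it assumes the reads have already been trimmed to the target sequence, are of equal length, and are in register. The function name `consensus_by_barcode` and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict

def consensus_by_barcode(reads):
    """reads: iterable of (barcode, sequence) pairs.

    Returns {barcode: consensus sequence}, taking the most common base
    at each position among the reads that share a barcode.
    """
    groups = defaultdict(list)
    for barcode, seq in reads:
        groups[barcode].append(seq)

    consensus = {}
    for barcode, seqs in groups.items():
        # zip stops at the shortest read; equal-length reads are assumed
        consensus[barcode] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus

reads = [
    ("AACGT", "GATTACA"),
    ("AACGT", "GATTACA"),
    ("AACGT", "GATCACA"),  # one read carries an error at the fourth base
    ("TTGCA", "CCGGTTA"),
]
result = consensus_by_barcode(reads)
print(result)
```

In the toy data, the single discordant base is outvoted by the two other reads sharing the same barcode, so the consensus for that barcode carries no error. A real pipeline would also need alignment and a rule for ties, which this sketch leaves out.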
  • Microbial gDNAs are extracted from the human skin surface using a swab-scrape-swab procedure. The purity and yield of the microbial gDNAs are measured.
  • A composition comprising a plurality of synthetic transposons as shown in FIG. 2H is prepared.
  • PacBio adapter sequences are incorporated as the first and second adapter sequences (i.e., F and R), respectively, in the two non-complementary regions.
  • The molecular barcodes contain a total of 20 randomly or degenerately designed nucleotides intermixed with fixed nucleotides. Duplicate samples of the gDNA inserted with the plurality of synthetic transposons are prepared.
  • Nanograms of gDNA are contacted with the composition comprising the plurality of synthetic transposons and Tn5 transposase under a condition that allows insertion at an average spacing of about 1500 bp between adjacent transposition sites.
  • The single-stranded gaps are filled in with dNTPs and a DNA polymerase without strand displacement activity, such as T4 DNA polymerase.
  • Nicks in the DNA are ligated with E. coli ligase, which can be done separately or simultaneously with the gap-filling step.
  • The product is subsequently denatured and treated with exonucleases to remove any linear nucleic acids.
  • The resulting sample is sequenced with a PacBio SMRT® instrument.
  • Sequencing data is analyzed, and microbial genomes are assembled from the sequencing data. Abundance of each microbial genome is also obtained. The data is further compared to metagenome data in databases. In this case, molecular barcodes are mainly used to link as many fragments as possible from the same original genome.
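The linking role of the barcodes described above can be sketched as follows. Because both ends of an inserted transposon carry the same barcode, two template fragments that were adjacent in the original molecule share a barcode at their junction and can be chained back together. The sketch assumes that barcodes are unique, that end barcodes have already been extracted from the reads, and that all fragments are in the same orientation. It does not model the 9-base duplicated sequences. The function name `link_fragments` and the toy data are hypothetical.

```python
def link_fragments(fragments):
    """fragments: list of (left_barcode, target_sequence, right_barcode).

    A barcode of None marks an end of the original molecule.
    Returns a list of chains, each an ordered list of target sequences.
    """
    by_left = {left: (seq, right) for left, seq, right in fragments
               if left is not None}
    right_barcodes = {right for _, _, right in fragments if right is not None}

    chains = []
    for left, seq, right in fragments:
        if left is not None and left in right_barcodes:
            continue  # another fragment precedes this one, so it is not a start
        chain = [seq]
        visited = set()  # guards against loops if a barcode is not unique
        while right in by_left and right not in visited:
            visited.add(right)
            seq, right = by_left[right]
            chain.append(seq)
        chains.append(chain)
    return chains

fragments = [
    ("BC2", "GGG", "BC3"),
    (None, "AAA", "BC1"),
    ("BC1", "CCC", "BC2"),
    ("BCX", "TTT", None),
]
chains = link_fragments(fragments)
print(chains)  # [['AAA', 'CCC', 'GGG'], ['TTT']]
```

Fragments whose barcodes match no other fragment remain as chains of length one. In a metagenomic sample, such chains mark the points where linkage to the original genome was lost.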

Abstract

The present invention relates to synthetic transposons having two non-complementary regions that comprise adapter sequences and are linked to each other. The synthetic transposons may further comprise molecular barcodes. The invention also relates to compositions comprising a plurality of synthetic transposons, and to methods and kits for library preparation. The compositions, methods, kits, and analysis tools described herein have many applications, including high-quality sequencing, haplotyping, error correction, sequencing of repetitive regions, detection of structural variations and copy number variations, methylation analysis, and quantification of target nucleic acids.
PCT/US2017/052776 2016-09-23 2017-09-21 Compositions of synthetic transposons and methods of use thereof WO2018057779A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662399188P 2016-09-23 2016-09-23
US62/399,188 2016-09-23

Publications (1)

Publication Number Publication Date
WO2018057779A1 2018-03-29

Family

ID=61691129

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/052776 WO2018057779A1 (en) 2016-09-23 2017-09-21 Compositions of synthetic transposons and methods of use thereof

Country Status (1)

Country Link
WO (1) WO2018057779A1 (fr)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012061832A1 (en) * 2010-11-05 2012-05-10 Illumina, Inc. Linking sequence reads using paired code tags
US20130289251A1 (en) * 2010-12-23 2013-10-31 Roche Diagnostics Operations, Inc. Binding agent
US20150368638A1 (en) * 2013-03-13 2015-12-24 Illumina, Inc. Methods and compositions for nucleic acid sequencing
US20150176071A1 (en) * 2013-12-20 2015-06-25 Illumina, Inc. Preserving genomic connectivity information in fragmented genomic dna samples
US20160177359A1 (en) * 2014-02-03 2016-06-23 Thermo Fisher Scientific Baltics Uab Method for controlled dna fragmentation
WO2016061517A2 (en) * 2014-10-17 2016-04-21 Illumina Cambridge Limited Contiguity preserving transposition
WO2016130704A2 (en) * 2015-02-10 2016-08-18 Illumina, Inc. Methods and compositions for analyzing cellular components

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10227574B2 (en) 2016-12-16 2019-03-12 B-Mogen Biotechnologies, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems, and methods
US11111483B2 (en) 2016-12-16 2021-09-07 B-Mogen Biotechnologies, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems and methods
US11162084B2 (en) 2016-12-16 2021-11-02 B-Mogen Biotechnologies, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems, and methods
US11278570B2 (en) 2016-12-16 2022-03-22 B-Mogen Biotechnologies, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems, and methods
US11760983B2 (en) 2018-06-21 2023-09-19 B-Mogen Biotechnologies, Inc. Enhanced hAT family transposon-mediated gene transfer and associated compositions, systems, and methods
WO2021077415A1 (en) * 2019-10-25 2021-04-29 Peking University Methylation detection and analysis of mammalian DNA
CN114391043A (en) * 2019-10-25 2022-04-22 Peking University Methylation detection and analysis of mammalian DNA
CN114391043B (en) * 2019-10-25 2024-03-15 Changping Laboratory Methylation detection and analysis of mammalian DNA

Similar Documents

Publication Publication Date Title
US11319534B2 (en) Methods and compositions for nucleic acid sequencing
US11505795B2 (en) Error detection in sequence tag directed sequencing reads
US20180087050A1 (en) Methods of inserting molecular barcodes
US20220275437A1 (en) Methods for assembling and reading nucleic acid sequences from mixed populations
EP2427569B1 (en) Use of class IIB restriction endonucleases in second-generation sequencing applications
IL287853B2 (en) Continuity-preserving transposition
US20120003657A1 (en) Targeted sequencing library preparation by genomic dna circularization
US20140228223A1 (en) High throughput paired-end sequencing of large-insert clone libraries
US20220127597A1 (en) Haplotagging - haplotype phasing and single-tube combinatorial barcoding of nucleic acid molecules using bead-immobilized tn5 transposase
JP2009529876A (en) Methods and means for sequencing nucleic acids
US20200283839A1 (en) Methods of attaching adapters to sample nucleic acids
WO2018057779A1 (en) Compositions of synthetic transposons and methods of use thereof
US20210403904A1 (en) Methods for haplotyping with short read sequence technology
WO2012008831A1 (en) Simplified de novo physical map generation from clone libraries

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17853918; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 17853918; Country of ref document: EP; Kind code of ref document: A1)