WO2023012195A1 - Method - Google Patents

Publication number
WO2023012195A1
Authority
WO
WIPO (PCT)
Prior art keywords
dna
sample
target nucleotide
ligated
generated
Prior art date
Application number
PCT/EP2022/071761
Other languages
French (fr)
Inventor
Max Jan van Min
Original Assignee
Cergentis B.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cergentis B.V.
Publication of WO2023012195A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034: Isolating an individual clone by screening libraries
    • C12N15/1093: General methods of preparing gene libraries, not provided for in other subgroups
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806: Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay

Definitions

  • the present invention relates to the field of molecular biology and more in particular to DNA technology.
  • the invention in more detail relates to the sequencing of DNA.
  • the invention relates to strategies for determining (part of) a DNA sequence of a genomic region of interest and the detection of single nucleotide variants (SNVs) and structural variants, including previously unknown variants.
  • SNVs: single nucleotide variants
  • TLA: Targeted Locus Amplification
  • TLA and other physical proximity protocols include an initial crosslinking step and are based on the concept that, in general, the chance of different fragments being crosslinked correlates inversely with the linear distance, i.e.
  • DNA fragments that ligate to a DNA fragment comprising the target nucleotide sequence are representative of the genomic region of interest comprising the target nucleotide sequence.
  • the TLA approach involves crosslinking of DNA, fragmenting the crosslinked DNA (e.g. with a restriction enzyme), followed by ligation of the crosslinked DNA fragments.
  • the ligated DNA fragments comprising the target nucleotide sequence, and thus the genomic region of interest, may be enriched, e.g. by PCR.
  • the sequence of the genomic region of interest on the linear chromosome template can subsequently be determined using (high throughput) sequencing technologies well known in the art (see WO2012/005595 and de Vree et al., Nature Biotechnology; 32; 1019-1025; 2014).
  • Proximity-based sequencing techniques, such as TLA, are thus able to enrich for unknown sequences resulting from the single nucleotide and structural genetic variation that is to be detected in clinically relevant genes.
  • TLA enables the targeted enrichment and sequencing of any locus or (trans)gene of interest and allows for detection of single nucleotide variants (SNVs) and structural variants, including previously unknown variants.
  • the TLA technology can be applied on cells, HMW DNA and FFPE samples (e.g. FFPE tumour biopsies).
  • any ligation product comprising a potential target nucleotide sequence which is not subject to enrichment - and thus does not contribute to the sequencing output - represents lost sequence information from the genomic region of interest.
  • the ligation products generated during proximity-based sequencing techniques are not optimally compatible with standard techniques used to enrich DNA for multiple target sequences.
  • the ligation products generated during TLA are not optimally amenable to multiplex PCR, wherein primers directed against multiple target nucleotide sequences would be used to enrich for ligation products comprising those target nucleotide sequences.
  • the nature of the randomised reshuffling of DNA fragments that occurs during the initial fragmentation and re-ligation steps of a TLA protocol means that (i) the TLA protocol will result in a mixture of different unique ligation products and (ii) one cannot predict the identity and organisation of individual DNA fragments within the ligation products produced. This is relevant in respect of the presence, position, and orientation of fragments within each essentially unique ligation product.
  • the use of additional target nucleotide sequences can hamper the enrichment of complete ligation products and result in aberrant and/or shorter enrichment products.
  • the amplification product generated by two primers directed against those target nucleotide sequences will be subject to a very preferential amplification in view of the short amplicon size from that specific ligation product.
  • the preferential amplification of two adjacent DNA fragments each comprising a target nucleotide will result in minimal de novo sequence information, and the loss of sequence information from the same ligation product comprising the two target nucleotide sequences from which longer amplicons comprising the target nucleotide sequence, and potential unknown sequences, could have been generated just using a single primer for a single target nucleotide sequence.
  • the present invention is directed to methods for enriching DNA from a genomic region of interest, which methods facilitate the use of multiple target nucleotide sequences within the genomic region of interest during enrichment strategies.
  • the present methods comprise a step of performing a non-selective amplification of the ligated DNA products generated from the initial fragmentation and ligation steps of a proximity-based sequencing technique, such as TLA.
  • the non-selective amplification of the ligated DNA products generates multiple copies of each unique ligation product that was present in the initial starting material (see Figure 1).
  • the amplified, ligated DNA generated is separated into at least a first sample and a second sample. Multiple, separate enrichments are then performed using primers, for example, directed against different (combinations of) target nucleotide sequences.
  • the non-selective (universal) amplification followed by physical separation and separate enrichments enables comprehensive amplification of an increased number of ligation products comprising a greater number of target nucleotide sequences from a genomic region of interest.
  • the present methods thus enable the effective use of multiple target nucleotide sequences in protocols that maximize, for example, the length and number of enriched ligation products.
  • the present invention enables improved enrichment of DNA fragments which do not comprise a target nucleotide sequence from a genomic region of interest comprising a plurality of target nucleotide sequences. As such, the present invention improves the ability to determine unknown sequences within a genomic region of interest.
  • the present invention provides a method for enriching a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the ampl
  • the present invention provides a method for enriching DNA fragments which do not comprise a target nucleotide sequence from a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of
  • the invention provides a method for making a DNA sequencing library of a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for
  • the invention provides a method for determining the sequence of a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for
  • a non-selective amplification prior to a physical separation and enrichment is particularly advantageous in a proximity-based sequencing technique because each ligation product produced during the initial fragmentation and ligation procedure is essentially unique. Accordingly, performing a non-selective amplification to amplify essentially all ligation products means that sequence information from a specific ligation product will not be lost because that ligation product is subject to an enrichment that is not directed to the target nucleotide sequence it contains, or because the ligation product consists of a combination of DNA fragments that results in short amplicons in a multiplex PCR.
  • advantages of the present methods thus include, but are not limited to, the ability to maximise the length of PCR amplicons comprising target nucleotide sequences generated and thus increase the amount of sequence information subsequently retrievable; increasing the number of ligation products that can be meaningfully enriched based on one or more target nucleotide sequences; and increasing the number of distinct PCR amplicons that can be generated from ligation products comprising one or more target nucleotide sequences.
  • the present methods may therefore maximise the enrichment of unknown sequences which are present in ligation products comprising a target nucleotide sequence generated during a proximity-based sequencing library preparation.
  • the methods also provide advantages in the analysis of samples that contain a low number of copies of a (trans)gene of interest. For instance, viral transgene sequences can integrate and occur in low frequencies. For reasons described above, the method improves the efficiency and completeness with which such transgene sequences and their integration sites can be enriched and sequenced.
  • the methods of the invention also present advantages in the analysis of samples with heterogeneous / random transgene integration sites.
  • Heterogeneous / random transgene integrations occur in a wide variety of sample types that occur naturally or are generated in industry. Random transgene integrations occur, for instance, in a variety of gene therapy products. In such samples, TLA-based detection of integration sites depends on the sequencing of the breakpoint sequence between the transgene genome and the host genome at the position of its integration site. However, given the nature of the TLA protocol, a monoplex PCR specific for a certain target nucleotide sequence will only enrich the subset of DNA fragments that have ended up in the same ligation product. Therefore, only a subset of breakpoint sequences of interest will occur in the ligation products that are enriched with a single transgene-specific target nucleotide sequence. A monoplex TLA-based enrichment will thus only enable the enrichment and sequencing of a subset of such breakpoint sequences.
  • Multiplex enrichment is thus required to increase the number of breakpoint sequences that are sequenced.
  • multiplex enrichment based on multiple transgene-specific target nucleotide sequences suffers from the drawbacks described above.
  • each ligation product comprising a breakpoint sequence is unique and will thus only be enriched if it ends up in the sample that is enriched with a transgene-specific target nucleotide sequence that also occurs in the same unique ligation product.
  • step e) means that copies of each ligation product will occur in each separated sample and that each breakpoint sequence that occurs in a ligation product generated in step c) that comprises at least one of the target nucleotide sequences used in the multiple enrichments will be enriched and sequenced.
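
The effect of step e) described above can be illustrated with a deliberately simplified toy model. This is only a sketch under stated assumptions, not the patented protocol: the product names (LP1 to LP4), the target names (T1, T2), the round-robin split and the copy number are all hypothetical, and real amplification and sampling are stochastic rather than exact. Each unique ligation product is represented as the set of target nucleotide sequences it contains, the pool is split into two samples, and each sample is enriched for one target.

```python
# Toy model: each unique ligation product is a name mapped to the set of
# target nucleotide sequences it contains (all names are hypothetical).
PRODUCTS = {
    "LP1": {"T2"},
    "LP2": {"T1"},
    "LP3": {"T1", "T2"},
    "LP4": {"T1"},
}
TARGETS = ["T1", "T2"]  # sample i is enriched for TARGETS[i]


def split_round_robin(molecules, n_samples):
    """Deal molecules into n_samples physically separate samples."""
    samples = [[] for _ in range(n_samples)]
    for i, molecule in enumerate(molecules):
        samples[i % n_samples].append(molecule)
    return samples


def enriched_products(samples, targets, products):
    """Collect the products recovered when sample i is enriched for targets[i]."""
    recovered = set()
    for sample, target in zip(samples, targets):
        recovered.update(m for m in sample if target in products[m])
    return recovered


def run(products, targets, copies):
    """Split `copies` copies of every product, then enrich each sample."""
    molecules = [name for name in sorted(products) for _ in range(copies)]
    samples = split_round_robin(molecules, len(targets))
    return enriched_products(samples, targets, products)


# copies=1 models splitting without prior non-selective amplification;
# copies=2 models an idealised amplification before the split.
no_amplification = run(PRODUCTS, TARGETS, copies=1)
with_amplification = run(PRODUCTS, TARGETS, copies=2)
print(sorted(no_amplification))
print(sorted(with_amplification))
```

In this model, a unique product that lands in a sample enriched for a target it does not contain is lost, whereas once every sample holds a copy of every product, each product containing at least one of the targets is recovered.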
  • Figure 1 Schematic overview of an illustrative method of the invention
  • Figure 2 A graphical illustration of the result of a multiplex PCR using primers complementary to four target nucleotide sequences and a universal adaptor sequence.
  • Preferential amplification of products generated by primers directed against two proximal target nucleotide sequences results in the generation of short amplification products with minimal de novo sequence information, and the loss of sequence information from the same ligation product comprising the two target nucleotide sequences from which longer amplicons comprising the target nucleotide sequence, and potential unknown sequences, could have been generated just using a single primer for a single target nucleotide sequence.
  • Figure 3 A graphical illustration of the result of separate PCR amplifications, performed on the product of the non-selective amplification of the ligation product shown in Figure 2, using primers complementary to four target nucleotide sequences and a universal adaptor sequence.
  • a method for isolating "a" DNA molecule includes isolating a plurality of molecules (e.g. 10's, 100's, 1000's, 10's of thousands, 100's of thousands, millions, or more molecules).
  • nucleic acid may include any polymer or oligomer of pyrimidine and purine bases, preferably cytosine, thymine, and uracil, and adenine and guanine, respectively (see Albert L. Lehninger, Principles of Biochemistry, at 793-800 (Worth Pub. 1982) which is herein incorporated by reference in its entirety for all purposes).
  • the present invention contemplates any deoxyribonucleotide, ribonucleotide or peptide nucleic acid component, and any chemical variants thereof, such as methylated, hydroxymethylated or glycosylated forms of these bases, and the like.
  • the polymers or oligomers may be heterogeneous or homogenous in composition, and may be isolated from naturally occurring sources or may be artificially or synthetically produced.
  • the nucleic acids may be DNA or RNA, or a mixture thereof, and may exist permanently or transitionally in single-stranded or double-stranded form, including homoduplex, heteroduplex, and hybrid states.
  • by “aligning” and “alignment” is meant the comparison of two or more nucleotide sequences based on the presence of short or long stretches of identical or similar nucleotides.
  • Methods and computer programs for alignment are well known in the art.
  • One computer program which may be used or adapted for aligning is "Align 2", authored by Genentech, Inc., which was filed with user documentation in the United States Copyright Office, Washington, D.C. 20559, on Dec. 10, 1991.
  • the present invention provides a method for enriching a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified,
  • genomic region of interest refers to a DNA sequence of an organism for which it is desirable to determine at least part of the DNA sequence.
  • a genomic region which comprises, or is suspected of comprising, an allele associated with a disease may be a genomic region of interest.
  • a genomic region which comprises a vector insertion site is another example.
  • the whole genome sequence may be determined, suitably following deselection of episomal copies of the vector / transgene sequence.
  • the genomic region of interest is the whole genome.
  • target nucleotide sequence refers to a DNA sequence of interest within a genomic region of interest.
  • the target nucleotide sequence may be a transgene or a portion thereof.
  • the target nucleotide sequence may be an allele or a portion thereof.
  • the target nucleotide sequence is used in the enrichment steps described herein.
  • the present methods comprise enrichment for a genomic region of interest comprising a plurality of target nucleotide sequences.
  • the present methods may comprise performing an enrichment for at least two, at least three, at least four, at least five, at least eight or at least ten target nucleotide sequences.
  • DNA fragments that originate from a genomic region of interest remain in proximity of each other because they are crosslinked.
  • DNA fragments of the genomic region of interest which are in the proximity of each other due to the crosslinks, are ligated.
  • This type of ligation is also referred to as proximity ligation.
  • DNA fragments comprising the target nucleotide sequence may ligate with DNA fragments within a large linear distance at the sequence level.
  • Each individual target nucleotide sequence is likely to be crosslinked to multiple other DNA fragments.
  • often more than one DNA fragment may be ligated to a fragment comprising the target nucleotide sequence and, in a sample comprising multiple copies of a genomic region of interest, each individual DNA fragment comprising the target nucleotide sequence may ligate to different combinations of DNA fragments originating from the genomic region of interest.
  • a sequence of the genomic region of interest may be built.
  • a DNA fragment ligated with the fragment comprising the target nucleotide sequence includes any fragment which may be present in ligation products.
  • ligation product means a DNA sequence which is generated by ligating DNA fragments together.
  • a ligation product comprises at least two DNA fragments.
  • the DNA fragments which are subsequently ligated have been produced by a previous fragmentation step.
  • the methods of the invention have the advantages that extensive sequence information is not required to focus on the genomic region of interest and the method is not sequence-biased (i.e. bias by using oligonucleotides and/or probes which cover the transgene of interest, allelic sequence of interest, or flanking sequences surrounding the sequence of interest, is avoided).
  • the methods of the invention may be used in the analysis of the 3D folding of regions of interest.
  • Methods for the analysis of the 3D folding of regions of interest are known in the art (see, for example, Sungalee et al (2021) Nature Genetics 53: 650-662).
  • the methods of the invention can be applied to the analysis of the 3D folding of regions of interest.
  • step a) a sample of crosslinked DNA is provided.
  • the sample may be obtained from an organism or from a tissue of an organism, or from tissue and/or cell culture, which comprises DNA.
  • the sample DNA may be obtained from any type of organism, e.g. micro-organisms, viruses, plants, fungi, animals, humans and bacteria, or combinations thereof.
  • a tissue sample from a human patient suspected of a bacterial and/or viral infection may comprise human cells, but also viruses and/or bacteria.
  • the sample may comprise cells and/or cell nuclei.
  • the sample may comprise or consist of isolated DNA.
  • the sample DNA is from a patient or a person who may be at risk of, suspected of having, or has a particular disease, for example cancer, a viral infection (e.g. HIV-1) or any other condition which warrants the investigation of their DNA.
  • the sample DNA is from a patient or a person who is undergoing or has undergone gene therapy, for example using a lentiviral vector.
  • the sample DNA is from a cell culture that has been subject to transfection or transduction with a transgene, for example using a lentiviral vector.
  • Samples may be taken from a patient and/or from diseased tissue, and may also be derived from other organisms or from separate sections of the same organism, such as samples from one patient, one sample from healthy tissue and one sample from diseased tissue. Samples may thus be analysed according to the invention and compared with a reference sample, or different samples may be analysed and compared with each other. For example, for a patient being suspected of having cancer, a biopsy may be obtained from the suspected tumour. Another biopsy may be obtained from non-diseased tissue. Both tissue biopsies may be analysed according to the invention. Genomic regions of interest may be those containing a gene associated with the cancer type (e.g.
  • the sample may be a formalin-crosslinked sample.
  • the sample may be a paraffin-embedded sample.
  • the sample may be a Formalin-Fixed Paraffin-Embedded (FFPE) sample.
  • the sample may be a tissue sample.
  • the sample may be a tumour sample.
  • the sample may be an FFPE tumour sample.
  • the sample may be a slice or a puncture from an FFPE sample.
  • crosslinking means reacting DNA at two different positions, such that these two different positions may be connected.
  • the connection between the two different positions may be direct, forming a covalent bond between DNA strands.
  • Two DNA strands may be crosslinked directly using UV-irradiation, forming covalent bonds directly between DNA strands.
  • the connection between the two different positions may be indirect, via an agent, e.g. a crosslinker molecule.
  • a first DNA section may be connected to a first reactive group of a crosslinker molecule comprising two reactive groups, and the second reactive group of the crosslinker molecule may be connected to a second DNA section, thereby crosslinking the first and second DNA section indirectly via the crosslinker molecule.
  • a crosslink may also be formed indirectly between two DNA strands via more than one molecule.
  • a typical crosslinker molecule that may be used is formaldehyde.
  • Formaldehyde induces protein-protein and DNA-protein crosslinks.
  • Formaldehyde thus may crosslink different DNA strands to each other via their associated proteins.
  • formaldehyde can react with a protein and DNA, connecting a protein and DNA via the crosslinker molecule.
  • two DNA sections may be crosslinked using formaldehyde forming a connection between a first DNA section (DNA1) and a protein
  • the protein may form a second connection with another formaldehyde molecule that connects to a second DNA section (DNA2), thus forming a crosslink which may be depicted as DNA1-crosslinker-protein-crosslinker-DNA2.
  • crosslinking according to the invention involves forming connections (directly or indirectly) between strands of DNA that are in physical proximity of each other.
  • DNA strands may be in physical proximity of each other in the cell, as DNA is highly organised, while being separated from a linear sequence point of view e.g. by 100kb.
  • as long as the crosslinking method is compatible with subsequent fragmenting and ligation steps, such crosslinking may be contemplated for the purpose of the invention.
  • sample of crosslinked DNA refers to sample DNA which has been subjected to crosslinking.
  • Crosslinking the sample DNA has the effect that the three-dimensional state of the DNA within the sample remains largely intact. This way, DNA strands that are in physical proximity of each other remain in each other’s vicinity.
  • crosslinking the sample DNA as it is present in the sample results in largely maintaining the three-dimensional architecture of the DNA.
  • the sample of crosslinked DNA is fragmented in step b). By fragmenting the crosslinked DNA, DNA fragments are produced which are held together by the crosslinks.
  • fragmenting DNA includes any technique that, when applied to the DNA, results in DNA fragments. Techniques well known in the art are sonication, shearing and/or enzymatic restriction, but other techniques can also be envisaged. Fragmenting techniques may result in random fragmentation of the DNA (e.g. sonication or shearing). Suitably, the fragmenting technique may result in non-random (i.e. targeted) fragmentation of the DNA (e.g. restriction enzymes or site-directed nucleases). Where a given step of the methods of the invention specifically requires random or non-random fragmentation, this is specified.
  • by random fragmentation of the DNA it is meant that the fragmenting technique results in DNA fragments with unknown end sequences.
  • sonication results in the fragmenting of DNA at random sites, producing ends which can be either blunt or can have 3’- or 5’-overhangs. As these DNA breakage points occur randomly, the DNA may be repaired (enzymatically), filling in possible 3’- or 5’-overhangs, such that DNA fragments are obtained which have blunt ends that allow ligation of the fragments to adaptors and/or to each other in a subsequent step.
  • the overhangs may also be made blunt ended by removing overhanging nucleotides, using e.g. exonucleases.
  • the fragmenting step b) may comprise sonication, and may be followed by enzymatic DNA end repair. Sonication results in the fragmenting of DNA at random sites, producing ends which can be either blunt or can have 3’- or 5’-overhangs. As these DNA breakage points occur randomly, the DNA may be repaired (enzymatically), filling in possible 3’- or 5’-overhangs, such that DNA fragments are obtained which have blunt ends that allow ligation of the fragments to adaptors and/or to each other in the subsequent step c). Alternatively, the overhangs may also be made blunt ended by removing overhanging nucleotides, using e.g. exonucleases.
  • the fragmenting step b) may be performed using S1 nuclease to generate blunt ended fragments.
  • the fragmenting step b) may be performed using DNase I.
  • by non-random fragmentation of the DNA it is meant that the fragmenting technique results in DNA fragments with known end sequences, i.e. that the fragmenting is targeted.
  • non-random fragmentation involves fragmenting at a specific recognition sequence.
  • the non-random fragmentation of the DNA may be performed using a site-directed nuclease or a restriction enzyme which targets a specific recognition sequence.
  • the specific recognition sequence is also referred to herein as a “restriction enzyme site” or “restriction site”.
  • the term “recognition sequence” means a specific nucleotide sequence which is recognised by a fragmenting technique (e.g. a site-directed nuclease or restriction enzyme) and directs cleavage of the DNA molecule at or near the recognition sequence.
  • the specific nucleotide sequence which is recognized may determine the frequency of cleaving, e.g. a nucleotide sequence of 6 nucleotides occurs on average every 4096 nucleotides, whereas a nucleotide sequence of 4 nucleotides occurs much more frequently, on average every 256 nucleotides.
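
The figures quoted above follow from a simple idealisation: if the four bases are assumed to be equally frequent and independent (an assumption that real genomes only approximate), one specific recognition sequence of n nucleotides is expected once every 4^n nucleotides. A minimal sketch, with hypothetical function names:

```python
def expected_site_spacing(site_length):
    """Average distance between occurrences of one specific site, assuming
    the four bases are equally frequent and independent (an idealisation)."""
    return 4 ** site_length


def expected_site_count(sequence_length, site_length):
    """Expected number of occurrences of the site in a random sequence."""
    windows = max(sequence_length - site_length + 1, 0)
    return windows / expected_site_spacing(site_length)


for n in (4, 6, 8):
    print(n, expected_site_spacing(n))
```

Under the same assumption, a shorter recognition sequence yields more, and therefore shorter, fragments from a given stretch of DNA.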
  • the fragmenting step b) comprises fragmenting with a restriction enzyme.
  • the fragmenting step b) may comprise fragmenting with one or more restriction enzymes, or combinations thereof.
  • the recognition sequence of fragmenting step b) is of 4 to 12 nucleotides in length, preferably of 4 to 8 nucleotides in length, more preferably 4 to 6 nucleotides in length.
  • the recognition sequence of fragmenting step b) is of 3, 4, 5, 6, 7, 8, 9, 10, 11 or 12 nucleotides in length.
  • restriction endonuclease and “restriction enzyme” are used interchangeably to mean an enzyme that recognizes a specific nucleotide sequence (i.e. recognition sequence) in a double-stranded DNA molecule, and will cleave both strands of the DNA molecule at or near every recognition sequence, leaving a blunt end or a 3’- or 5’- overhanging end.
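
Cleavage at every recognition sequence can be sketched in silico as follows. This is an illustrative simplification: it models a single strand only, does not represent blunt versus overhanging ends, and the recognition sequence `CATG`, the cut offset and the function name `digest` are assumptions chosen for the example rather than details taken from the source.

```python
def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of `site`.

    The cut is placed `cut_offset` bases after the first base of each
    occurrence of the site. Only one strand is modelled.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    # Drop empty fragments that arise when a cut falls at a sequence end.
    return [fragment for fragment in fragments if fragment]


example = "AACATGTTTTCATGGG"
fragments = digest(example, "CATG", cut_offset=4)
print(fragments)
```

Joining the returned fragments in order reproduces the input sequence, which is a convenient sanity check for any such digestion model.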
  • the fragmenting step b) comprises fragmenting with a site-directed nuclease, preferably a clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR-associated protein (Cas) nuclease.
  • site-directed nuclease means a DNA-cutting enzyme (nuclease) which is directed to recognize a predetermined specific nucleotide sequence (i.e. recognition sequence) and to cleave both strands of the DNA molecule at or near every recognition sequence.
  • the site-directed nuclease may be engineered to target a desired recognition sequence.
  • the site-directed nuclease may be a zinc-finger nuclease (ZFN), transcription activator-like effector nuclease (TALEN), Argonaute (Ago) protein (Song et al., Nucleic Acids Research; 2020; 48(4); e19) or a CRISPR-Cas nuclease (e.g. Cas9).
  • the site-directed nuclease is a CRISPR-Cas nuclease.
  • step b) comprises fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence using (synthetic) Cas9 or Cas12a.
  • the methods of the invention may also be performed using DNA fragmentation techniques that preferentially fragment either methylated DNA or unmethylated DNA and may be used for the targeted sequencing of either methylated or unmethylated alleles of genomic regions of interest, respectively.
  • certain restriction enzymes preferentially fragment methylated DNA (as compared to unmethylated DNA) whilst other restriction enzymes preferentially fragment unmethylated DNA (as compared to methylated DNA).
  • the methods of the invention are applicable to the sequencing of alleles in which known sequences are either methylated or unmethylated.
  • the promoter sequences of actively transcribed genes are typically unmethylated, whereas the corresponding gene body sequences can contain enriched levels of methylation.
  • the digestion of unmethylated DNA or methylated DNA will result in the deselection of either the promoter or corresponding gene body sequence, respectively.
  • the methods of the invention permit the enrichment and sequencing of the promoter or corresponding gene body sequence.
  • the methods of the invention may be used in combination with bisulfite treatment. In this way, the methods may be used for the sequencing and quantification of epigenetic changes in alleles in which known sequences are either methylated or unmethylated.
  • examples of nucleases that fragment methylated DNA include site-specific methyl-directed (MD) DNA endonucleases (e.g. GlaI). These enzymes recognise and cleave methylated DNA sequences only and do not cleave unmethylated DNA sequences (Tarasova et al. (2008) BMC Mol. Biol. 9: 7). Suitably, a restriction enzyme or site-directed nuclease may be used.
  • methylation-sensitive restriction enzymes that fragment unmethylated DNA include:
  • nucleases which preferentially fragment methylated and/or unmethylated DNA may be used.
  • the fragmenting step b) may comprise fragmenting with one or more site-directed nucleases, or combinations thereof.
  • Fragmenting with a restriction enzyme or site-directed nuclease is advantageous as it may allow greater control of the average fragment size.
  • the fragments that are formed may have compatible overhangs or blunt ends that allow ligation of the fragments in the subsequent step c).
  • restriction enzymes or site-directed nucleases with different recognition sites may be used. This is advantageous because by using different restriction enzymes or site-directed nucleases having different recognition sites, different DNA fragments can be obtained from each subsample.
  • the fragmenting step b) comprises fragmenting a plurality of subsamples, each subsample having a different recognition sequence.
  • the fragments are ligated.
  • ligating involves the joining of separate DNA fragments.
  • the DNA fragments may be blunt ended, or may have compatible overhangs (also termed sticky overhangs or sticky ends) such that the overhangs can hybridise with each other.
  • the joining of the DNA fragments may be enzymatic, with a ligase enzyme, DNA ligase.
  • a non-enzymatic ligation may also be used, as long as DNA fragments are joined, i.e. forming a covalent bond.
  • a phosphodiester bond between the hydroxyl and phosphate group of the separate strands is formed.
  • as a fragment comprising a target nucleotide sequence may be crosslinked to multiple other DNA fragments, more than one DNA fragment may be ligated to the fragment comprising the target nucleotide sequence. This may result in combinations of DNA fragments which are in proximity of each other as they are held together by the crosslinks. Different combinations and/or orders of the DNA fragments in ligated DNA fragments may be formed.
  • the recognition sequence of the restriction enzyme or site-directed nuclease is known, which makes it possible to identify the fragments, because the remains of recognition sequences, or reconstituted recognition sequences, may indicate the separation between different DNA fragments.
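The idea of locating fragment boundaries from a known recognition sequence can be sketched as follows. This is an illustration only, not the method of the invention: the site `CATG`, the example read and the function name are assumptions chosen for the example, and the sketch only handles junctions where the full site has been reconstituted (junctions leaving only partial remains of a site would need extra handling).

```python
import re


def split_at_sites(read: str, site: str) -> list[str]:
    """Split a ligated read into candidate DNA fragments at every
    occurrence of a reconstituted recognition sequence.

    The site itself is dropped from the output for simplicity; a real
    pipeline would usually keep it attached to the fragment ends.
    """
    parts = re.split(re.escape(site.upper()), read.upper())
    return [part for part in parts if part]  # discard empty pieces


if __name__ == "__main__":
    # Hypothetical read made of three fragments joined at two CATG sites.
    read = "AAGGCATGTTCCCATGGGA"
    print(split_at_sites(read, "CATG"))
```

Each returned piece could then be aligned separately to a reference sequence to see which genomic fragments were ligated together.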
  • the ligation step c) may be performed in the presence of an adaptor, ligating adaptor sequences in between fragments.
  • the adaptor may be ligated in a separate step. This is advantageous because the different fragments can be easily identified by identifying the adaptor sequences which are located in between the fragments. For example, in case DNA fragment ends were blunt ended, the adaptor sequence would be adjacent to each of the DNA fragment ends, indicating the boundary between separate DNA fragments.
  • the term “adaptor” refers to a short double-stranded oligonucleotide molecule with a limited number of base pairs, e.g. about 10 to about 30 base pairs in length, which are designed such that they can be ligated to the ends of fragments.
  • Adaptors are generally composed of two synthetic oligonucleotides which have nucleotide sequences which are partially complementary to each other. When mixing the two synthetic oligonucleotides in solution under appropriate conditions, they will anneal to each other forming a double-stranded structure.
  • one end of the adaptor molecule may be designed such that it is compatible with the end of a restriction fragment and can be ligated thereto; the other end of the adaptor can be designed so that it cannot be ligated, but this need not be the case, for instance when an adaptor is to be ligated in between DNA fragments.
  • the crosslinking is reversed in step d), which results in a pool of ligated DNA fragments that comprise two or more fragments.
  • a subpopulation of the pool of ligated DNA fragments comprises a DNA fragment which comprises the target nucleotide sequence.
  • reversing crosslinking comprises breaking the crosslinks such that the DNA that has been crosslinked is no longer crosslinked and is suitable for subsequent amplification and/or sequencing steps. For example, performing a proteinase K treatment on sample DNA that has been crosslinked with formaldehyde will digest the protein present in the sample. Because the crosslinked DNA is connected indirectly via protein, the protease treatment in itself may reverse the crosslinking between the DNA. However, the protein fragments that remain connected to the DNA may hamper subsequent sequencing and/or amplification. Hence, reversing the connections between the DNA and the protein may also result in “reversing crosslinking”. The DNA-crosslinker-protein connection may be reversed through a heating step, for example by incubating at 70°C.
  • any “reversing crosslinking” method may be contemplated wherein the DNA strands that are connected in a crosslinked sample become suitable for sequencing and/or amplification.
  • the present methods require performing a non-selective amplification of the ligated DNA products generated from the initial fragmentation and ligation steps of a proximity-based sequencing technique, such as TLA.
  • step e) of the present methods comprises a non-selective amplification of the ligated DNA generated in step d).
  • a “non-selective” amplification may also be referred to as a “universal”, “nonspecific” or “whole genome” amplification.
  • non-selective amplification refers to the use of an amplification technique which generates an amplified product that is completely representative of the initial starting material.
  • each unique ligation product is expected to be amplified and multiple copies generated.
  • the non-selective amplification of the ligation products generated in step d) of the present methods generates multiple copies of each unique ligation product that was present in the initial starting material.
  • Suitable non-selective amplification methods, for example methods for amplifying the whole genome, are known in the art (see, for example, Kittler et al. (2002) Analytical Biochemistry 300: 237-244; Hard et al. (2021) bioRxiv 439527; doi: https://doi.org/10.1101/2021.04.13.439527; Telenius et al. (1992) Genomics 13(3) 718-725; WO2019/148119; Qiagen REPLI-g FFPE Kit (Cat. No. / ID: 150243); Langmore (2002) Pharmacogenomics 3: 557-560).
  • the non-selective amplification may be multiple strand displacement (e.g. as described in Telenius et al. - as above).
  • the non-selective amplification may be performed using a Qiagen REPLI-g FFPE Kit.
  • the non-specific amplification may increase the amount of DNA in the starting material by at least 4-, 5-, 10-, 20-, 50-, 100- or 1000-fold.
  • the non-selective amplification step may be optional. That is, the non-selective amplification step may be omitted. As such, all features and embodiments of the methods described herein may be applied to corresponding methods in which the non-selective amplification step is not performed.
  • the non-selective amplification may not be required if sufficient input material is available, but physical separation of the ligation products in order to perform separate enrichments with different primers specific for a genomic region of interest will still present quality advantages in terms of the length of amplicons generated, for example.
  • the amplified, ligated DNA generated is physically separated into at least a first sample and a second sample.
  • the method may comprise separating the amplified, ligated DNA into any number of separate samples - as required.
  • the method may comprise separating the amplified, ligated DNA generated in step e) into at least 3, at least 4, at least 5, at least 8 or at least 10 sub-samples.
  • the physical separation of step f) may be performed such that at least one copy of each essentially unique ligation product (which will have been multiplied during the non-specific amplification step) is present in each separate sample generated in step f).
  • any ligation product comprising a particular target nucleotide sequence to be enriched for should be present in each sample.
  • the present methods thus minimize the loss of sequence information due to a particular ligation product not being present in a sample in which an enrichment using a given target nucleotide sequence is performed.
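Why amplification before separation reduces the risk of a ligation product being absent from a sample can be illustrated with a toy probability model. This is a sketch under stated assumptions, not part of the described methods: it assumes each copy of a ligation product lands in one of the samples independently and with equal probability, and the function name is hypothetical.

```python
from math import comb


def prob_present_in_all(copies: int, samples: int) -> float:
    """Probability that every one of `samples` aliquots receives at least
    one of `copies` molecules, ASSUMING each molecule is assigned to an
    aliquot independently and uniformly at random.

    Uses inclusion-exclusion over the set of aliquots left empty.
    """
    if samples < 1 or copies < 0:
        raise ValueError("need samples >= 1 and copies >= 0")
    return sum(
        (-1) ** j * comb(samples, j) * (1 - j / samples) ** copies
        for j in range(samples + 1)
    )


if __name__ == "__main__":
    for copies in (1, 5, 20, 100):
        p = prob_present_in_all(copies, samples=5)
        print(f"{copies:>3} copies split over 5 samples: P(all contain it) = {p:.4f}")
```

In this model a single unamplified molecule can never be present in more than one sample, whereas the probability approaches 1 as the copy number grows.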
  • the size - or amount of amplified, ligated DNA - in each separated sample is not particularly limiting and may be determined based on the specifics of the method to be performed.
  • the method comprises enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample.
  • the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample.
  • each is enriched for at least one target nucleotide sequence which is not enriched for in any other sample generated in step f).
  • Performing multiple, separate enrichments directed against different (combinations of) target nucleotide sequences enables comprehensive enrichment of an increased number of ligation products comprising a greater number of target nucleotide sequences from a genomic region of interest.
  • the present methods thus enable the effective use of multiple target nucleotide sequences in protocols that maximize, for example, the length and number of enriched ligation products.
  • the term “enriching” or “enrichment” for “DNA comprising the at least one of the plurality of target nucleotide sequences” means a process by which the (absolute) amount and/or proportion of the DNA comprising the target nucleotide sequence is increased compared to the amount and/or proportion of DNA comprising the target nucleotide sequence in the starting material (i.e. in the initial separated sample generated in step f)).
  • enrichment by amplification increases the amount and proportion of DNA comprising the target nucleotide sequence.
  • Both enrichment by degradation and capture-based enrichment increase the proportion of DNA comprising the target nucleotide sequence.
  • the methods of the invention are compatible with a wide variety of enrichment approaches. Suitable enrichment methods include, but are not limited to, PCR amplification, capture-based enrichment and/or site-directed nuclease digestion.
  • enrichment step g) may comprise performing a singleplex enrichment in one or more of the separate samples generated in step f).
  • a singleplex enrichment refers to an enrichment using a single target nucleotide sequence of the plurality of target nucleotide sequences in a sample generated in step f).
  • enrichment step g) may comprise performing a singleplex enrichment using a single target nucleotide sequence of the plurality of target nucleotide sequences in each separate sample generated in step f).
  • one enrichment corresponding to 1 of 5 target nucleotide sequences may be used individually in one sample generated in step f) (e.g. A, B, C, D or E).
  • one of the 5 target nucleotide sequences may be used individually in each of a set of 5 separate samples generated in step f) (e.g. A, B, C, D and E are each individually enriched in a separate sample).
  • enrichment step g) may comprise performing a multiplex enrichment using at least two target nucleotide sequences of the plurality of target nucleotide sequences in one or more of the separate samples generated in step f).
  • multiplex is generally used herein to refer to an enrichment strategy in which at least two target nucleotide sequences are used to enrich for ligation products comprising those target nucleotide sequences in a single sample.
  • an enrichment using 2 of 5 target nucleotide sequences may be used in combination in at least one sample generated in step f) (e.g. A and B, A and C, A and D or A and E etc.).
  • an enrichment using 3 of 5 target nucleotide sequences may be used in combination in at least one sample generated in step f) (e.g. A and B and C; A and C and D; or A and D and E etc.).
  • enrichment step g) may comprise performing a multiplex enrichment using at least two target nucleotide sequences of the plurality of target nucleotide sequences in each of the separate samples generated in step f).
  • enrichment step g) may comprise performing a multiplex enrichment using at least two, at least three, at least four, at least five, at least eight or at least ten target nucleotide sequences in one or more of the separate samples generated in step f).
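The singleplex and multiplex layouts described above can be written down as a mapping from sample to target nucleotide sequences and checked programmatically. A minimal sketch follows; the sample names, the targets A to E and the function name are illustrative assumptions. It checks the stricter embodiment in which every sample is enriched for at least one target that is not enriched for in any other sample.

```python
def each_sample_has_unique_target(plan: dict[str, set[str]]) -> bool:
    """Return True if every sample in the plan is enriched for at least one
    target nucleotide sequence that no other sample is enriched for."""
    for name, targets in plan.items():
        # Union of the targets used in all *other* samples.
        others = set().union(*(t for n, t in plan.items() if n != name))
        if not targets - others:
            return False
    return True


if __name__ == "__main__":
    singleplex = {"s1": {"A"}, "s2": {"B"}, "s3": {"C"}, "s4": {"D"}, "s5": {"E"}}
    multiplex = {"s1": {"A", "B"}, "s2": {"A", "C"}}
    redundant = {"s1": {"A", "B"}, "s2": {"A"}}
    for label, plan in [("singleplex", singleplex),
                        ("multiplex", multiplex),
                        ("redundant", redundant)]:
        print(label, each_sample_has_unique_target(plan))
```

A check of this kind may help when planning which primers or probes go into which sample, so that no sample merely repeats an enrichment performed elsewhere.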
  • the target nucleotide sequences used in separate enrichments may originate from genomic positions within a 1 kb, 2 kb or 10 kb physical distance in the linear genomic sequence.
  • the target nucleotide sequences used in separate enrichments may originate from adjacent restriction fragments generated in the non-random fragmentation.
  • Enrichment step g) may comprise a PCR amplification using primers against one or more target nucleotide sequences as described herein.
  • Enrichment step g) may comprise target nucleotide sequences originating from both DNA strands. PCR amplifications may therefore comprise primers against both DNA strands.
  • oligonucleotide primers or “primers” are used interchangeably, in general, to refer to strands of nucleotides which can prime the synthesis of DNA.
  • DNA polymerase cannot synthesize DNA de novo without primers.
  • a primer hybridises to the DNA i.e. base pairs are formed.
  • Nucleotides that can form base pairs, that are complementary to one another, are e.g. cytosine and guanine, thymine and adenine, adenine and uracil, guanine and uracil.
  • the complementarity between the primer and the existing DNA strand does not have to be 100%, i.e. not all bases of a primer need to base pair with the existing DNA strand.
  • from the 3’-end of a primer hybridised with the existing DNA strand, nucleotides are incorporated using the existing strand as a template (template directed DNA synthesis).
  • the synthetic oligonucleotide molecules which are used in an amplification reaction may be referred to as “primers”.
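The statement that primer complementarity need not be 100% can be made concrete with a small calculation. This is a simplified sketch: it considers only standard DNA Watson-Crick pairs (A-T, C-G), ignores the uracil-containing pairs mentioned above, and assumes primer and binding site have equal length. All names and sequences are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def complementarity(primer: str, template_site: str) -> float:
    """Fraction of primer positions that base pair with the template site.

    Both sequences are written 5'->3'; a perfectly matching primer equals
    the reverse complement of the site it binds.
    """
    perfect = reverse_complement(template_site)
    if len(primer) != len(perfect):
        raise ValueError("primer and template site must have equal length")
    matches = sum(p == q for p, q in zip(primer.upper(), perfect))
    return matches / len(primer)


if __name__ == "__main__":
    site = "GATTACAG"
    print(complementarity("CTGTAATC", site))  # perfect match
    print(complementarity("ATGTAATC", site))  # one mismatch at the 5' end
```

In practice mismatches near the 3’-end of a primer tend to matter more than those near the 5’-end, because synthesis starts from the 3’-end; this sketch does not model that.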
  • amplifying refers to a polynucleotide amplification reaction, namely, a population of polynucleotides that are replicated from one or more starting sequences. Amplifying may refer to a variety of amplification reactions, including but not limited to polymerase chain reaction (PCR), linear polymerase reactions, nucleic acid sequence-based amplification, rolling circle amplification and like reactions.
  • the amplified, ligated DNA may be circularised after step e) and prior to step f) or step g); and step g) may be performed by PCR enrichment using inverse primer pairs, wherein at least one of the primers of the inverse primer pair is specific for the target nucleotide sequence.
  • each primer of the inverse primer pair is specific for the target nucleotide sequence.
  • the present method may comprise the following steps: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); e’) circularizing the amplified, ligated DNA generated in step e); f) separating the circularized DNA generated in step e’) into at least a first sample and a second sample; g) enriching at least the first and the second samples of circularized DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences using inverse primer pairs; wherein the amplified, ligated DNA in the first sample is enriched for at least one
  • the present method may comprise the following steps: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; f’) circularizing the amplified, ligated DNA of at least the first and the second samples generated in step f); g) enriching at least the first and the second samples of circularized DNA generated in step f’) for DNA comprising at least one of the plurality of target nucleotide sequences using inverse primer pairs; wherein the amplified, ligated DNA in step
  • universal adaptors may be ligated to the amplified, ligated DNA after step e) and prior to step f) or step g); and step g) may be performed by PCR enrichment with primer pairs complementary to the target nucleotide sequence and the universal adaptor.
  • the present method may comprise the following steps: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); e’) ligating universal adaptors to the amplified, ligated DNA generated in step e); f) separating the amplified, ligated DNA generated in step e’) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences using primer pairs complementary to the target nucleotide
  • the present method may comprise the following steps: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; f’) ligating universal adaptors to the amplified, ligated DNA generated in step f); g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f’) for DNA comprising at least one of the plurality of target nucleotide sequences using primer pairs complementary to the target nucleotide
  • universal primers targeting the universal adaptors can be used in combination with a target sequence specific primer in a PCR-based enrichment in step g).
  • the present methods have particular advantages when used in the context of PCR-based enrichment strategies.
  • potential advantages include, but are not limited to, the ability to maximise the length of PCR amplicons comprising target nucleotide sequences generated and thus increase the amount of sequence information subsequently retrievable; increasing the number of ligation products that can be meaningfully enriched based on one or more target nucleotide sequences; and increasing the number of distinct PCR amplicons that can be generated from ligation products comprising one or more target nucleotide sequences.
  • an identifier is included in the at least one primer.
  • identifier refers to a short sequence that can be added to an adaptor or a primer or included in its sequence or otherwise used as label to provide a unique identifier.
  • Typical examples of such a sequence identifier or tag are ZIP sequences, known in the art as commonly used tags for unique detection by hybridization (Iannone et al. Cytometry 39:131-140, 2000). Identifiers are useful according to the invention, as by using such an identifier, the origin of a sample (e.g. a PCR sample) can be determined upon further processing.
  • the different nucleic acid samples may be identified using different identifiers. For instance, as according to the invention sequencing may be performed using high throughput sequencing, multiple samples may be combined. Identifiers may then assist in identifying the sequences corresponding to the different samples. Identifiers may also be included in adaptors for ligation to DNA fragments assisting in DNA fragment sequences identification. Identifiers preferably differ from each other by at least two base pairs and preferably do not contain two identical consecutive bases to prevent misreads. The identifier function can sometimes be combined with other functionalities such as adaptors or primers.
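The two stated preferences for identifiers (differing from each other by at least two base pairs, and containing no two identical consecutive bases) can be checked with a short script. This is a minimal sketch: it assumes all identifiers in a set have the same length so that differences can be counted position by position, and the function names and example sequences are illustrative.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def has_repeated_base(seq: str) -> bool:
    """True if the sequence contains two identical consecutive bases."""
    return any(x == y for x, y in zip(seq, seq[1:]))


def valid_identifier_set(identifiers: list[str], min_distance: int = 2) -> bool:
    """Check a set of identifiers against both preferences.

    ASSUMPTION: identifiers of unequal length are rejected, because the
    position-by-position comparison is only defined for equal lengths.
    """
    ids = [i.upper() for i in identifiers]
    if len({len(i) for i in ids}) > 1:
        return False
    if any(has_repeated_base(i) for i in ids):
        return False
    return all(hamming(a, b) >= min_distance for a, b in combinations(ids, 2))


if __name__ == "__main__":
    print(valid_identifier_set(["ACGT", "TGCA", "CATG"]))
    print(valid_identifier_set(["ACGT", "ACGA"]))  # differ at one position only
    print(valid_identifier_set(["AACG", "TGCA"]))  # contains "AA"
```

Requiring a minimum distance of two means a single sequencing error cannot turn one valid identifier into another.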
  • primers carrying a moiety may be used for the optional purification of (amplified) ligated DNA fragments through binding to a solid support (e.g. streptavidin-coated beads).
  • Capture-based enrichment using the moiety may then be performed as described below in the context of a hybridisation probe (except that the capture is performed by the biotin-streptavidin interaction, rather than hybridization of complementary nucleic acids).
  • the enriching step g) comprises capture-based enrichment of the at least one of the plurality of target nucleotide sequences, preferably wherein the capture-based enrichment is specific for a defined sequence at one end of the target nucleotide sequence.
  • the DNA fragments comprising the target nucleotide sequence may be captured with a hybridisation probe (also termed a capture probe) that hybridises to a target nucleotide sequence.
  • the hybridisation probe may be attached directly to a solid support, or may comprise a moiety, e.g. biotin, to allow binding to a solid support suitable for capturing biotin moieties (e.g. beads coated with streptavidin).
  • the DNA fragments comprising a target nucleotide sequence are captured thus allowing separation of ligation products comprising the target nucleotide sequence from ligation products not comprising the target nucleotide sequence.
  • such a capture step allows enrichment for ligation products comprising the target nucleotide sequence.
  • for a genomic region of interest comprising a target nucleotide sequence, at least one capture probe for the target nucleotide sequence may be used.
  • more than one probe may be used for multiple target nucleotide sequences (e.g. at least one probe for each target nucleotide sequence may be used).
  • one probe corresponding to 1 of 5 target nucleotide sequences may be used as a capture probe (A, B, C, D or E).
  • the 5 probes may be used in a combined fashion (A, B, C, D and E) to capture the genomic region of interest.
  • the present invention enables a greater number of target nucleotide sequences to be used whilst minimizing the risk of missing ligation products that only contain one target nucleotide sequence.
  • a capture probe may be used that hybridises to an adaptor sequence comprised in amplified, ligated DNA generated in step e).
  • an amplification step and capture step are combined, e.g. first performing a capture step and then an amplification step or vice versa.
  • Site-directed nuclease digestion can also be used for the selective amplification of ligation products of interest. Site-directed digestion can be used to selectively add adaptors to and enable the amplification of linear ligation products comprising the target nucleotide sequence.
  • the site-directed nuclease may be a zinc-finger nuclease (ZFN), transcription activator-like effector nuclease (TALEN), Argonaute (Ago) protein (Song et al., Nucleic Acids Research; 2020; 48(4); e19) or a CRISPR-Cas nuclease (e.g. Cas9).
  • the site-directed nuclease is a CRISPR-Cas nuclease.
  • Site-directed nuclease digestion can also be used for the selective linearization of ligation products of interest. If a proximity ligation protocol (e.g. a TLA protocol as described herein) is used for the generation of circular DNA template (e.g. circular TLA template), a site-directed nuclease can be used to selectively linearize TLA ligation products of interest. Once linearized, these sequences can be selectively enriched using PCR or capture-based approaches as described above. They can also be selectively sequenced, for example, with nanopore sequencing approaches which will very preferentially sequence linearized DNA molecules.
  • the enriching step g) comprises site-directed nuclease-based enrichment of the at least one of the plurality of target nucleotide sequences, preferably wherein the site-directed nuclease-based enrichment comprises using a site-directed nuclease followed by amplifying the digested DNA using (e.g. inverse) PCR or capture-based enrichment.
  • an amplification step and site-directed nuclease-based enrichment step are combined, e.g. first performing a site-directed nuclease-based enrichment step and then an amplification step or vice versa.
  • a capture step and site-directed nuclease-based enrichment step are combined, e.g. first performing a site-directed nuclease-based enrichment step and then a capture step or vice versa.
  • a different enrichment strategy as described herein may be performed on at least a first sample and a second sample generated during step f).
  • a PCR-based enrichment may be performed on a first sample
  • a capture-based or site-directed nuclease-based enrichment may be performed on a second sample.
  • At least part of the sequence of the enriched DNA generated from each sample in step g) may be determined.
  • At least two of the enriched samples generated in step f) are pooled together before the sequencing step is performed.
  • all of the enriched samples generated in step f) are pooled together before the sequencing step is performed.
  • the DNA may be prepared as a DNA sequencing library and/or sequenced according to standard protocols. Conventional whole genome sequencing or high-throughput sequencing (e.g. NGS) approaches can be used. Determining the sequence is preferably performed using high throughput sequencing technologies, as this is more convenient and allows a high number of sequences to be determined to cover the complete genomic region of interest.
  • DNA sequencing library means a sequencing-ready DNA library.
  • the methods of the invention generate a compatible library (e.g. an NGS compatible library) for sequencing applications.
  • a DNA sequencing library of a plurality of genomic regions of interest is made.
  • step h) is performed using whole genome sequencing.
  • the genomic region of interest comprises a transgene integration site, it may be desirable to sequence the whole genome.
  • step h) comprises determining at least part of the sequence of the enriched DNA comprising the target nucleotide sequence.
  • step h) comprises determining the whole sequence of the enriched DNA comprising the target nucleotide sequence.
  • sequencing refers to determining the order of nucleotides (base sequences) in a nucleic acid sample, e.g. DNA or RNA.
  • the step of determining the sequence of DNA preferably comprises high throughput sequencing.
  • High throughput sequencing methods are well known in the art, and in principle any method may be contemplated to be used in the invention.
  • High throughput sequencing technologies may be performed according to the manufacturer’s instructions (as e.g. provided by Roche, Illumina or Thermo Fisher).
  • sequencing adaptors may be ligated to the (amplified) undigested DNA fragments.
  • the amplified product is linear, allowing the ligation of the adaptors. Suitable ends may be provided for ligating adaptor sequences (e.g. blunt, complementary staggered ends).
  • primer(s) used for PCR or other amplification method may include adaptor sequences, such that amplified products with adaptor sequences are formed.
  • the circularized fragment may be fragmented, preferably by using for example a restriction enzyme in between primer binding sites for the inverse PCR reaction, such that DNA fragments ligated with the DNA fragment comprising the target nucleotide sequence remain intact. Sequencing adaptors may also be included in step c) of the methods of the invention.
  • long reads may be generated in the high throughput sequencing method used. Long reads may allow reading across multiple DNA fragments within undigested DNA fragments (which contain ligated DNA fragments). This way, DNA fragments of step b) may be identified. DNA fragment sequences may be compared to a reference sequence and/or compared with each other.
  • short reads may also be contemplated to read even shorter sequences, for instance, short reads of 50-100 nucleotides. In case a standard sequencing protocol would be used, this may mean that the information regarding the undigested DNA fragments may be lost. With short reads it may not be possible to identify a complete DNA fragment sequence. In case such short reads are contemplated, it may be envisioned to provide additional processing steps such that separate ligated DNA fragments, when fragmented, are ligated or equipped with identifiers, such that from the short reads, contigs may be built for the ligated DNA fragments. Such high throughput sequencing technologies involving short sequence reads may involve paired end sequencing.
  • the short reads from both ends of a DNA molecule used for sequencing may allow coupling of DNA fragments that were ligated. This is because two sequence reads can be coupled spanning a relatively large DNA sequence relative to the sequence that was determined from both ends. This way, contigs may be built for the DNA fragments.
  • the step of determining at least part of the sequence of the DNA sequence may comprise short sequence reads, but preferably longer sequence reads are determined such that DNA fragment sequences may be identified.
  • the primer sequence may be removed prior to the sequencing step h) (e.g. the high throughput sequencing step).
  • a contig may be built of the genomic region of interest.
  • overlapping reads may be obtained from which the genomic region of interest may be built.
  • the method further comprises the step of building a contig of the genomic region of interest from the determined sequences generated in step h).
  • a contig is used in connection with DNA sequence analysis, and refers to reassembled contiguous stretches of DNA derived from two or more DNA fragments having contiguous nucleotide sequences.
  • a contig may be a set of overlapping DNA fragments that provides a (partial) contiguous sequence of a genomic region of interest.
  • a contig may also be a set of DNA fragments that, when aligned to a reference sequence, may form a contiguous nucleotide sequence.
  • the term “contig” encompasses a series of (ligated) DNA fragment(s) which are ordered in such a way as to have sequence overlap of each (ligated) DNA fragment(s) with at least one of its neighbours.
  • the linked or coupled (ligated) DNA fragment(s) may be ordered either manually or, preferably, using appropriate computer programs such as FPC, PHRAP, CAP3, etc., and may also be grouped into separate contigs.
  • when in step b) a plurality of subsamples is generated, using different restriction enzymes or site-directed nucleases, overlapping reads will also be obtained. By increasing the number of subsamples, the number of overlapping fragments will increase, which may increase the reliability of the contig of the genomic region of interest that is built. From these determined sequences, which may overlap, a contig may be built. Alternatively, if sequences do not overlap, e.g. when a single restriction enzyme has been used in step b), alignment of DNA fragments with a reference sequence may allow a contig of the genomic region of interest to be built.
  • a contig is built for each ploidy.
  • the step of building a contig comprises the steps of:
  • 1) identifying the fragments of step b);
  • the step 2) of assigning the fragments to a genomic region comprises identifying the different ligation products of step d) and coupling of the different ligation products to the identified fragments.
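The overlap-based ordering of fragments described above can be illustrated with a minimal sketch. It is not the algorithm of FPC, PHRAP or CAP3; the names `overlap_length`, `greedy_contig` and `min_overlap` are hypothetical, and exact (error-free) suffix-to-prefix overlaps on a single strand are assumed.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_contig(fragments: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the two sequences sharing the largest overlap.

    Sequences without a sufficient overlap are returned as separate contigs.
    """
    contigs = list(fragments)
    while True:
        best = (0, -1, -1)  # (overlap size, index of left, index of right)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            return contigs
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
```

A greedy merge is chosen here only because it is short; real reads contain sequencing errors and both strands, which dedicated assemblers handle.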
  • the invention may be used to provide for quality control of generated sequence information.
  • sequencing errors may occur.
  • a sequencing error may occur for example during the elongation of the DNA strand, wherein an incorrect (i.e. non-complementary to the template) base is incorporated in the DNA strand.
  • a sequencing error is different from a mutation, as the original DNA which is amplified and/or sequenced would not comprise that incorrect base.
  • DNA fragment sequences may be determined, with (at least part of) sequences of DNA fragments ligated thereto, which sequences may be unique. The uniqueness of the ligated DNA fragments as they are formed in step c) may provide for quality control of the determined sequence in step h).
  • a size selection step may be performed prior to or after the enrichment step g).
  • a size selection step may be performed using gel extraction chromatography, gel electrophoresis or density gradient centrifugation, which are methods generally known in the art.
  • DNA is selected of a size between 20-20,0000 base pairs, preferably 50-10,0000 base pairs, most preferably between 100-3,000 base pairs.
  • a size separation step allows to select for (amplified) ligated DNA fragments in a size range that may be optimal for PCR amplification and/or optimal for the sequencing of long reads by next generation sequencing.
  • size selection involves techniques with which particular size ranges of molecules, e.g. (ligated) DNA fragments or amplified (ligated) DNA fragments, are selected. Techniques that can be used are for instance gel electrophoresis, size exclusion, gel extraction chromatography, but are not limited thereto, as long as molecules with a particular size can be selected, such a technique will suffice.
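As an in silico analogue of the size selection described above, the following sketch keeps only sequences within a chosen length window. The function name `select_by_size` is hypothetical, and the default window of 100-3,000 base pairs is taken from the most preferred range stated above; the wet-lab techniques themselves (gel electrophoresis, size exclusion) are not modelled.

```python
def select_by_size(fragments: list[str], min_bp: int = 100, max_bp: int = 3000) -> list[str]:
    """Return the fragments whose length lies within [min_bp, max_bp], inclusive."""
    return [seq for seq in fragments if min_bp <= len(seq) <= max_bp]
```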
  • the ligated DNA fragments generated in step c) may be further fragmented prior to or after the non-selective amplification of step e). Accordingly, the ligated DNA fragments generated in step c) may be further fragmented prior to the separation performed in step f).
  • the fragmenting step b) and the optional further fragmenting step may be aimed at obtaining ligated DNA fragments of a size which is compatible with the subsequent enrichment step (e.g. amplification step) and/or sequence determination step.
  • a further fragmenting step preferably with an enzyme, may result in ligated fragment ends which are compatible with the optional ligation of an adaptor.
  • the further fragmenting step may be performed after reversing the crosslinking, however, it is also possible to perform the further fragmenting step and/or ligation step while the DNA fragments are still crosslinked.
  • At least one adaptor may be ligated to the obtained ligated DNA fragments generated in the further fragmenting step.
  • the ends of the ligated DNA fragments need to be compatible with ligation of such an adaptor.
  • the ligated DNA fragments may be linear DNA
  • ligation of an adaptor may provide for a primer hybridisation sequence.
  • the adaptor sequence ligated with ligated DNA fragments comprising the target nucleotide sequence will provide for DNA molecules which may be amplified using PCR as described herein.
  • Ligated adapter sequences can also be used as described herein to prevent exonuclease based digestion.
  • the DNA is further fragmented with a restriction enzyme or site-directed nuclease as described herein.
  • both the fragmentation performed in step b) and further fragmenting steps comprise the use of restriction enzymes or site-directed nucleases
  • the recognition sequence of fragmentation step b) may be longer than the recognition sequence of the further fragmentation step.
  • the enzyme of step b) thus cuts at a lower frequency than the further fragmentation step. This means that the average DNA fragment size of the further fragmentation step is smaller than the average fragment size generated in step b). This way, in fragmenting step b), relatively large fragments are formed, which are subsequently ligated and the second enzyme of the further fragmentation step cuts more frequently than the enzyme of step b).
  • the recognition sequence of the fragmenting step b) is of greater length than the recognition sequence of the further fragmenting.
  • the restriction enzyme recognition site of the second fragmentation step may be longer than the recognition site of the restriction enzyme used in the first fragmentation step.
  • the second enzyme thus cuts at a lower frequency than the first enzyme. This means that the average DNA fragment size after the first fragmentation is smaller than the average fragment size obtained after the second fragmentation step. This way, in the first fragmenting step, relatively small fragments are formed, which are subsequently ligated. As the second restriction enzyme cuts less frequently, most of the DNA fragments may not comprise the restriction recognition site of the second restriction enzyme. Thus, when the ligated DNA fragments are subsequently fragmented in the second fragmentation step, many of the initial DNA fragments may remain intact.
  • if the first fragmenting step is less frequent than the second optional fragmenting step, the result would be that the initial fragments are generally further fragmented, which may result in the loss of relatively large DNA sequences that are useful for building a contig.
  • the first fragmenting step is more frequent as compared to the second optional fragmenting step, such that DNA fragments may largely remain intact, i.e. are largely not further fragmented in the second optional fragmentation step.
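The relationship between recognition-sequence length and cutting frequency that underlies the arrangements above can be made concrete with a small calculation. The sketch assumes a random sequence in which the four bases are equally frequent and independent, which real genomes only approximate; the function name `expected_fragment_size` is hypothetical.

```python
def expected_fragment_size(recognition_length: int) -> int:
    """Expected spacing (bp) between sites for a recognition sequence of the given length.

    Under the uniform-base assumption a specific n-base site occurs with
    probability (1/4)**n at any position, i.e. on average once every 4**n bp.
    """
    if recognition_length < 1:
        raise ValueError("recognition_length must be at least 1")
    return 4 ** recognition_length
```

Under this assumption an enzyme with a 4-base recognition sequence cuts on average every 256 bp and one with a 6-base recognition sequence every 4,096 bp, so the enzyme with the longer recognition sequence yields the larger average fragments.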
  • the invention provides a method for determining the sequence of a genomic region of interest comprising multiple target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target
  • the method of the third aspect of the invention may be performed as described herein with respect to the first aspect of the invention.
  • the steps of the method of the third aspect of the invention can be carried out as described herein for the corresponding steps of the first aspect of the invention.
  • all embodiments of the invention described herein with respect to the first aspect of the invention are applicable to the third aspect of the invention.
  • methods are provided for identifying the presence or absence of a genetic mutation.
  • a method for identifying the presence or absence of a genetic mutation comprising the steps a)-h) of any of methods of the second or third aspects of the invention as described above, wherein contigs are built for a plurality of samples, comprising the further steps of: i) aligning the contigs of a plurality of samples; and j) identifying the presence or absence of a genetic mutation in the genomic regions of interest from the plurality of samples.
  • a method for identifying the presence or absence of a genetic mutation comprising the steps a)-h) of any of the methods of the invention as described above, comprising the further steps of: i) aligning the contig to a reference sequence; and j) identifying the presence or absence of a genetic mutation in the genomic region of interest.
  • Genetic mutations can be identified, for instance, by comparing the contigs of multiple samples. In case one (or more) of the samples comprises a genetic mutation, this may be observed because the sequence of its contig differs from the sequence of the other samples, i.e. the presence of a genetic mutation is identified. In case no sequence differences between the contigs of the samples are observed, the absence of a genetic mutation is identified.
  • a reference sequence may also be used to which the sequence of a contig may be aligned. When the sequence of the contig of the sample is different from the sequence of the reference sequence, a genetic mutation is observed, i.e. the presence of a genetic mutation is identified. In case no sequence differences between the contig of the sample or samples and the reference sequence are observed, the absence of a genetic mutation is identified.
  • a method is provided for identifying the presence or absence of a genetic mutation, according to any of the methods as described above, without the further step of building a contig.
  • Such a method comprises the steps a)-h) of any of the methods as described above and the further steps of: i) aligning the determined sequences of the (amplified) undigested DNA fragments generated in step h) to a reference sequence; and j) identifying the presence or absence of a genetic mutation in the determined sequences.
  • a method for identifying the presence or absence of a genetic mutation wherein of a plurality of samples sequences of (amplified) undigested DNA fragments are determined, comprising the steps a)-h) of any of the methods as described above, comprising the further steps of: i) aligning the determined sequences (generated in step h)) of the (amplified) undigested DNA fragments of a plurality of samples; j) identifying the presence or absence of a genetic mutation in the determined sequences.
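Steps i) and j) of the methods above can be sketched for the simplest case, in which the determined sequence has already been aligned to the reference without gaps. The function name `find_substitutions` is hypothetical; insertions, deletions and structural variants would need a real aligner and are outside this sketch.

```python
def find_substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (1-based position, reference base, sample base) for every mismatch.

    Both sequences must already be aligned and of equal length (no gaps).
    An empty result corresponds to the absence of a detected substitution.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]
```

The same function can compare the sequences of two samples with each other instead of comparing a sample with a reference.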
  • a sample of crosslinked DNA is provided from heterogeneous cell populations (e.g. cells with different origin or cells from an organism which comprises normal cells and genetically mutated cells (e.g. cancer cells)).
  • for each genomic region of interest corresponding to different genomic environment which may e.g. be different genomic environments from different alleles in a cell or different genomic environments from different cells
  • contigs may be built.
  • the ratio of fragments or ligation products carrying an allele, transgene or genetic mutation may be determined, which may correlate to the ratio of alleles or cells carrying the genetic mutation or the transgene. Since the ligation of DNA fragments is a random process, the collection and order of DNA fragments that are part of the ligation products may be unique and represent a single cell and/or a single genomic region of interest from a cell.
  • identifying ligation products comprising the fragment with the allele, genetic mutation or transgene may also comprise identifying ligation products with a unique order and collection of DNA fragments.
  • the ratio of alleles or cells carrying a genetic mutation or transgene may be of importance in evaluation of therapies, e.g. in case patients are undergoing therapy for cancer, such as gene therapy. Cancer cells may carry a particular genetic mutation or cells may carry a particular transgene. The percentage of cells carrying such a mutation or the transgene may be a measure for the success or failure of a therapy.
  • methods are provided for determining the ratio of fragments carrying an allele, genetic mutation or transgene, and/or the ratio of ligation products carrying a genetic mutation.
  • a genetic mutation is defined as a particular genetic mutation or a selection of particular genetic mutations.
  • a method for determining the ratio of fragments carrying an allele, genetic mutation or transgene from a cell population suspected of being heterologous comprising the steps a)-h) of any of the methods as described above, comprising the further steps of: i) identifying the fragments of step b); j) identifying the presence or absence of an allele, genetic mutation or transgene in the fragments; k) determining the number of fragments carrying the allele, genetic mutation or transgene; l) determining the number of fragments not carrying the allele, genetic mutation or transgene;
  • a method for determining the ratio of ligation products carrying a fragment with an allele, genetic mutation or transgene from a cell population suspected of being heterologous comprising the steps a)-h) of any of the methods as described above, comprising the further steps of: i) identifying the fragments of step b); j) identifying the presence or absence of an allele, genetic mutation or transgene in the fragments; k) identifying the ligation products of step c) carrying the fragments with or without the allele, genetic mutation or transgene; l) determining the number of ligation products carrying the fragments with the allele, genetic mutation or transgene; m) determining the number of ligation products carrying the fragments without the allele, genetic mutation or transgene; n) calculating the ratio of ligation products carrying the allele, genetic mutation or transgene.
  • the presence or absence of an allele, genetic mutation or transgene may be identified in step j) by aligning to a reference sequence and/or by comparing DNA fragment sequences of a plurality of samples.
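The counting steps above can be sketched as follows. This is an illustration under stated assumptions: the presence of the allele, mutation or transgene is reduced to an exact match of a marker string, the ratio is expressed as the carrying count divided by the total count, and the name `carrier_fraction` is hypothetical.

```python
def carrier_fraction(sequences: list[str], marker: str) -> float:
    """Fraction of fragments (or ligation products) that contain `marker`."""
    carrying = sum(1 for seq in sequences if marker in seq)
    not_carrying = len(sequences) - carrying
    total = carrying + not_carrying
    if total == 0:
        raise ValueError("no sequences to count")
    return carrying / total
```

Whether the ratio is reported against the total count or against the non-carrying count is a choice of convention; the text above leaves it open.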
  • an identified genetic mutation may be a SNP, single nucleotide polymorphism, an insertion, an inversion and/or a translocation.
  • the number of fragments and/or ligation products from a sample carrying the deletion and/or insertion may be compared with a reference sample in order to identify the deletion and/or insertion.
  • a deletion, insertion, inversion and/or translocation may also be identified based on the presence of chromosomal breakpoints in analyzed fragments.
  • the presence or absence of methylated nucleotides is determined in DNA fragments, ligated DNA fragments, and/or genomic regions of interest.
  • the DNA of step a)-g) may be treated with bisulphite.
  • Treatment of DNA with bisulphite converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected.
  • bisulphite treatment introduces specific changes in the DNA sequence that depend on the methylation status of individual cytosine residues, yielding single-nucleotide resolution information about the methylation status of a segment of DNA.
  • methylated nucleotides may be identified.
  • sequences from a plurality of samples treated with bisulphite may also be aligned, or a sequence from a sample treated with bisulphite may be aligned to a reference sequence.
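The effect of bisulphite treatment on the determined sequence, and how methylation can be read back by comparison with a reference, can be sketched in silico. The sketch considers one strand only and assumes complete conversion; converted cytosines appear as T because uracil is read as thymine after amplification and sequencing. The function names and the 0-based position convention are hypothetical.

```python
def bisulphite_convert(sequence: str, methylated_positions: set[int]) -> str:
    """Simulate the read obtained after bisulphite treatment.

    Unmethylated C is reported as T; C at a methylated (0-based) position stays C.
    """
    return "".join(
        "T" if base == "C" and index not in methylated_positions else base
        for index, base in enumerate(sequence)
    )


def call_methylation(reference: str, converted_read: str) -> dict[int, bool]:
    """Map each reference cytosine position to True (methylated) or False."""
    calls = {}
    for index, (ref_base, read_base) in enumerate(zip(reference, converted_read)):
        if ref_base == "C":
            if read_base == "C":
                calls[index] = True
            elif read_base == "T":
                calls[index] = False
    return calls
```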
  • Example 1 Illustrative example of targeted sequencing of rare integrated viruses and integration sites
  • Cultured cells are washed with PBS and fixed with PBS/10% FCS/2% formaldehyde for 10 minutes at RT. The cells are subsequently washed and collected, taken up in lysis buffer (50 mM Tris-HCl pH 7.5, 150 mM NaCl, 5 mM EDTA, 0.5% NP-40, 1% TX-100 and 1X Complete protease inhibitors (Roche #11245200)) and incubated for 10 minutes on ice. Cells are subsequently washed and taken up in MilliQ.
  • the fixated lysed cells are digested with a restriction enzyme targeting a restriction site sequence which occurs 5 times in the viral genome sequence. This means that the fragmentation of the viral genome with this restriction enzyme results in the generation of seven fragments.
  • Target nucleotide sequences are chosen in proximity to each of the restriction sites.
  • the restriction enzyme is heat-inactivated and subsequently a ligation step is performed using T4 DNA Ligase (Roche, #799009).
  • Non-selective amplification of the ligation product is performed using multiple strand displacement (MSD) amplification (Telenius et al. (1992) Genomics 13(3) 718-725).
  • the MSD technique employs a unique and highly processive mesophilic DNA polymerase, phi29.
  • the resulting product consists of long fragments (10-50 kb), with good amplification and representation.
  • the sample is digested with a fragmenting strategy that, on average, will result in fragments of around 3000 bp in size (as a result of which the resulting circularised DNA can be amplified effectively).
  • the product of the non-selective amplification is separated in 7 separate tubes.
  • the primers used for the PCR-enrichment are designed as inverted unique primers specific for the target nucleotide sequence.
  • the amplified DNA can be library prepped and sequenced according to standard protocols.
  • Deparaffinization buffer: 1x CutSmart buffer (Invitrogen) containing 0.02% Igepal
  • 10x Ligation buffer: 660 mM Tris pH 7.5, 50 mM MgCl2, 50 mM DTT, 10 mM ATP
  • FFPE Deparaffinization 1 Samples are heated in Deparaffinization buffer for 3 min at 80°C while shaking at 900 RPM, before centrifugation for 2 min, 10,000 x g at 40°C. The paraffin layer is removed using a pipette tip. The steps from heating to paraffin layer removal are then repeated. SDS is added to the supernatant and the tissue is transferred with the buffer to a 130 µl Covaris Screw Cap microTUBE. Samples are sonicated on a Covaris M220 for 300 seconds, Duty factor 20%, Power 75 Watts, 200 cycles/burst at 20°C. Samples are transferred back to the same 1.5 ml Eppendorf tube. CutSmart buffer is added before incubation for 2 hours at 80°C while shaking at 900 RPM. Triton X-100 is then added before incubation for 30 minutes at 37°C while shaking at 900 RPM.
  • the sample is cooled to room temperature (RT).
  • Ligation buffer, T4 DNA ligase and deionized water are then added and the sample incubated for 2 hours at RT while tumbling.
  • NaCl, SDS and Proteinase K (Roche) are then added before incubating for 1 hour at 56°C and 16 hours at 80°C.
  • DNA is purified using NucleoMag P-beads (Macherey-Nagel).
  • Non-selective amplification of the ligation product is performed using the RepliG FFPE kit according to the manufacturer’s instructions.
  • Amplicons are fragmented to an average size of 2 kb. Universal adaptors are ligated to the generated fragments according to the manufacturer’s instructions.
  • the product of the non-selective amplification, fragmentation and adaptor ligation is separated in 10 separate tubes.
  • Separate amplifications are performed with 10 different primer sets consisting of universal primers specific for the adaptors ligated previously and sets of 10 primers specific for different restriction fragments of the BRCA gene resulting from the first fragmentation step.
  • the primers used in each individual multiplex are spaced evenly across the BRCA gene (i.e. of the 100 primers used in all amplifications, the first mix represents the 1st, 11th, 21st, etc. position across the gene of interest, the second mix the 2nd, 12th, 22nd, etc.).
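The interleaved assignment of primers to mixes described above can be sketched with a stride over a position-ordered primer list. The function name `interleave_into_mixes` and the primer labels in the usage note are hypothetical placeholders, not actual primer names.

```python
def interleave_into_mixes(primers: list[str], n_mixes: int = 10) -> list[list[str]]:
    """Distribute position-ordered primers so that mix k receives every n_mixes-th primer.

    With 100 primers and 10 mixes, the first mix holds positions 1, 11, 21, ...
    and the second mix holds positions 2, 12, 22, ...
    """
    if n_mixes < 1:
        raise ValueError("n_mixes must be at least 1")
    return [primers[start::n_mixes] for start in range(n_mixes)]
```

For example, `interleave_into_mixes([f"P{i}" for i in range(1, 101)])` returns ten lists of ten primers each, so that neighbouring primer positions along the gene never share a tube.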
  • the amplified DNA can be library prepped and sequenced according to standard protocols.

Abstract

The present invention relates to a method for enriching a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample.

Description

METHOD
FIELD OF THE INVENTION
The present invention relates to the field of molecular biology and more in particular to DNA technology. The invention in more detail relates to the sequencing of DNA. The invention relates to strategies for determining (part of) a DNA sequence of a genomic region of interest and the detection of single nucleotide variants (SNVs) and structural variants, including previously unknown variants.
BACKGROUND
A number of targeted sequencing approaches have been developed that rely on the physical proximity of sequences to generate and enrich sequencing templates. For example, Targeted Locus Amplification (TLA) enables targeted enrichment and complete sequencing of a genomic region of interest comprising one or more target nucleotide sequences, i.e. of the linear chromosome template surrounding a target nucleotide sequence. TLA and other physical proximity protocols include an initial crosslinking step and are based on the concept that, in general, the chance of different fragments being crosslinked correlates inversely with the linear distance, i.e. the frequency of intra-chromosomal crosslinking or crosslinking within a genomic locus is on average always higher than that of sequences from physically distant positions in the linear genome sequence or from other DNA fragments (e.g. different isolated DNA molecules, different chromosomes, episomal copies of a vector etc.). Thus, DNA fragments that ligate to a DNA fragment comprising the target nucleotide sequence are representative of the genomic region of interest comprising the target nucleotide sequence.
The TLA approach involves crosslinking of DNA, fragmenting the crosslinked DNA (e.g. with a restriction enzyme), followed by ligation of the crosslinked DNA fragments. The ligated DNA fragments comprising the target nucleotide sequence, and thus the genomic region of interest, may be enriched, e.g. by PCR. The sequence of the genomic region of interest on the linear chromosome template can subsequently be determined using (high throughput) sequencing technologies well known in the art (see WO2012/005595 and de Vree et al., Nature Biotechnology; 32; 1019-1025; 2014).
Proximity-based sequencing techniques, such as TLA, are thus able to enrich for unknown sequences that result from the single nucleotide and structural genetic variation to be detected in clinically relevant genes. As a result, TLA enables the targeted enrichment and sequencing of any locus or (trans)gene of interest and allows for detection of single nucleotide variants (SNVs) and structural variants, including previously unknown variants. The TLA technology can be applied on cells, HMW DNA and FFPE samples (e.g. FFPE tumour biopsies).
However, because of the randomised re-shuffling of DNA fragments that occurs during the fragmentation and religation steps, such methods result in the generation of multiple essentially unique ligation products. As such, only a small subset of ligation products comprising DNA fragments from a genomic region of interest will also comprise a certain target nucleotide sequence and the subset of the ligation products comprising a target nucleotide sequence will vary in composition. Maximising the enrichment and sequencing of ligation products comprising a target nucleotide sequence and DNA fragments originating from unknown sequences is important for maximising the sequence information generated for the genomic region of interest. In contrast, because each ligation product is essentially unique, any ligation product comprising a potential target nucleotide sequence which is not subject to enrichment - and thus does not contribute to the sequencing output - represents lost sequence information from the genomic region of interest. This is a particular consideration in samples with limited input material, for example FFPE tumour biopsies, as a lower starting amount of input material means that each unique ligation product generated during a TLA protocol represents a greater proportion of the total potential sequence information available for the genomic region of interest.
It is thus important to maximise both the number and proportion of ligation products comprising a target nucleotide sequence of interest that are analysed and sequenced during a proximity-based sequencing method, such as TLA. It is also important to maximise the length of individual, enriched ligation products which are then sequenced.
SUMMARY OF THE INVENTION
In order to increase the amount of sequence information generated for a genomic region of interest using a proximity-based sequencing technique, it is desirable to use multiple target nucleotide sequences within the genomic region. In particular, enrichments utilising multiple target nucleotide sequences enable greater coverage across the genomic region of interest because they allow a greater proportion of essentially unique ligation products comprising unknown sequences from the genomic region of interest to be interrogated.
However, it has been determined that the ligation products generated during proximity-based sequencing techniques, such as TLA, are not optimally compatible with standard techniques used to enrich DNA for multiple target sequences. For instance, the ligation products generated during TLA are not optimally amenable to multiplex PCR, wherein primers directed against multiple target nucleotide sequences would be used to enrich for ligation products comprising those target nucleotide sequences.
In particular, the nature of the randomised reshuffling of DNA fragments that occurs during the initial fragmentation and re-ligation steps of a TLA protocol means that (i) the TLA protocol will result in a mixture of different unique ligation products and (ii) one cannot predict the identity and organisation of individual DNA fragments within the ligation products produced. This is relevant in respect of the presence, position, and orientation of fragments within each essentially unique ligation product.
As such, the use of additional target nucleotide sequences (e.g. in a multiplex PCR approach) can hamper the enrichment of complete ligation products and result in aberrant and/or shorter enrichment products. For example, if two DNA fragments comprising different target nucleotide sequences end up adjacent, and in an inverse orientation, to each other in a ligation product generated following fragmentation and religation; the amplification product generated by two primers directed against those target nucleotide sequences will be subject to a very preferential amplification in view of the short amplicon size from that specific ligation product. The preferential amplification of two adjacent DNA fragments each comprising a target nucleotide will result in minimal de novo sequence information, and the loss of sequence information from the same ligation product comprising the two target nucleotide sequences from which longer amplicons comprising the target nucleotide sequence, and potential unknown sequences, could have been generated just using a single primer for a single target nucleotide sequence.
Similarly, in linear amplification based approaches using universal adapters at either end of ligation products generated in a TLA protocol, the amplicons generated on ligation products that comprise multiple target nucleotide sequences will only consist of the sequence between the universal adapter sequence and the adapter-facing primer that is closest to that adapter sequence. As a result, the rest of the ligation product is lost and not included in the enrichment. An example of a multiplex linear PCR using combinations of primers specific for different target nucleotide sequences and a primer complementary to universal adapter sequences ligated to the ligation products generated is provided in Figure 2.
The inclusion of a primer complementary to the target nucleotide sequence shown in red (*) will result in additional amplicons originating from ligation products that do not contain other target nucleotide sequences (such as the second ligation product shown in the Figure). However, the inclusion of a primer against this target nucleotide sequence will shorten those amplicons resulting from ligation products that contain other primer-target nucleotide sequences that occur at greater physical distance from the universal adapter sequence. This applies to the third ligation product shown in Figure 2. Further, in the exemplary set of ligation products shown in Figure 2, the inclusion of primers against each/either of the yellow (A) and green (°) target nucleotide sequences in the multiplex PCR only results in shorter amplicons and the loss of sequence information from the untargeted DNA fragments of interest (shown in grey (□)) in the first and fourth ligation products, respectively (see Figure 2).
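The amplicon shortening described above can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not part of the described methods: a ligation product is an ordered list of fragments with a single universal adapter after the last fragment, every primer faces that adapter, and the amplicon spans from the primed fragment to the adapter. The fragment labels, lengths and the function name `amplicon_length` are hypothetical.

```python
from typing import List, Optional, Set, Tuple

# A fragment is (target label or None, length in bp); all values are illustrative.
Fragment = Tuple[Optional[str], int]


def amplicon_length(product: List[Fragment], primers: Set[str]) -> int:
    """Length of the amplicon obtained from one ligation product.

    Assumption: the universal adapter sits after the last fragment, and the
    primed fragment closest to the adapter determines the amplicon.
    """
    # Walk from the adapter end towards the far end and stop at the first
    # fragment whose target sequence is covered by a primer in the reaction.
    for i in range(len(product) - 1, -1, -1):
        label, _ = product[i]
        if label in primers:
            return sum(length for _, length in product[i:])
    return 0  # no primer binds, so the product is not enriched


# Hypothetical ligation product: red target, untargeted, yellow target, untargeted.
product = [("red", 300), (None, 500), ("yellow", 200), (None, 400)]

multiplex = amplicon_length(product, {"red", "yellow"})  # both primers in one tube
monoplex = amplicon_length(product, {"red"})             # red primer only
print(multiplex, monoplex)
```

Under these assumptions the multiplex reaction yields only the adapter-proximal portion of the product, whereas the monoplex reaction with the distal primer spans the whole product, including the untargeted fragment that the multiplex reaction loses.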
Given the nature of physical proximity based protocols, these limitations are most likely in enrichments using target nucleotide sequences that originally occurred in relative physical proximity to each other (e.g. within a 1 kb, 2 kb or 10 kb distance).
The present invention is directed to methods for enriching DNA from a genomic region of interest, which methods facilitate the use of multiple target nucleotide sequences within the genomic region of interest during enrichment strategies.
In particular, the present methods comprise a step of performing a non-selective amplification of the ligated DNA products generated from the initial fragmentation and ligation steps of a proximity-based sequencing technique, such as TLA. The non-selective amplification of the ligated DNA products generates multiple copies of each unique ligation product that was present in the initial starting material (see Figure 1).
Subsequent to the non-selective amplification, the amplified, ligated DNA generated is separated into at least a first sample and a second sample. Multiple, separate enrichments are then performed using primers, for example, directed against different (combinations of) target nucleotide sequences. As such, the non-selective (universal) amplification followed by physical separation and separate enrichments enables comprehensive amplification of an increased number of ligation products comprising a greater number of target nucleotide sequences from a genomic region of interest. The present methods thus enable the effective use of multiple target nucleotide sequences in protocols that maximize, for example, the length and number of enriched ligation products.
Accordingly, the present invention enables improved enrichment of DNA fragments which do not comprise a target nucleotide sequence from a genomic region of interest comprising a plurality of target nucleotide sequences. As such, the present invention improves the ability to determine unknown sequences within a genomic region of interest.
Accordingly, in one aspect the present invention provides a method for enriching a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample.
In another aspect the present invention provides a method for enriching DNA fragments which do not comprise a target nucleotide sequence from a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample.
In another aspect, the invention provides a method for making a DNA sequencing library of a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample; h) optionally, determining at least part of the sequence of the enriched DNA generated from each sample in step g), preferably using high throughput sequencing. 
In a further aspect, the invention provides a method for determining the sequence of a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample; i) determining at least part of the sequence of the enriched DNA generated from each sample in step g), preferably using high throughput sequencing.
The application of a non-selective amplification prior to a physical separation and enrichment is particularly advantageous in a proximity-based sequencing technique because each ligation product produced during the initial fragmentation and ligation procedure is essentially unique. Accordingly, performing a non-selective amplification to amplify essentially all ligation products means that sequence information from a specific ligation product will not be lost because that ligation product is subject to an enrichment that is not directed to the target nucleotide sequence it contains or because the ligation product consists of a combination of DNA fragments that results in short amplicons in a multiplex PCR. In contrast, this could easily occur if a physical separation and separated enrichment strategy was followed without an earlier non-selective amplification of essentially all ligation products. This is even more advantageous for samples with limited starting material, for example FFPE tumour biopsies. Potential advantages of the present methods thus include, but are not limited to, the ability to maximise the length of PCR amplicons comprising target nucleotide sequences generated and thus increase the amount of sequence information subsequently retrievable; increasing the number of ligation products that can be meaningfully enriched based on one or more target nucleotide sequences; and increasing the number of distinct PCR amplicons that can be generated from ligation products comprising one or more target nucleotide sequences. In general, the present methods may therefore maximise the enrichment of unknown sequences which are present in ligation products comprising a target nucleotide sequence generated during a proximity-based sequencing library preparation.
The methods also provide advantages in the analysis of samples that contain a low number of copies of a (trans)gene of interest. For instance, viral transgene sequences can integrate and occur in low frequencies. For reasons described above, the method improves the efficiency and completeness with which such transgene sequences and their integration sites can be enriched and sequenced.
The methods of the invention also present advantages in the analysis of samples with heterogeneous / random transgene integration sites.
Heterogeneous / random transgene integrations occur in a wide variety of sample types that occur naturally or are generated in industry. Random transgene integrations for instance occur in a variety of gene therapy products. In such samples, TLA-based detection of integration sites depends on the sequencing of the breakpoint sequence between the transgene genome and the host genome at the position of its integration site. However, given the nature of the TLA protocol, a monoplex PCR specific for a certain target nucleotide sequence will only enrich the subset of DNA fragments that have ended up in the same ligation product. Therefore, only a subset of breakpoint sequences of interest will occur in the ligation products that are enriched with a single transgene specific target nucleotide sequence. A monoplex TLA based enrichment will thus only enable the enrichment and sequencing of a subset of such breakpoint sequences.
Multiplex enrichment is thus required to increase the number of breakpoint sequences that can be sequenced. However, multiplex enrichment based on multiple transgene specific target nucleotide sequences suffers from the drawbacks described above.
Simply separating the initial ligation products generated in step d) without the non-selective amplification of present step e) in order to perform multiple monoplex enrichments will also not increase the yield of breakpoint sequences. This is because each ligation product comprising a breakpoint sequence is unique and will thus only be enriched if it ends up in the sample that is enriched with a transgene specific target nucleotide sequence that also occurs in the same unique ligation product.
The inclusion of present step e) means that copies of each ligation product will occur in each separated sample and that each breakpoint sequence that occurs in a ligation product generated in step c) that comprises at least one of the target nucleotide sequences used in the multiple enrichments will be enriched and sequenced.
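The reasoning above can be made concrete with a small deterministic sketch. It is an illustration under stated assumptions, not the described protocol: each unique ligation product is represented only by the set of target nucleotide sequences it contains, each separated sample receives one monoplex primer set, and, without step e), the single molecule of each product is assigned to exactly one sample by a round-robin rule that stands in for random partitioning. The names `recovered`, `products` and `sample_primers` are hypothetical.

```python
from typing import List, Set


def recovered(products: List[Set[str]],
              sample_primers: List[Set[str]],
              preamplified: bool) -> Set[int]:
    """Indices of unique ligation products enriched in at least one sample."""
    found = set()
    for idx, targets in enumerate(products):
        if preamplified:
            # After non-selective amplification, copies reach every sample.
            samples = range(len(sample_primers))
        else:
            # A single molecule can only end up in one sample
            # (round-robin assignment is an assumption of this sketch).
            samples = [idx % len(sample_primers)]
        for s in samples:
            # Enrichment succeeds if the sample's primers match a target
            # nucleotide sequence present in this ligation product.
            if sample_primers[s] & targets:
                found.add(idx)
    return found


# Six hypothetical unique ligation products and two separated samples.
products = [{"T1"}, {"T1"}, {"T2"}, {"T2"}, {"T1", "T2"}, set()]
sample_primers = [{"T1"}, {"T2"}]

without_step_e = recovered(products, sample_primers, preamplified=False)
with_step_e = recovered(products, sample_primers, preamplified=True)
print(sorted(without_step_e), sorted(with_step_e))
```

In this toy setting, splitting without prior amplification recovers only those products that happen to land in the sample carrying a matching primer, while splitting after amplification recovers every product that contains at least one of the target nucleotide sequences used.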
BRIEF DESCRIPTION OF THE FIGURES
Figure 1: Schematic overview of an illustrative method of the invention
Figure 2: A graphical illustration of the result of a multiplex PCR using primers complementary to four target nucleotide sequences and a universal adaptor sequence. Preferential amplification of products generated by primers directed against two proximal target nucleotide sequences results in the generation of short amplification products with minimal de novo sequence information, and the loss of sequence information from the same ligation product comprising the two target nucleotide sequences from which longer amplicons comprising the target nucleotide sequence, and potential unknown sequences, could have been generated just using a single primer for a single target nucleotide sequence.
Figure 3: A graphical illustration of the result of separate PCR amplifications on the result of the non-selective amplification of the ligation product shown in Figure 2 using primers complementary to four target nucleotide sequences and a universal adaptor sequence.
DETAILED DESCRIPTION OF THE INVENTION
In the following description and examples, a number of terms are used. In order to provide a clear and consistent understanding of the specification and claims, including the scope to be given such terms, the following definitions are provided. Unless otherwise defined herein, all technical and scientific terms used have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
Methods of carrying out the conventional techniques used in methods of the invention will be evident to the skilled worker. The practice of conventional techniques in molecular biology, biochemistry, computational chemistry, cell culture, recombinant DNA, bioinformatics, genomics, sequencing and related fields is well known to those of skill in the art and is discussed, for example, in the following literature references: Sambrook et al., Molecular Cloning. A Laboratory Manual, 2nd Edition, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., 1989; Ausubel et al., Current Protocols in Molecular Biology, John Wiley & Sons, New York, 1987 and periodic updates; and the series Methods in Enzymology, Academic Press, San Diego.
As used herein, the singular forms "a," "an" and "the" include plural referents unless the context clearly dictates otherwise. For example, a method for isolating "a" DNA molecule, as used above, includes isolating a plurality of molecules (e.g. 10's, 100's, 1000's, 10's of thousands, 100's of thousands, millions, or more molecules).
As used herein, the term “nucleic acid” may include any polymer or oligomer of pyrimidine and purine bases, preferably cytosine, thymine, and uracil, and adenine and guanine, respectively (see Albert L. Lehninger, Principles of Biochemistry, at 793-800 (Worth Pub. 1982) which is herein incorporated by reference in its entirety for all purposes). The present invention contemplates any deoxyribonucleotide, ribonucleotide or peptide nucleic acid component, and any chemical variants thereof, such as methylated, hydroxymethylated or glycosylated forms of these bases, and the like. The polymers or oligomers may be heterogeneous or homogenous in composition, and may be isolated from naturally occurring sources or may be artificially or synthetically produced. In addition, the nucleic acids may be DNA or RNA, or a mixture thereof, and may exist permanently or transitionally in single-stranded or double-stranded form, including homoduplex, heteroduplex, and hybrid states.
As used herein, by the terms “aligning” and “alignment” is meant the comparison of two or more nucleotide sequences based on the presence of short or long stretches of identical or similar nucleotides. Methods and computer programs for alignment are well known in the art. One computer program which may be used or adapted for aligning is "Align 2", authored by Genentech, Inc., which was filed with user documentation in the United States Copyright Office, Washington, D.C. 20559, on Dec. 10, 1991.
Method for enriching a genomic region of interest
In one aspect, the present invention provides a method for enriching a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample.
As used herein, the term “genomic region of interest” refers to a DNA sequence of an organism of which it is desirable to determine at least part of the DNA sequence. For instance, a genomic region which comprises, or is suspected of comprising, an allele associated with a disease may be a genomic region of interest. Another example is a genomic region which comprises a vector insertion site. In this case, the whole genome sequence may be determined, suitably following deselection of episomal copies of the vector / transgene sequence. Thus, in some embodiments, the genomic region of interest is the whole genome.
As used herein, the term “target nucleotide sequence” refers to a DNA sequence of interest within a genomic region of interest. For example, the target nucleotide sequence may be a transgene or a portion thereof. Suitably, the target nucleotide sequence may be an allele or a portion thereof. The target nucleotide sequence is used in the enrichment steps described herein.
The present methods comprise enrichment for a genomic region of interest comprising a plurality of target nucleotide sequences. As such, the present methods may comprise performing an enrichment for at least two, at least three, at least four, at least five, at least eight or at least ten target nucleotide sequences.
By fragmenting a sample of crosslinked DNA, the DNA fragments that originate from a genomic region of interest remain in proximity of each other because they are crosslinked. When these crosslinked DNA fragments are subsequently ligated, DNA fragments of the genomic region of interest, which are in the proximity of each other due to the crosslinks, are ligated. This type of ligation is also referred to as proximity ligation. DNA fragments comprising the target nucleotide sequence may ligate with DNA fragments within a large linear distance at the sequence level. By determining (at least part of) the sequence of ligation products that comprise the fragment comprising the target nucleotide sequence, sequences of DNA fragments within the spatial surrounding of the genomic region of interest are obtained. Each individual target nucleotide sequence is likely to be crosslinked to multiple other DNA fragments. As a consequence, often more than one DNA fragment may be ligated to a fragment comprising the target nucleotide sequence and, in a sample comprising multiple copies of a genomic region of interest, each individual DNA fragment comprising the target nucleotide sequence may ligate to different combinations of DNA fragments originating from the genomic region of interest. By combining (partial) sequences of the (amplified) ligation products in which DNA fragments were ligated with a fragment comprising the target nucleotide sequence, a sequence of the genomic region of interest may be built. A DNA fragment ligated with the fragment comprising the target nucleotide sequence includes any fragment which may be present in ligation products.
As used herein, the term “ligation product” means a DNA sequence which is generated by ligating DNA fragments together. Thus, a ligation product comprises at least two DNA fragments. In the context of the present invention, the DNA fragments which are subsequently ligated have been produced by a previous fragmentation step.
Methods are known in the art that involve crosslinking DNA, as well as fragmenting and ligating the DNA fragments (e.g. WO 2007/004057, WO 2012/055595 and de Vree et al., Nature Biotechnology; 32; 1019-1025; 2014). Thus, approaches for performing steps a)-d) of the methods of the invention are known.
The methods of the invention have the advantages that extensive sequence information is not required to focus on the genomic region of interest and the method is not sequence-biased (i.e. bias by using oligonucleotides and/or probes which cover the transgene of interest, allelic sequence of interest, or flanking sequences surrounding the sequence of interest, is avoided).
In some embodiments, the methods of the invention may be used in the analysis of the 3D folding of regions of interest. Methods for the analysis of the 3D folding of regions of interest are known in the art (see, for example, Sungalee et al (2021) Nature Genetics 53: 650-662). Thus, the methods of the invention can be applied to the analysis of the 3D folding of regions of interest.
Sample of crosslinked DNA
In step a) a sample of crosslinked DNA is provided.
As used herein, the sample may be obtained from an organism or from a tissue of an organism, or from tissue and/or cell culture, which comprises DNA. The sample DNA may be obtained from any type of organism, e.g. micro-organisms, viruses, plants, fungi, animals, humans and bacteria, or combinations thereof. For example, a tissue sample from a human patient suspected of a bacterial and/or viral infection may comprise human cells, but also viruses and/or bacteria. The sample may comprise cells and/or cell nuclei. Suitably, the sample may comprise or consist of isolated DNA.
In some embodiments, the sample DNA is from a patient or a person who may be at risk of, suspected of having, or has a particular disease, for example cancer, a viral infection (e.g. HIV-1) or any other condition which warrants the investigation of their DNA.
In some embodiments, the sample DNA is from a patient or a person who is undergoing or has undergone gene therapy, for example using a lentiviral vector. Suitably, the sample DNA is from a cell culture that has been subject to transfection or transduction with a transgene, for example using a lentiviral vector.
Samples may be taken from a patient and/or from diseased tissue, and may also be derived from other organisms or from separate sections of the same organism, such as samples from one patient, one sample from healthy tissue and one sample from diseased tissue. Samples may thus be analysed according to the invention and compared with a reference sample, or different samples may be analysed and compared with each other. For example, for a patient suspected of having cancer, a biopsy may be obtained from the suspected tumour. Another biopsy may be obtained from non-diseased tissue. Both tissue biopsies may be analysed according to the invention. Genomic regions of interest may be those containing a gene associated with the cancer type (e.g. the BRCA1 and BRCA2 genes, which are 83 and 86 kb long, respectively (reviewed in Mazoyer, 2005, Human Mutation 25:415-422), for suspected breast cancer). By determining the sequence of the genomic region of interest according to the invention and comparing the sequences of the genomic region from the different biopsies with each other and/or with a reference gene sequence (e.g. a reference BRCA gene sequence), genetic mutations may be found that will assist in diagnosing the patient and/or determining treatment of the patient and/or predicting prognosis of disease progression. Suitably, the sample may be a formalin-crosslinked sample. Suitably, the sample may be a paraffin embedded sample. In particular, the sample may be a Formalin-Fixed Paraffin-Embedded (FFPE) sample.
The sample may be a tissue sample. The sample may be a tumour sample.
Suitably, the sample may be a FFPE tumour sample.
The sample may be a slice or a puncture from a FFPE sample.
As used herein, the term “crosslinking” means reacting DNA at two different positions, such that these two different positions may be connected. The connection between the two different positions may be direct, forming a covalent bond between DNA strands. Two DNA strands may be crosslinked directly using UV-irradiation, forming covalent bonds directly between DNA strands. The connection between the two different positions may be indirect, via an agent, e.g. a crosslinker molecule. A first DNA section may be connected to a first reactive group of a crosslinker molecule comprising two reactive groups, and the second reactive group of the crosslinker molecule may be connected to a second DNA section, thereby crosslinking the first and second DNA section indirectly via the crosslinker molecule. A crosslink may also be formed indirectly between two DNA strands via more than one molecule. For example, a typical crosslinker molecule that may be used is formaldehyde. Formaldehyde induces protein-protein and DNA-protein crosslinks. Formaldehyde thus may crosslink different DNA strands to each other via their associated proteins. For example, formaldehyde can react with a protein and DNA, connecting a protein and DNA via the crosslinker molecule.
Hence, two DNA sections may be crosslinked using formaldehyde, forming a connection between a first DNA section (DNA1) and a protein; the protein may form a second connection with another formaldehyde molecule that connects to a second DNA section (DNA2), thus forming a crosslink which may be depicted as DNA1-crosslinker-protein-crosslinker-DNA2. In any case, it is understood that crosslinking according to the invention involves forming connections (directly or indirectly) between strands of DNA that are in physical proximity of each other. DNA strands may be in physical proximity of each other in the cell, as DNA is highly organised, while being separated from a linear sequence point of view e.g. by 100 kb. As long as the crosslinking method is compatible with subsequent fragmenting and ligation steps, such crosslinking may be contemplated for the purpose of the invention.
As used herein, the term a “sample of crosslinked DNA” refers to sample DNA which has been subjected to crosslinking. Crosslinking the sample DNA has the effect that the three-dimensional state of the DNA within the sample remains largely intact. This way, DNA strands that are in physical proximity of each other remain in each other’s vicinity. Thus, crosslinking the sample DNA as it is present in the sample results in largely maintaining the three-dimensional architecture of the DNA.
Fragmenting crosslinked DNA
The sample of crosslinked DNA is fragmented in step b). By fragmenting the crosslinked DNA, DNA fragments are produced which are held together by the crosslinks.
As used herein, the term “fragmenting DNA” includes any technique that, when applied to the DNA, results in DNA fragments. Techniques well known in the art are sonication, shearing and/or enzymatic restriction, but other techniques can also be envisaged. Fragmenting techniques may result in random fragmentation of the DNA (e.g. sonication or shearing). Suitably, the fragmenting technique may result in non-random (i.e. targeted) fragmentation of the DNA (e.g. restriction enzymes or site-directed nucleases). Where a given step of the methods of the invention specifically requires random or non-random fragmentation, this is specified.
By “random fragmentation” of the DNA it is meant that the fragmenting technique results in DNA fragments with unknown end sequences. As an example, sonication results in the fragmenting of DNA at random sites, producing ends which can be either blunt or can have 3’- or 5’-overhangs. As these DNA breakage points occur randomly, the DNA may be repaired (enzymatically), filling in possible 3’- or 5’-overhangs, such that DNA fragments are obtained which have blunt ends that allow ligation of the fragments to adaptors and/or to each other in a subsequent step. Alternatively, the overhangs may also be made blunt ended by removing overhanging nucleotides, using e.g. exonucleases.
The fragmenting step b) may comprise sonication, and may be followed by enzymatic DNA end repair. Sonication results in the fragmenting of DNA at random sites, producing ends which can be either blunt or can have 3’- or 5’-overhangs. As these DNA breakage points occur randomly, the DNA may be repaired (enzymatically), filling in possible 3’- or 5’-overhangs, such that DNA fragments are obtained which have blunt ends that allow ligation of the fragments to adaptors and/or to each other in the subsequent step c). Alternatively, the overhangs may also be made blunt ended by removing overhanging nucleotides, using e.g. exonucleases. Suitably, the fragmenting step b) may be performed using S1 nuclease to generate blunt ended fragments. Suitably, the fragmenting step b) may be performed using DNase I.
By “non-random fragmentation” of the DNA it is meant that the fragmenting technique results in DNA fragments with known end sequences, i.e. that the fragmenting is targeted. Suitably, non-random fragmentation involves fragmenting at a specific recognition sequence. Suitably, the non-random fragmentation of the DNA may be performed using a site-directed nuclease or a restriction enzyme which targets a specific recognition sequence. In the context of the use of a restriction enzyme, the specific recognition sequence is also referred to herein as a “restriction enzyme site” or “restriction site”.
As used herein, the term “recognition sequence” means a specific nucleotide sequence which is recognised by a fragmenting technique (e.g. a site-directed nuclease or restriction enzyme) and directs cleavage of the DNA molecule at or near the recognition sequence. The specific nucleotide sequence which is recognized may determine the frequency of cleaving, e.g. a nucleotide sequence of 6 nucleotides occurs on average every 4096 nucleotides, whereas a nucleotide sequence of 4 nucleotides occurs much more frequently, on average every 256 nucleotides.
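The frequencies quoted above follow from a simple calculation, sketched here under the idealising assumption (not stated in the text) that the four bases occur independently and with equal probability, so that a specific sequence of n nucleotides is expected once every 4^n positions. The function names are hypothetical.

```python
def expected_site_spacing(site_length: int) -> int:
    """Average distance in nucleotides between occurrences of a specific
    recognition sequence, assuming independent, equiprobable bases."""
    return 4 ** site_length


def expected_site_count(dna_length: int, site_length: int) -> float:
    """Expected number of recognition sequences in DNA of the given length
    under the same assumption."""
    return dna_length / expected_site_spacing(site_length)


print(expected_site_spacing(4))          # four-nucleotide recognition sequence
print(expected_site_spacing(6))          # six-nucleotide recognition sequence
print(expected_site_count(1_000_000, 4)) # expected sites in 1 Mb of DNA
```

Real genomes deviate from this model because base composition is uneven and sequence motifs are not independent, so the figures are averages for orientation only.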
In some embodiments, the fragmenting step b) comprises fragmenting with a restriction enzyme. The fragmenting step b) may comprise fragmenting with one or more restriction enzymes, or combinations thereof.
In some embodiments, the recognition sequence of fragmenting step b) is of 4 to 12 nucleotides in length, preferably of 4 to 8 nucleotides in length, more preferably 4 to 6 nucleotides in length.
In one embodiment, the recognition sequence of fragmenting step b) is of 3, 4, 5, 6, 7, 8, 9, 10, 11 or 12 nucleotides in length.
As used herein, the terms “restriction endonuclease” and “restriction enzyme” are used interchangeably to mean an enzyme that recognizes a specific nucleotide sequence (i.e. recognition sequence) in a double-stranded DNA molecule, and will cleave both strands of the DNA molecule at or near every recognition sequence, leaving a blunt end or a 3’- or 5’- overhanging end.
In some embodiments, the fragmenting step b) comprises fragmenting with a site-directed nuclease, preferably a clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR associated protein (Cas) nuclease.
As used herein, the term “site-directed nuclease” means a DNA-cutting enzyme (nuclease) which is directed to recognize a predetermined specific nucleotide sequence (i.e. recognition sequence) and to cleave both strands of the DNA molecule at or near every recognition sequence. The site-directed nuclease may be engineered to target a desired recognition sequence. Suitably, the site-directed nuclease may be a zinc-finger nuclease (ZFN), transcription activator-like effector nuclease (TALEN), Argonaute (Ago) protein (Song et al., Nucleic Acids Research; 2020; 48(4); e19) or a CRISPR-Cas nuclease (e.g. Cas9). Preferably, the site-directed nuclease is a CRISPR-Cas nuclease.
In some embodiments, step b) comprises fragmenting the crosslinked DNA of step a) by nonrandom fragmentation of the DNA at a recognition sequence using (synthetic) Cas9 or Cas12a.
Methods for manipulating a gRNA sequence to direct a CRISPR-Cas nuclease to target a desired DNA sequence for cleavage are known in the art (see, for example, Jinek M, et al. (2012) Science 337: 816-821; and Kim H et al., (2017) Nature Communications 8: 14406). Thus, designing a suitable gRNA and directing a CRISPR-Cas nuclease to target a desired DNA sequence for cleavage is within the ambit of the skilled person.
The methods of the invention may also be performed using DNA fragmentation techniques that preferentially fragment either methylated DNA or unmethylated DNA and may be used for the targeted sequencing of either methylated or unmethylated alleles of genomic regions of interest, respectively. For example, certain restriction enzymes preferentially fragment methylated DNA (as compared to unmethylated DNA) whilst other restriction enzymes preferentially fragment unmethylated DNA (as compared to methylated DNA).
Thus, the methods of the invention are applicable to the sequencing of alleles in which known sequences are either methylated or unmethylated. For example, the promoter sequences of actively transcribed genes are typically unmethylated, whereas the corresponding gene body sequences can contain enriched levels of methylation. Thus, the digestion of unmethylated DNA or methylated DNA will result in the deselection of either the promoter or corresponding gene body sequence, respectively. Accordingly, the methods of the invention permit the enrichment and sequencing of the promoter or corresponding gene body sequence.
In conventional DNA methylation analyses bisulphite treatment is used. Treatment of DNA with bisulphite converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected. Such analyses do not enable the selective sequencing of alleles in which target nucleotide sequences have or have not been methylated. In addition, these approaches do not provide information regarding which genetic variants occur in methylated alleles of interest or unmethylated alleles of interest beyond the length of the sequencing reads generated in the sequencing analysis. Suitably, the methods of the invention may be used in combination with bisulfite treatment. In this way, the methods may be used for the sequencing and quantification of epigenetic changes in alleles in which known sequences are either methylated or unmethylated.
Examples of nucleases that can be used in such analyses are site-specific methyl-directed (MD) DNA endonucleases (e.g. GlaI). These enzymes recognise and cleave methylated DNA sequences only and do not cleave unmethylated DNA sequences (Tarasova et al. (2008) BMC Mol. Biol. 9: 7). Suitably, a restriction enzyme or site-directed nuclease may be used.
Examples of methylation-sensitive restriction enzymes that fragment unmethylated DNA include:
• DpnI and DpnII for N6-methyladenine detection within GATC recognition site; and
• HpaII and MspI for C5-methylcytosine detection within CCGG recognition site.
In the application of the methods of the invention to the analysis of the 3D folding of regions of interest, nucleases which preferentially fragment methylated and/or unmethylated DNA may be used.
The fragmenting step b) may comprise fragmenting with one or more site-directed nucleases, or combinations thereof.
Fragmenting with a restriction enzyme or site-directed nuclease is advantageous as it may allow greater control of the average fragment size. The fragments that are formed may have compatible overhangs or blunt ends that allow ligation of the fragments in the subsequent step c).
Furthermore, when dividing a sample of cross-linked DNA into a plurality of subsamples, for each subsample restriction enzymes or site-directed nucleases with different recognition sites may be used. This is advantageous because by using different restriction enzymes or site-directed nucleases having different recognition sites, different DNA fragments can be obtained from each subsample.
Accordingly, in some embodiments, the fragmenting step b) comprises fragmenting a plurality of subsamples, each subsample having a different recognition sequence.
Ligation
In the next step c), the fragments are ligated. As used herein, “ligating” involves the joining of separate DNA fragments. The DNA fragments may be blunt ended, or may have compatible overhangs (also termed sticky overhangs or sticky ends) such that the overhangs can hybridise with each other. The joining of the DNA fragments may be enzymatic, with a ligase enzyme, DNA ligase. However, a non-enzymatic ligation may also be used, as long as DNA fragments are joined, i.e. forming a covalent bond. Typically a phosphodiester bond between the hydroxyl and phosphate group of the separate strands is formed.
Since a fragment comprising a target nucleotide sequence may be crosslinked to multiple other DNA fragments, more than one DNA fragment may be ligated to the fragment comprising the target nucleotide sequence. This may result in combinations of DNA fragments which are in proximity of each other as they are held together by the cross links. Different combinations and/or order of the DNA fragments in ligated DNA fragments may be formed. In case the DNA fragments are obtained via enzymatic restriction or using a site-directed nuclease, the recognition sequence of the restriction enzyme or site-directed nuclease is known. This makes it possible to identify the individual fragments, because remains of recognition sequences, or reconstituted recognition sequences, may indicate the separation between different DNA fragments.
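The use of a known recognition sequence to locate boundaries between fragments in a ligated product can be sketched as below. This is a simplified illustration only: it assumes that every junction carries a fully reconstituted recognition sequence and that the sequence does not also occur uncut within a fragment, neither of which is guaranteed in practice. The function names and the example sequence are hypothetical.

```python
import re


def junction_positions(sequence: str, site: str) -> list[int]:
    """Return the 0-based start position of every occurrence of `site`
    (overlapping occurrences included) in `sequence`."""
    pattern = re.compile(f"(?={re.escape(site.upper())})")
    return [match.start() for match in pattern.finditer(sequence.upper())]


def split_at_sites(sequence: str, site: str) -> list[str]:
    """Split a ligated-product sequence into candidate fragments at each
    occurrence of `site`; the site itself is removed and empty pieces
    are dropped."""
    pieces = sequence.upper().split(site.upper())
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    # Hypothetical ligated product with two GATC junctions.
    product_seq = "AAACCC" + "GATC" + "TTTAAA" + "GATC" + "CCAAA"
    print(junction_positions(product_seq, "GATC"))
    print(split_at_sites(product_seq, "GATC"))
```

In a real analysis the candidate fragments would typically be mapped to a reference sequence to confirm that adjacent pieces originate from different genomic positions.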
Irrespective of what fragmenting method is used, the ligation step c) may be performed in the presence of an adaptor, ligating adaptor sequences in between fragments. Alternatively, the adaptor may be ligated in a separate step. This is advantageous because the different fragments can be easily identified by identifying the adaptor sequences which are located in between the fragments. For example, in case DNA fragment ends were blunt ended, the adaptor sequence would be adjacent to each of the DNA fragment ends, indicating the boundary between separate DNA fragments.
As used herein, the term “adaptor” refers to a short double-stranded oligonucleotide molecule with a limited number of base pairs, e.g. about 10 to about 30 base pairs in length, which is designed such that it can be ligated to the ends of fragments. Adaptors are generally composed of two synthetic oligonucleotides which have nucleotide sequences which are partially complementary to each other. When mixing the two synthetic oligonucleotides in solution under appropriate conditions, they will anneal to each other forming a double-stranded structure. After annealing, one end of the adaptor molecule may be designed such that it is compatible with the end of a restriction fragment and can be ligated thereto; the other end of the adaptor can be designed so that it cannot be ligated, but this need not be the case, for instance when an adaptor is to be ligated in between DNA fragments.
Reversing the crosslinking
Next, the crosslinking is reversed in step d), which results in a pool of ligated DNA fragments that comprise two or more fragments. A subpopulation of the pool of ligated DNA fragments comprises a DNA fragment which comprises the target nucleotide sequence. By reversing the crosslinking, the structural/spatial fixation of the DNA is released and the DNA sequence becomes available for subsequent steps, e.g. amplification and/or sequencing, as crosslinked DNA may not be a suitable substrate for such steps.
As used herein, “reversing crosslinking” comprises breaking the crosslinks such that the DNA that has been crosslinked is no longer crosslinked and is suitable for subsequent amplification and/or sequencing steps. For example, performing a proteinase K treatment on a sample DNA that has been crosslinked with formaldehyde will digest the protein present in the sample. Because the crosslinked DNA is connected indirectly via protein, the protease treatment in itself may reverse the crosslinking between the DNA. However, the protein fragments that remain connected to the DNA may hamper subsequent sequencing and/or amplification. Hence, reversing the connections between the DNA and the protein may also result in “reversing crosslinking”. The DNA-crosslinker-protein connection may be reversed through a heating step, for example by incubating at 70°C. Since large amounts of protein are present in a sample DNA, it is often desirable to digest the protein with a protease in addition. Hence, any “reversing crosslinking” method may be contemplated wherein the DNA strands that are connected in a crosslinked sample become suitable for sequencing and/or amplification.
Non-selective amplification
The present methods require performing a non-selective amplification of the ligated DNA products generated from the initial fragmentation and ligation steps of a proximity-based sequencing technique, such as TLA.
In particular, step e) of the present methods comprises a non-selective amplification of the ligated DNA generated in step d).
As used herein, a “non-selective” amplification may also be referred to as a “universal”, “nonspecific” or “whole genome” amplification.
Thus a “non-selective” amplification refers to the use of an amplification technique which generates an amplified product that is completely representative of the initial starting material. Thus, when the non-selective amplification step is applied to a mixture of unique ligation products resulting from the conventional TLA protocol, each unique ligation product is expected to be amplified and multiple copies generated.
As such, the non-selective amplification of the ligation products generated in step d) of the present methods generates multiple copies of each unique ligation product that was present in the initial starting material.
Suitable non-selective amplification methods, for example methods for amplifying the whole genome, are known in the art (see, for example, Kittler et al. (2002) Analytical Biochemistry 300: 237-244; Hard et al. (2021) bioRxiv 439527; doi: https://doi.org/10.1101/2021.04.13.439527; Telenius et al. (1992) Genomics 13(3) 718-725; WO2019/148119; Qiagen REPLI-g FFPE Kit (Cat. No. / ID: 150243); Langmore (2002) Pharmacogenomics 3: 557-560).
The non-selective amplification may be multiple strand displacement (e.g. as described in Telenius et al. - as above). The non-selective amplification may be performed using a Qiagen REPLI-g FFPE Kit.
Suitably, the non-specific amplification may increase the amount of DNA in the starting material by at least 4-, 5-, 10-, 20-, 50-, 100- or 1000-fold.
Suitably, in embodiments of the present invention, the non-selective amplification step may be optional. That is, the non-selective amplification step may be omitted. As such, all features and embodiments of the methods described herein may be applied to corresponding methods in which the non-selective amplification step is not performed. By way of example, the non-selective amplification may not be required if sufficient input material is available, but physical separation of the ligation products in order to perform separate enrichments with different primers specific for a genomic region of interest will still present quality advantages in terms of the length of amplicons generated, for example.
Separation into at least a first and a second sample
Following the non-selective amplification of step e), the amplified, ligated DNA generated is physically separated into at least a first sample and a second sample in step f).
The method may comprise separating the amplified, ligated DNA into any number of separate samples - as required. For example, the method may comprise separating the amplified, ligated DNA generated in step e) into at least 3, at least 4, at least 5, at least 8 or at least 10 sub-samples. The physical separation of step f) may be performed such that at least one copy of each essentially unique ligation product (which will have been multiplied during the non-specific amplification step) is present in each separate sample generated in step f). As such, in the subsequent enrichment of step g) any ligation product comprising a particular target nucleotide sequence to be enriched for should be present in each sample. The present methods thus minimize the loss of sequence information due to a particular ligation product not being present in a sample in which an enrichment using a given target nucleotide sequence is performed.
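Why a prior amplification helps each separated sample retain a copy of every unique ligation product can be illustrated with a toy probability model. The model is an assumption made purely for illustration and is not taken from the method itself: it treats each of the `copies` copies of a ligation product as being assigned independently and uniformly at random to one of `samples` equally sized samples. The function name is hypothetical.

```python
from math import comb


def prob_present_in_every_sample(copies: int, samples: int) -> float:
    """Probability that each of `samples` samples receives at least one of
    `copies` copies, under independent uniform random assignment
    (inclusion-exclusion over the set of samples left empty)."""
    if copies < 0 or samples < 1:
        raise ValueError("copies must be >= 0 and samples must be >= 1")
    return sum(
        (-1) ** j * comb(samples, j) * (1 - j / samples) ** copies
        for j in range(samples + 1)
    )


if __name__ == "__main__":
    # Compare a few fold-amplification values for a split into 5 samples.
    for copies in (1, 5, 20, 100):
        p = prob_present_in_every_sample(copies, samples=5)
        print(f"{copies:>4} copies -> P(present in all 5 samples) = {p:.4f}")
```

Under this model a single copy can only ever reach one sample, whereas the probability of full representation approaches 1 as the copy number grows. Real separations are not perfectly random or uniform, so the numbers are indicative only.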
The size - or amount of amplified, ligated DNA - in each separated sample is not particularly limiting and may be determined based on the specifics of the method to be performed.
Enriching
Next, the method comprises enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample.
Suitably the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample. Suitably, where further separated samples are generated in step f), each is enriched for at least one target nucleotide sequence which is not enriched for in any other sample generated in step f).
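The condition that each separated sample is enriched for at least one target nucleotide sequence not enriched for in any other sample can be checked with a short sketch. The sample names, target labels and function names below are hypothetical, and the check implements the stricter "each sample" variant described above rather than only the first-versus-second requirement.

```python
def unique_targets_per_sample(design: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each sample, return the targets enriched there and nowhere else."""
    unique: dict[str, set[str]] = {}
    for sample, targets in design.items():
        # Union of the targets assigned to all other samples.
        others = set().union(
            *(t for name, t in design.items() if name != sample)
        )
        unique[sample] = targets - others
    return unique


def is_valid_design(design: dict[str, set[str]]) -> bool:
    """True if there are at least two samples and every sample has at least
    one target that is not enriched for in any other sample."""
    if len(design) < 2:
        return False
    return all(unique_targets_per_sample(design).values())


if __name__ == "__main__":
    # Hypothetical design: targets A-D spread over three samples.
    design = {
        "sample_1": {"A", "B"},
        "sample_2": {"C"},
        "sample_3": {"B", "D"},
    }
    print(unique_targets_per_sample(design))
    print(is_valid_design(design))
```

A design in which one sample's targets are a subset of another's, such as {"A", "B"} and {"A"}, fails the check because the second sample has no target of its own.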
Performing multiple, separate enrichments directed against different (combinations of) target nucleotide sequences enables comprehensive enrichment of an increased number of ligation products comprising a greater number of target nucleotide sequences from a genomic region of interest. Without wishing to be bound by theory, the present methods thus enable the effective use of multiple target nucleotide sequences in protocols that maximize, for example, the length and number of enriched ligation products.
As used herein, the term “enriching” or “enrichment” for “DNA comprising the at least one of the plurality of target nucleotide sequences” means a process by which the (absolute) amount and/or proportion of the DNA comprising the target nucleotide sequence is increased compared to the amount and/or proportion of DNA comprising the target nucleotide sequence in the starting material (i.e. in the initial separated sample generated in step f)). In this regard, enrichment by amplification increases the amount and proportion of DNA comprising the target nucleotide sequence. Both enrichment by degradation and capture-based enrichment increase the proportion of DNA comprising the target nucleotide sequence. The methods of the invention are compatible with a wide variety of enrichment approaches. Suitable enrichment methods include, but are not limited to, PCR amplification, capture-based enrichment and/or site-directed nuclease digestion.
Suitably, enrichment step g) may comprise performing a singleplex enrichment in one or more of the separate samples generated in step f). A singleplex enrichment refers to an enrichment using a single target nucleotide sequence of the plurality of target nucleotide sequences in a sample generated in step f). Suitably, enrichment step g) may comprise performing a singleplex enrichment using a single target nucleotide sequence of the plurality of target nucleotide sequences in each separate sample generated in step f). For example, one enrichment corresponding to 1 of 5 target nucleotide sequences may be used individually in one sample generated in step f) (e.g. A, B, C, D or E). Alternatively, one of the 5 target nucleotide sequences may be used individually in each of a set of 5 separate samples generated in step f) (e.g. A, B, C, D and E are each individually enriched in a separate sample).
Suitably, enrichment step g) may comprise performing a multiplex enrichment using at least two target nucleotide sequences of the plurality of target nucleotide sequences in one or more of the separate samples generated in step f). The term “multiplex” is generally used herein to refer to an enrichment strategy in which at least two target nucleotide sequences are used to enrich for ligation products comprising those target nucleotide sequences in a single sample.
For example, an enrichment using 2 of 5 target nucleotide sequences may be used in combination in at least one sample generated in step f) (e.g. A and B, A and C, A and D or A and E etc.). Alternatively, an enrichment using 3 of 5 target nucleotide sequences may be used in combination in at least one sample generated in step f) (e.g. A and B and C; A and C and D; or A and D and E etc.).
Suitably, enrichment step g) may comprise performing a multiplex enrichment using at least two target nucleotide sequences of the plurality of target nucleotide sequences in each of the separate samples generated in step f).
Suitably, enrichment step g) may comprise performing a multiplex enrichment using at least two, at least three, at least four, at least five, at least eight or at least ten target nucleotide sequences in one or more of the separate samples generated in step f).
The target nucleotide sequences used in separate enrichments may originate from genomic positions within a 1 kb, 2 kb or 10 kb physical distance in the linear genomic sequence. In embodiments where a non-random fragmentation is performed in step b) (e.g. using a restriction enzyme) the target nucleotide sequences used in separate enrichments may originate from adjacent restriction fragments generated in the non-random fragmentation.
Enrichment step g) may comprise a PCR amplification using primers against one or more target nucleotide sequence as described herein.
Enrichment step g) may comprise target nucleotide sequences originating from both DNA strands. PCR amplifications may therefore comprise primers against both DNA strands.
As used herein, the terms “oligonucleotide primers” or “primers” are used interchangeably, in general, to refer to strands of nucleotides which can prime the synthesis of DNA. DNA polymerase cannot synthesize DNA de novo without primers. A primer hybridises to the DNA, i.e. base pairs are formed. Nucleotides that can form base pairs, that are complementary to one another, are e.g. cytosine and guanine, thymine and adenine, adenine and uracil, guanine and uracil. The complementarity between the primer and the existing DNA strand does not have to be 100%, i.e. not all bases of a primer need to base pair with the existing DNA strand. From the 3’-end of a primer hybridised with the existing DNA strand, nucleotides are incorporated using the existing strand as a template (template directed DNA synthesis). The synthetic oligonucleotide molecules which are used in an amplification reaction may be referred to as “primers”.
As used herein, the term “amplifying” refers to a polynucleotide amplification reaction, namely, a population of polynucleotides that are replicated from one or more starting sequences. Amplifying may refer to a variety of amplification reactions, including but not limited to polymerase chain reaction (PCR), linear polymerase reactions, nucleic acid sequence-based amplification, rolling circle amplification and like reactions.
By way of example, the amplified, ligated DNA may be circularised after step e) and prior to step f) or step g); and step g) may be performed by PCR enrichment using inverse primer pairs, wherein at least one of the primers of the inverse primer pair is specific for the target nucleotide sequence. Preferably, each primer of the inverse primer pair is specific for the target nucleotide sequence.
Accordingly, the present method may comprise the following steps: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); e’) circularizing the amplified, ligated DNA generated in step e); f) separating the circularized DNA generated in step e’) into at least a first sample and a second sample; g) enriching at least the first and the second samples of circularized DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences using inverse primer pairs; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample.
Alternatively, the present method may comprise the following steps: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; f’) circularizing the amplified, ligated DNA of at least the first and the second samples generated in step f); g) enriching at least the first and the second samples of circularized DNA generated in step f’) for DNA comprising at least one of the plurality of target nucleotide sequences using inverse primer pairs; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample.
Alternatively, universal adapters may be ligated to the amplified, ligated DNA after step e) and prior to step f) or step g); and step g) may be performed by PCR enrichment with primer pairs complementary to the target nucleotide sequence and the universal adapter.
Accordingly, the present method may comprise the following steps: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); e’) ligating universal adaptors to the amplified, ligated DNA generated in step e); f) separating the amplified, ligated DNA generated in step e’) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences using primer pairs complementary to the target nucleotide sequence and the universal adapter; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample.
Alternatively, the present method may comprise the following steps: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; f’) ligating universal adaptors to the amplified, ligated DNA generated in step f); g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f’) for DNA comprising at least one of the plurality of target nucleotide sequences using primer pairs complementary to the target nucleotide sequence and the universal adapter; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample. In an alternative embodiment, universal adapters may be ligated to the ligated DNA generated in step d) (i.e. prior to the non-selective amplification performed in step e)).
In any embodiment wherein universal adapters are ligated to the DNA prior to enrichment step g), universal primers targeting the universal adapters can be used in combination with a target sequence specific primer in a PCR-based enrichment in step g).
The present methods have particular advantages when used in the context of PCR-based enrichment strategies. For example, potential advantages include, but are not limited to, the ability to maximise the length of PCR amplicons comprising target nucleotide sequences generated and thus increase the amount of sequence information subsequently retrievable; increasing the number of ligation products that can be meaningfully enriched based on one or more target nucleotide sequences; and increasing the number of distinct PCR amplicons that can be generated from ligation products comprising one or more target nucleotide sequences.
In some embodiments, an identifier is included in the at least one primer.
As used herein, the term “identifier” refers to a short sequence that can be added to an adaptor or a primer or included in its sequence or otherwise used as label to provide a unique identifier. Such a sequence identifier (or tag) can be a unique base sequence of varying but defined length, typically from 4-16 bp, used for identifying a specific nucleic acid sample. For instance, 4 bp tags allow 4^4 = 256 different tags. Typical examples are ZIP sequences, known in the art as commonly used tags for unique detection by hybridization (Iannone et al. Cytometry 39:131-140, 2000). Identifiers are useful according to the invention, as by using such an identifier, the origin of a sample (e.g. a PCR sample) can be determined upon further processing. In the case of combining processed products originating from different nucleic acid samples, the different nucleic acid samples may be identified using different identifiers. For instance, as according to the invention sequencing may be performed using high throughput sequencing, multiple samples may be combined. Identifiers may then assist in identifying the sequences corresponding to the different samples. Identifiers may also be included in adaptors for ligation to DNA fragments assisting in DNA fragment sequences identification. Identifiers preferably differ from each other by at least two base pairs and preferably do not contain two identical consecutive bases to prevent misreads. The identifier function can sometimes be combined with other functionalities such as adaptors or primers.
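The two preferences stated for identifiers (a difference of at least two bases between any pair, and no two identical consecutive bases) can be sketched as a simple selection procedure. The greedy approach and all names below are assumptions made for illustration; they are not the way identifiers are necessarily designed in practice, and a greedy pass does not guarantee the largest possible set.

```python
from itertools import product

BASES = "ACGT"


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length tags differ."""
    return sum(x != y for x, y in zip(a, b))


def has_consecutive_repeat(tag: str) -> bool:
    """True if the tag contains two identical consecutive bases."""
    return any(x == y for x, y in zip(tag, tag[1:]))


def pick_identifiers(length: int, min_distance: int = 2) -> list[str]:
    """Greedily select tags of the given length that contain no identical
    consecutive bases and differ pairwise in at least `min_distance`
    positions."""
    chosen: list[str] = []
    for bases in product(BASES, repeat=length):
        tag = "".join(bases)
        if has_consecutive_repeat(tag):
            continue
        if all(hamming(tag, other) >= min_distance for other in chosen):
            chosen.append(tag)
    return chosen


if __name__ == "__main__":
    length = 4
    print(f"all possible {length} bp tags: {4 ** length}")
    tags = pick_identifiers(length)
    print(f"selected {len(tags)} tags, e.g. {tags[:5]}")
```

For 4 bp tags the unrestricted space contains 256 sequences, as noted in the text; applying the two constraints leaves a smaller usable subset.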
Suitably, in any of the methods as described herein, in step g) primers carrying a moiety, e.g. biotin, may be used for the optional purification of (amplified) ligated DNA fragments through binding to a solid support (e.g. streptavidin-coated beads). Capture-based enrichment using the moiety may then be performed as described below in the context of a hybridisation probe (except that the capture is performed by the biotin-streptavidin interaction, rather than hybridization of complementary nucleic acids).
In some embodiments, the enriching step g) comprises capture-based enrichment of the at least one of the plurality of target nucleotide sequences, preferably wherein the capture-based enrichment is specific for a defined sequence at one end of the target nucleotide sequence.
In one embodiment, the DNA fragments comprising the target nucleotide sequence may be captured with a hybridisation probe (also termed a capture probe) that hybridises to a target nucleotide sequence. The hybridisation probe may be attached directly to a solid support, or may comprise a moiety, e.g. biotin, to allow binding to a solid support suitable for capturing biotin moieties (e.g. beads coated with streptavidin). In any case, the DNA fragments comprising a target nucleotide sequence are captured thus allowing separation of ligation products comprising the target nucleotide sequence from ligation products not comprising the target nucleotide sequence. Hence, such a capture step allows enrichment for ligation products comprising the target nucleotide sequence. For a genomic region of interest comprising a target nucleotide sequence, at least one capture probe for the target nucleotide sequence may be used. For a genomic region of interest comprising a plurality of target nucleotide sequences, more than one probe may be used for multiple target nucleotide sequences (e.g. at least one probe for each target nucleotide sequence may be used). For example, one probe corresponding to 1 of 5 target nucleotide sequences may be used as a capture probe (A, B, C, D or E). Alternatively, the 5 probes may be used in a combined fashion (A, B, C, D and E) to capture the genomic region of interest.
Without wishing to be bound by theory, it is considered that - in the absence of the non-specific (universal) amplification and physical separation steps provided by the present invention - in capture-based protocols, ligation products that contain multiple target nucleotide sequences for which capture probes are being used will be captured more efficiently (to the detriment of more interesting ligation products that only contain one target nucleotide sequence). Accordingly, the present invention enables a greater number of target nucleotide sequences to be used whilst minimizing the risk of missing ligation products that only contain one target nucleotide sequence.
In one embodiment, a capture probe may be used that hybridises to an adaptor sequence comprised in amplified, ligated DNA generated in step e). In one embodiment, an amplification step and capture step are combined, e.g. first performing a capture step and then an amplification step or vice versa.
Site-directed nuclease digestion can also be used for the selective amplification of ligation products of interest. Site-directed digestion can be used to selectively add adaptors to and enable the amplification of linear ligation products comprising the target nucleotide sequence.
Suitably, the site-directed nuclease may be a zinc-finger nuclease (ZFN), transcription activator-like effector nuclease (TALEN), Argonaute (Ago) protein (Song et al., Nucleic Acids Research; 2020; 48(4); e19) or a CRISPR-Cas nuclease (e.g. Cas9). Preferably, the site-directed nuclease is a CRISPR-Cas nuclease.
Site-directed nuclease digestion can also be used for the selective linearization of ligation products of interest. If a proximity ligation protocol (e.g. a TLA protocol as described herein) is used for the generation of circular DNA template (e.g. circular TLA template), a site-directed nuclease can be used to selectively linearize TLA ligation products of interest. Once linearized, these sequences can be selectively enriched using PCR or capture-based approaches as described above. They can also be selectively sequenced, for example, with nanopore sequencing approaches which will very preferentially sequence linearized DNA molecules.
Thus, in some embodiments, the enriching step g) comprises site-directed nuclease-based enrichment of the at least one of the plurality of target nucleotide sequences, preferably wherein the site-directed nuclease-based enrichment comprises using a site-directed nuclease followed by amplifying the digested DNA using (e.g. inverse) PCR or capture-based enrichment.
In one embodiment, an amplification step and site-directed nuclease-based enrichment step are combined, e.g. first performing a site-directed nuclease-based enrichment step and then an amplification step or vice versa.
In one embodiment, a capture step and site-directed nuclease-based enrichment step are combined, e.g. first performing a site-directed nuclease-based enrichment step and then a capture step or vice versa.
Suitably, a different enrichment strategy as described herein may be performed on at least a first sample and a second sample generated during step f). For example, a PCR-based enrichment may be performed on a first sample, and a capture-based or site-directed nuclease-based enrichment may be performed on a second sample. In other words, it is not necessary that the same enrichment strategy be used on each separated sample generated in step f).
Determining the sequence of ligated DNA fragments
Suitably, at least part of the sequence of the enriched DNA generated from each sample in step g) may be determined.
Suitably, at least two of the enriched samples generated in step f) are pooled together before the sequencing step is performed. Suitably, each of the enriched samples generated in step f) are pooled together before the sequencing step is performed.
The DNA may be prepared as a DNA sequencing library and/or sequenced according to standard protocols. Conventional whole genome sequencing or high-throughput sequencing (e.g. NGS) approaches can be used. Determining the sequence is preferably performed using high throughput sequencing technologies, as this is more convenient and allows a high number of sequences to be determined to cover the complete genomic region of interest.
As used herein, the term “DNA sequencing library” means a sequencing-ready DNA library. Thus, the methods of the invention generate a compatible library (e.g. an NGS compatible library) for sequencing applications.
In some embodiments, a DNA sequencing library of a plurality of genomic regions of interest is made.
In some embodiments, step h) is performed using whole genome sequencing. In particular, when the genomic region of interest comprises a transgene integration site, it may be desirable to sequence the whole genome.
In some embodiments, step h) comprises determining at least part of the sequence of the enriched DNA comprising the target nucleotide sequence. Suitably, step h) comprises determining the whole sequence of the enriched DNA comprising the target nucleotide sequence.
As used herein, the term “sequencing” refers to determining the order of nucleotides (base sequences) in a nucleic acid sample, e.g. DNA or RNA. Many techniques are available such as Sanger sequencing and High throughput sequencing technologies such as offered by Roche, Illumina and Thermo Fisher. The step of determining the sequence of DNA preferably comprises high throughput sequencing. High throughput sequencing methods are well known in the art, and in principle any method may be contemplated to be used in the invention. High throughput sequencing technologies may be performed according to the manufacturer’s instructions (as e.g. provided by Roche, Illumina or Thermo Fisher). In general, sequencing adaptors may be ligated to the (amplified) undigested DNA fragments. In case the linear or circularized fragment is amplified, by using for example PCR as described herein, the amplified product is linear, allowing the ligation of the adaptors. Suitable ends may be provided for ligating adaptor sequences (e.g. blunt, complementary staggered ends). Alternatively, primer(s) used for PCR or other amplification method, may include adaptor sequences, such that amplified products with adaptor sequences are formed. In case the circularized fragment is not amplified, the circularized fragment may be fragmented, preferably by using for example a restriction enzyme in between primer binding sites for the inverse PCR reaction, such that DNA fragments ligated with the DNA fragment comprising the target nucleotide sequence remain intact. Sequencing adaptors may also be included in step c) of the methods of the invention.
Preferably long reads may be generated in the high throughput sequencing method used. Long reads may allow reading across multiple DNA fragments within undigested DNA fragments (which contain ligated DNA fragments). This way, DNA fragments of step b) may be identified. DNA fragment sequences may be compared to a reference sequence and/or compared with each other.
Hence, it is not required to provide for a complete sequence of the undigested DNA fragments (i.e. ligation products). It is preferred to at least sequence across (multiple) DNA fragments, such that DNA fragment sequences are determined.
It may also be contemplated to read even shorter sequences, for instance, short reads of 50-100 nucleotides. In case a standard sequencing protocol would be used, this may mean that the information regarding the undigested DNA fragments may be lost. With short reads it may not be possible to identify a complete DNA fragment sequence. In case such short reads are contemplated, it may be envisioned to provide additional processing steps such that separate ligated DNA fragments when fragmented, are ligated or equipped with identifiers, such that from the short reads, contigs may be built for the ligated DNA fragments. Such high throughput sequencing technologies involving short sequence reads may involve paired end sequencing. By using paired end sequencing and short sequence reads, the short reads from both ends of a DNA molecule used for sequencing, which DNA molecule may comprise different DNA fragments, may allow coupling of DNA fragments that were ligated. This is because two sequence reads can be coupled spanning a relatively large DNA sequence relative to the sequence that was determined from both ends. This way, contigs may be built for the DNA fragments.
However, using short reads may be contemplated without identifying DNA fragments, because from the short sequence reads a genomic region of interest may be built, especially when the genomic region of interest has been amplified. Information regarding DNA fragments and/or separate genomic regions of interest (for instance of a diploid cell) may be lost, but DNA mutations may still be identified.
Thus, the step of determining at least part of the sequence of the DNA may comprise short sequence reads, but preferably longer sequence reads are determined such that DNA fragment sequences may be identified. In addition, it may also be contemplated to use different high throughput sequencing strategies for the DNA fragments, e.g. combining short sequence reads from paired end sequencing with the ends relatively far apart with longer sequence reads; this way, contigs may be built for the DNA fragments.
When analyzing (short) sequence reads, it may be of interest to prevent sequencing the primers used in the enrichment step. Thus, in an alternative embodiment of the methods described herein, the primer sequence may be removed prior to the sequencing step h) (e.g. the high throughput sequencing step).
In the methods of the invention, from determined sequences generated in step h), a contig may be built of the genomic region of interest. When sequences of the DNA fragments are determined, overlapping reads may be obtained from which the genomic region of interest may be built. By increasing the sample size, e.g. increasing the number of cells analysed, the reliability of the genomic region of interest that is built may be increased.
Thus, in some embodiments, the method further comprises the step of building a contig of the genomic region of interest from the determined sequences generated in step h).
As used herein, the term "contig" is used in connection with DNA sequence analysis, and refers to reassembled contiguous stretches of DNA derived from two or more DNA fragments having contiguous nucleotide sequences. Thus, a contig may be a set of overlapping DNA fragments that provides a (partial) contiguous sequence of a genomic region of interest. A contig may also be a set of DNA fragments that, when aligned to a reference sequence, may form a contiguous nucleotide sequence. For example, the term "contig" encompasses a series of (ligated) DNA fragment(s) which are ordered in such a way as to have sequence overlap of each (ligated) DNA fragment(s) with at least one of its neighbours. The linked or coupled (ligated) DNA fragment(s), may be ordered either manually or, preferably, using appropriate computer programs such as FPC, PHRAP, CAP3 etc, and may also be grouped into separate contigs.
Alternatively, when in step b) a plurality of subsamples is generated, using different restriction enzymes or site-directed nucleases, overlapping reads will also be obtained. By increasing the plurality of subsamples, the number of overlapping fragments will increase, which may increase the reliability of the contig of the genomic region of interest that is built. From these determined sequences which may overlap, a contig may be built. Alternatively, if sequences do not overlap, e.g. when a single restriction enzyme has been used in step b), alignment of DNA fragments with a reference sequence may allow a contig of the genomic region of interest to be built.
In some embodiments, when the cell ploidy of the genomic region of interest is greater than 1, a contig is built for each ploidy.
In some embodiments, the step of building a contig comprises the steps of:
1) identifying the fragments of step b);
2) assigning the fragments to a genomic region;
3) building a contig for the genomic region.
In some embodiments, the step 2) of assigning the fragments to a genomic region comprises identifying the different ligation products of step d) and coupling of the different ligation products to the identified fragments.
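By way of non-limiting illustration only, steps 1) to 3) above may be sketched as follows for the case in which fragments are assigned to a genomic region by alignment to a reference sequence. The function name `build_contigs`, the use of exact string matching in place of a real aligner, and the toy sequences are assumptions made for this sketch and are not part of the described method. Only the first occurrence of each fragment in the reference is considered, so repeated sequences are not handled.

```python
def build_contigs(fragments, reference):
    """Place fragments on a reference by exact match and merge overlapping
    or abutting placements into contigs, returned as (start, end, sequence)."""
    # Steps 1-2: locate each identified fragment on the reference.
    placements = []
    for frag in fragments:
        start = reference.find(frag)  # first occurrence only (assumption)
        if start != -1:               # unplaced fragments are skipped here
            placements.append((start, start + len(frag)))
    placements.sort()

    # Step 3: merge placements that overlap or abut into contiguous stretches.
    merged = []
    for start, end in placements:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e, reference[s:e]) for s, e in merged]


reference = "ACGTTGCAAGGCTTACCGGA"
contigs = build_contigs(["ACGTTG", "TTGCAAG", "TACCGGA"], reference)
```

In this toy case the first two fragments overlap and are merged into one contig, while the third forms a separate contig.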
In one embodiment, the invention may be used to provide for quality control of generated sequence information. In the analysis of the sequences as provided by a method of high throughput sequencing, sequencing errors may occur. A sequencing error may occur for example during the elongation of the DNA strand, wherein an incorrect (i.e. non-complementary to the template) base is incorporated in the DNA strand. A sequencing error is different from a mutation, as the original DNA which is amplified and/or sequenced would not comprise that incorrect base. According to the invention, DNA fragment sequences may be determined, with (at least part of) sequences of DNA fragments ligated thereto, which sequences may be unique. The uniqueness of the ligated DNA fragments as they are formed in step c) may provide for quality control of the determined sequence in step h). When undigested DNA fragments are amplified, and sequenced at a sufficient depth, multiple copies of the same unique (ligated) DNA fragment(s) will be sequenced. Sequences of copies that originate from the same original undigested DNA fragment may be compared and amplification and/or sequencing errors may be identified.
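By way of non-limiting illustration only, the comparison of copies originating from the same original ligation product may be sketched as a per-position majority vote. The function name `consensus_and_errors` and the requirement that the copies are already aligned to equal length are assumptions made for this sketch. Grouping reads by their unique combination of ligated fragments is presumed to have been done beforehand.

```python
from collections import Counter


def consensus_and_errors(copies):
    """Given equal-length reads derived from one original ligation product,
    return the majority-vote consensus and the (read_index, position) pairs
    at which a read disagrees with that consensus."""
    if not copies:
        raise ValueError("at least one read is required")
    length = len(copies[0])
    if any(len(read) != length for read in copies):
        raise ValueError("reads must be pre-aligned to equal length")

    # Majority base per column; disagreements are candidate
    # amplification and/or sequencing errors.
    consensus = "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*copies)
    )
    errors = [
        (index, pos)
        for index, read in enumerate(copies)
        for pos, base in enumerate(read)
        if base != consensus[pos]
    ]
    return consensus, errors


consensus, errors = consensus_and_errors(["ACGTAC", "ACGTAC", "ACCTAC"])
```

A base seen in only a minority of copies of the same ligation product is more likely an error than a mutation, since a true mutation would be expected in all copies.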
Size selection
Prior to or after the enrichment step g), according to the methods of the invention, a size selection step may be performed. Such a size selection step may be performed using gel extraction chromatography, gel electrophoresis or density gradient centrifugation, which are methods generally known in the art. Preferably, DNA is selected of a size between 20-20,0000 base pairs, preferably 50-10,0000 base pairs, most preferably between 100-3,000 base pairs. A size separation step allows selection of (amplified) ligated DNA fragments in a size range that may be optimal for PCR amplification and/or optimal for the sequencing of long reads by next generation sequencing. Sequencing of reads of >1000 nucleotides is currently commercially available, and recent advances such as the Single Molecule Real Time (SMRT™) DNA Sequencing technology developed by Pacific Biosciences (http://www.pacificbiosciences.com/) indicate that reads of beyond 10,000 nucleotides are possible. Using nanopore sequencing, even longer reads are generated (https://nanoporetech.com/).
As used herein, “size selection” involves techniques with which particular size ranges of molecules, e.g. (ligated) DNA fragments or amplified (ligated) DNA fragments, are selected. Techniques that can be used are for instance gel electrophoresis, size exclusion, gel extraction chromatography, but are not limited thereto, as long as molecules with a particular size can be selected, such a technique will suffice.
Further fragmentation
In some embodiments, the ligated DNA fragments generated in step c) may be further fragmented prior to or after the non-selective amplification of step e). Accordingly, the ligated DNA fragments generated in step c) may be further fragmented prior to the separation performed in step f).
The fragmenting step b) and the optional further fragmenting step may be aimed at obtaining ligated DNA fragments of a size which is compatible with the subsequent enrichment step (e.g. amplification step) and/or sequence determination step. In addition, a further fragmenting step, preferably with an enzyme, may result in ligated fragment ends which are compatible with the optional ligation of an adaptor. The further fragmenting step may be performed after reversing the crosslinking, however, it is also possible to perform the further fragmenting step and/or ligation step while the DNA fragments are still crosslinked.
At least one adaptor may be ligated to the obtained ligated DNA fragments generated in the further fragmenting step. The ends of the ligated DNA fragments need to be compatible with ligation of such an adaptor. As the ligated DNA fragments may be linear DNA, ligation of an adaptor may provide for a primer hybridisation sequence. The adaptor sequence ligated with ligated DNA fragments comprising the target nucleotide sequence will provide for DNA molecules which may be amplified using PCR as described herein.
Ligated adaptor sequences can also be used as described herein to prevent exonuclease-based digestion.
Preferably, the DNA is further fragmented with a restriction enzyme or site-directed nuclease as described herein.
If both the fragmentation performed in step b) and further fragmenting steps comprise the use of restriction enzymes or site-directed nucleases, the recognition sequence of fragmentation step b) may be longer than the recognition sequence of the further fragmentation step. The enzyme of step b) thus cuts at a lower frequency than the further fragmentation step. This means that the average DNA fragment size of the further fragmentation step is smaller than the average fragment size generated in step b). This way, in fragmenting step b), relatively large fragments are formed, which are subsequently ligated and the second enzyme of the further fragmentation step cuts more frequently than the enzyme of step b).
Thus, in some embodiments, the recognition sequence of the fragmenting step b) is of greater length than the recognition sequence of the further fragmenting.
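By way of non-limiting illustration only, the relationship between recognition sequence length and cutting frequency may be approximated as follows. The sketch assumes a random sequence with equal frequencies of the four bases, which is an idealisation and not a statement about any particular genome. The function name `expected_fragment_size` is hypothetical.

```python
def expected_fragment_size(recognition_length: int) -> int:
    """Expected average spacing (in bp) between recognition sites of the given
    length, assuming a random sequence with equal base frequencies."""
    if recognition_length < 1:
        raise ValueError("recognition_length must be at least 1")
    # Each position matches a given base with probability 1/4, so a site of
    # n bases is expected once every 4**n positions.
    return 4 ** recognition_length


six_cutter = expected_fragment_size(6)   # 4096 bp
four_cutter = expected_fragment_size(4)  # 256 bp
```

Under this assumption, an enzyme with a 6-base recognition sequence yields fragments that are on average 16 times larger than those of an enzyme with a 4-base recognition sequence.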
Alternatively, in case the first and second fragmenting steps each comprise restriction enzymes, the restriction enzyme recognition site of the second fragmentation step may be longer than the recognition site of the restriction enzyme used in the first fragmentation step. The second enzyme thus cuts at a lower frequency than the first enzyme. This means that the average DNA fragment size after the first fragmentation is smaller than the average fragment size obtained after the second fragmentation step. This way, in the first fragmenting step, relatively small fragments are formed, which are subsequently ligated. As the second restriction enzyme cuts less frequently, most of the DNA fragments may not comprise the restriction recognition site of the second restriction enzyme. Thus, when the ligated DNA fragments are subsequently fragmented in the second fragmentation step, many of the initial DNA fragments may remain intact. This is useful because the combined sequences of the initial DNA fragments may be used to build a contig for the genomic region of interest. If the first fragmenting step is less frequent than the second optional fragmenting step, the result would be that the initial fragments are generally further fragmented, which may result in the loss of relatively large DNA sequences that are useful for building a contig. Thus, irrespective of which method would be used for the first and second optional fragmenting steps, it is preferred that the first fragmenting step is more frequent as compared to the second optional fragmenting step, such that DNA fragments may largely remain intact, i.e. are largely not further fragmented in the second optional fragmentation step.
Further method for determining the sequence of a genomic region of interest
In a third aspect, the invention provides a method for determining the sequence of a genomic region of interest comprising multiple target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample; h) determining at least part of the sequence of the enriched DNA generated from each sample in step g), preferably using high throughput sequencing.
The method of the third aspect of the invention may be performed as described herein with respect to the first aspect of the invention. Thus, the steps of the method of the third aspect of the invention can be carried out as described herein for the corresponding steps of the first aspect of the invention. In addition, all embodiments of the invention described herein with respect to the first aspect of the invention are applicable to the third aspect of the invention.
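By way of non-limiting illustration only, one possible way of distributing a plurality of target nucleotide sequences over the separated samples of steps f) and g) is a simple round-robin assignment, so that each sample is enriched for at least one target that is not enriched for in any other sample. The function name `assign_targets` and the round-robin scheme are assumptions made for this sketch. The method does not prescribe any particular assignment.

```python
def assign_targets(targets, n_samples):
    """Distribute targets over samples in round-robin order so that every
    target is enriched for in exactly one sample."""
    if n_samples < 2:
        raise ValueError("at least a first and a second sample are required")
    if len(targets) < n_samples:
        raise ValueError("each sample needs at least one target of its own")

    samples = [[] for _ in range(n_samples)]
    for index, target in enumerate(targets):
        samples[index % n_samples].append(target)
    return samples


assignment = assign_targets(["A", "B", "C", "D", "E"], 2)
```

With five targets and two samples, the first sample receives A, C and E and the second receives B and D, so no ligation product can be preferentially enriched through two of its targets within one sample unless both targets happen to share a sample.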
Identifying mutations
In alternative aspects of the invention, methods are provided for identifying the presence or absence of a genetic mutation.
In a first embodiment, a method is provided for identifying the presence or absence of a genetic mutation, comprising the steps a)-h) of any of the methods of the second or third aspects of the invention as described above, wherein contigs are built for a plurality of samples, comprising the further steps of: i) aligning the contigs of a plurality of samples; and j) identifying the presence or absence of a genetic mutation in the genomic regions of interest from the plurality of samples.
Alternatively, a method for identifying the presence or absence of a genetic mutation is provided, comprising the steps a)-h) of any of the methods of the invention as described above, comprising the further steps of: i) aligning the contig to a reference sequence; and j) identifying the presence or absence of a genetic mutation in the genomic region of interest.
Genetic mutations can be identified, for instance, by comparing the contigs of multiple samples. In case one (or more) of the samples comprises a genetic mutation, this may be observed as the sequence of the contig is different when compared to the sequence of the other samples, i.e. the presence of a genetic mutation is identified. In case no sequence differences between contigs of the samples are observed, the absence of a genetic mutation is identified. Alternatively, a reference sequence may also be used to which the sequence of a contig may be aligned. When the sequence of the contig of the sample is different from the sequence of the reference sequence, a genetic mutation is observed, i.e. the presence of a genetic mutation is identified. In case no sequence differences between the contig of the sample or samples and the reference sequence are observed, the absence of a genetic mutation is identified.
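By way of non-limiting illustration only, the comparison of a contig with a reference sequence may be sketched as follows. The function name `find_substitutions` is hypothetical, and the sketch assumes that the two sequences are already aligned and of equal length, so that only substitutions are reported. Insertions, deletions and rearrangements would require a real alignment tool.

```python
def find_substitutions(contig, reference):
    """Return (position, reference_base, contig_base) for every position at
    which a pre-aligned contig differs from the reference sequence."""
    if len(contig) != len(reference):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (pos, ref_base, base)
        for pos, (ref_base, base) in enumerate(zip(reference, contig))
        if ref_base != base
    ]


reference = "ACGTAC"
no_mutation = find_substitutions("ACGTAC", reference)
one_mutation = find_substitutions("ACCTAC", reference)
```

An empty result corresponds to identifying the absence of a mutation. A non-empty result corresponds to identifying its presence.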
It is not required to build a contig for identifying the presence or absence of a genetic mutation. As long as DNA fragment sequences may be aligned, with each other or with a reference sequence, the presence or absence of a genetic mutation may be identified. Thus, in alternative embodiments of the invention, a method is provided for identifying the presence or absence of a genetic mutation, according to any of the methods as described above, without the further step of building a contig.
Such a method comprises the steps a)-h) of any of the methods as described above and the further steps of: i) aligning the determined sequences of the (amplified) undigested DNA fragments generated in step h) to a reference sequence; and j) identifying the presence or absence of a genetic mutation in the determined sequences.
Alternatively, a method is provided for identifying the presence or absence of a genetic mutation, wherein of a plurality of samples sequences of (amplified) undigested DNA fragments are determined, comprising the steps a)-h) of any of the methods as described above, comprising the further steps of: i) aligning the determined sequences (generated in step h)) of the (amplified) undigested DNA fragments of a plurality of samples; j) identifying the presence or absence of a genetic mutation in the determined sequences.
Ratio of alleles or cells carrying a genetic mutation or transgene
As already mentioned above, when a sample of crosslinked DNA is provided from heterogeneous cell populations (e.g. cells with different origin or cells from an organism which comprises normal cells and genetically mutated cells (e.g. cancer cells)), for each genomic region of interest corresponding to different genomic environment (which may e.g. be different genomic environments from different alleles in a cell or different genomic environments from different cells) contigs may be built. In addition, the ratio of fragments or ligation products carrying an allele, transgene or genetic mutation may be determined, which may correlate to the ratio of alleles or cells carrying the genetic mutation or the transgene. Since the ligation of DNA fragments is a random process, the collection and order of DNA fragments that are part of the ligation products may be unique and represent a single cell and/or a single genomic region of interest from a cell.
Thus, identifying ligation products comprising the fragment with the allele, genetic mutation or transgene may also comprise identifying ligation products with a unique order and collection of DNA fragments. The ratio of alleles or cells carrying a genetic mutation or transgene may be of importance in evaluation of therapies, e.g. in case patients are undergoing therapy for cancer, such as gene therapy. Cancer cells may carry a particular genetic mutation or cells may carry a particular transgene. The percentage of cells carrying such a mutation or the transgene may be a measure for the success or failure of a therapy. In alternative embodiments, methods are provided for determining the ratio of fragments carrying an allele, genetic mutation or transgene, and/or the ratio of ligation products carrying a genetic mutation. In this embodiment, a genetic mutation is defined as a particular genetic mutation or a selection of particular genetic mutations.
In one aspect, a method is provided for determining the ratio of fragments carrying an allele, genetic mutation or transgene from a cell population suspected of being heterologous comprising the steps a)-h) of any of the methods as described above, comprising the further steps of: i) identifying the fragments of step b); j) identifying the presence or absence of an allele, genetic mutation or transgene in the fragments; k) determining the number of fragments carrying the allele, genetic mutation or transgene; l) determining the number of fragments not carrying the allele, genetic mutation or transgene;
m) calculating the ratio of fragments carrying the allele, genetic mutation or transgene.
In another aspect, a method is provided for determining the ratio of ligation products carrying a fragment with an allele, genetic mutation or transgene from a cell population suspected of being heterologous comprising the steps a)-h) of any of the methods as described above, comprising the further steps of: i) identifying the fragments of step b); j) identifying the presence or absence of an allele, genetic mutation or transgene in the fragments; k) identifying the ligation products of step c) carrying the fragments with or without the allele, genetic mutation or transgene; l) determining the number of ligation products carrying the fragments with the allele, genetic mutation or transgene; m) determining the number of ligation products carrying the fragments without the allele, genetic mutation or transgene; n) calculating the ratio of ligation products carrying the allele, genetic mutation or transgene.
In the methods of these embodiments, the presence or absence of an allele, genetic mutation or transgene may be identified in step j) by aligning to a reference sequence and/or by comparing DNA fragment sequences of a plurality of samples.
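By way of non-limiting illustration only, the counting and ratio steps above may be sketched as follows. The function name `carrier_fraction` is hypothetical. The sketch assumes that the presence or absence of the allele, genetic mutation or transgene has already been established for each fragment or ligation product, and it expresses the ratio as the fraction of carriers among all items counted, which is one possible reading of "ratio".

```python
def carrier_fraction(carries_variant):
    """Given one boolean per fragment (or ligation product), return the number
    of carriers, the number of non-carriers and the fraction of carriers."""
    flags = list(carries_variant)
    if not flags:
        raise ValueError("at least one fragment or ligation product is required")
    with_variant = sum(1 for flag in flags if flag)
    without_variant = len(flags) - with_variant
    return with_variant, without_variant, with_variant / len(flags)


# Five fragments, two of which carry the variant.
counts = carrier_fraction([True, False, False, True, False])
```

The same function applies to ligation products when each flag describes a ligation product rather than a fragment.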
In the methods according to the invention, an identified genetic mutation may be a single nucleotide polymorphism (SNP), an insertion, an inversion and/or a translocation. In case a deletion and/or insertion is observed, the number of fragments and/or ligation products from a sample carrying the deletion and/or insertion may be compared with a reference sample in order to identify the deletion and/or insertion. A deletion, insertion, inversion and/or translocation may also be identified based on the presence of chromosomal breakpoints in analyzed fragments.
In another embodiment, in the methods as described above, the presence or absence of methylated nucleotides is determined in DNA fragments, ligated DNA fragments, and/or genomic regions of interest. For example, the DNA of step a)-g) may be treated with bisulphite. Treatment of DNA with bisulphite converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected. Thus, bisulphite treatment introduces specific changes in the DNA sequence that depend on the methylation status of individual cytosine residues, yielding single-nucleotide resolution information about the methylation status of a segment of DNA. By dividing samples into subsamples, wherein one of the samples is treated, and the other is not, methylated nucleotides may be identified. Alternatively, sequences from a plurality of samples treated with bisulphite may also be aligned, or a sequence from a sample treated with bisulphite may be aligned to a reference sequence.
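By way of non-limiting illustration only, the comparison of a bisulphite-treated subsample with an untreated subsample may be sketched as follows. The function name `call_methylation` is hypothetical. The sketch assumes same-strand, pre-aligned sequences of equal length, and that converted cytosines are read as thymine after amplification and sequencing (uracil pairs like thymine).

```python
def call_methylation(untreated, treated):
    """Classify each cytosine of the untreated sequence as methylated (still C
    after bisulphite treatment) or unmethylated (read as T after treatment)."""
    if len(untreated) != len(treated):
        raise ValueError("sequences must be pre-aligned to equal length")
    methylated, unmethylated = [], []
    for pos, (before, after) in enumerate(zip(untreated, treated)):
        if before != "C":
            continue  # only cytosines are informative
        if after == "C":
            methylated.append(pos)
        elif after == "T":
            unmethylated.append(pos)
        # any other base is left unclassified (possible sequencing error)
    return methylated, unmethylated


methylated, unmethylated = call_methylation("ACGTCCGA", "ACGTTTGA")
```

In this toy case the cytosine at position 1 is retained and is called methylated, whereas the cytosines at positions 4 and 5 are converted and are called unmethylated.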
This disclosure is not limited by the exemplary methods and materials disclosed herein, and any methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of this disclosure. Numeric ranges are inclusive of the numbers defining the range. Unless otherwise indicated, any nucleic acid sequences are written left to right in 5' to 3' orientation; amino acid sequences are written left to right in amino to carboxy orientation, respectively. Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within this disclosure. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within this disclosure, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in this disclosure.
It must be noted that as used herein and in the appended claims, the singular forms "a", "an", and "the" include plural referents unless the context clearly dictates otherwise.
The terms "comprising", "comprises" and "comprised of" as used herein are synonymous with "including", "includes" or "containing", "contains", and are inclusive or open-ended and do not exclude additional, non-recited members, elements or method steps. The terms "comprising", "comprises" and "comprised of" also include the term "consisting of".
The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that such publications constitute prior art to the claims appended hereto.
The invention will now be further described by way of Examples, which are meant to serve to assist one of ordinary skill in the art in carrying out the invention and are not intended in any way to limit the scope of the invention.
EXAMPLES
Example 1: Illustrative example of targeted sequencing of rare integrated viruses and integration sites
This is an example of an approach to preferentially sequence integrated copies of a virus vector sequence in a human cell sample that contains a limited number of integration sites.
Fixation and cell lysis
Cultured cells are washed with PBS and fixated with PBS/10% FCS/2% formaldehyde for 10 minutes at RT. The cells are subsequently washed and collected, and taken up in lysis buffer (50 mM Tris-HCl pH 7.5, 150 mM NaCl, 5 mM EDTA, 0.5% NP-40, 1% TX-100 and 1X Complete protease inhibitors (Roche #11245200)) and incubated for 10 minutes on ice. Cells are subsequently washed and taken up in MilliQ.
Fragmenting 1
In a first fragmentation step, the fixed, lysed cells are digested with a restriction enzyme targeting a restriction site sequence which occurs 6 times in the viral genome sequence. This means that the fragmentation of the viral genome with this restriction enzyme results in the generation of seven fragments. Target nucleotide sequences are chosen in proximity to each of the restriction sites.
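The relationship between the number of restriction sites and the number of fragments can be illustrated with a minimal Python sketch. It assumes a linear sequence, for which cutting at n sites yields n + 1 fragments. The function name `digest_linear`, the toy sequence and the EcoRI-style site (G^AATTC) are illustrative assumptions and are not taken from the example.

```python
def digest_linear(sequence: str, site: str, cut_offset: int = 0) -> list[str]:
    """Cut a linear sequence at every occurrence of `site`.

    `cut_offset` is the position of the cut within the recognition site
    (for example 1 for a G^AATTC-style cut). Hypothetical helper.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    # The stretch after the last cut is the final fragment.
    fragments.append(sequence[start:])
    return fragments


# Toy sequence containing three copies of the site (assumed, for illustration only).
toy_sequence = "AAGAATTCCCGAATTCTTGAATTCGG"
toy_fragments = digest_linear(toy_sequence, "GAATTC", cut_offset=1)
n_sites = toy_sequence.count("GAATTC")
print(n_sites, len(toy_fragments))
```

In this toy case three sites give four fragments, and joining the fragments in order restores the input sequence.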
Ligating 1
The restriction enzyme is heat-inactivated and subsequently a ligation step is performed using T4 DNA Ligase (Roche, #799009).
Reversing cross-linking
Proteinase K (10 mg/ml) is added to the sample, which is then incubated at 65°C. RNase A (10 mg/ml, Roche #10109169001) is subsequently added and the sample is incubated at 37°C. Next, phenol-chloroform extraction is performed, and the supernatant comprising the DNA is precipitated and pelleted. The pellet is dissolved in 10 mM Tris-HCl pH 7.5.
Non-selective amplification
Non-selective amplification of the ligation product is performed using multiple strand displacement (MSD) amplification (Telenius et al. (1992) Genomics 13(3) 718-725). The MSD technique employs a unique and highly processive mesophilic DNA polymerase, phi29. The resulting product consists of long fragments (10-50 kb), with good amplification and representation.
Fragmenting 2
The sample is fragmented using a strategy that, on average, results in fragments of around 3000 bp in size (so that the resulting circularised DNA can be amplified effectively).
Circularisation
The resulting linear DNA is circularised with a ligase enzyme.
Physical separation
The product of the non-selective amplification is divided into 7 separate tubes.
Amplifying ligated DNA fragments: PCR
Separate amplifications are performed with seven primer pairs specific for the seven viral restriction fragments generated in the first fragmentation step.
The primers used for the PCR-enrichment are designed as inverted unique primers specific for the target nucleotide sequence.
This results in the amplification of the remaining circularized DNA which consists of ligation events originating from integrated copies of the virus.
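The orientation of inverse primers on circularised DNA can be sketched as follows. This is a minimal illustration, assuming the target nucleotide sequence is written 5' to 3' and that both primers are taken from its ends so that they extend outwards, around the circle, rather than towards each other. The function names, the primer length and the toy target are hypothetical and do not represent the primer design actually used.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def inverse_primer_pair(target: str, primer_len: int = 20) -> tuple[str, str]:
    """Return outward-facing primers (both written 5' to 3') for a target.

    Hypothetical helper: real primer design would also consider melting
    temperature, uniqueness in the genome and secondary structure.
    """
    if len(target) < 2 * primer_len:
        raise ValueError("target too short for two non-overlapping primers")
    # Extends leftwards out of the target: reverse complement of the target's 5' end.
    outward_left = reverse_complement(target[:primer_len])
    # Extends rightwards out of the target: identical to the target's 3' end.
    outward_right = target[-primer_len:]
    return outward_left, outward_right


# Toy target sequence (assumed, for illustration only).
left_primer, right_primer = inverse_primer_pair("ATGCGTACGTTAGC", primer_len=4)
print(left_primer, right_primer)
```

On a linear template such a pair would not yield a product, because the primers point away from each other. On a circular template the two extensions meet on the far side of the circle, so the sequence ligated to the target is amplified.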
Sequencing the amplified ligated DNA fragments
The amplified DNA can be library prepped and sequenced according to standard protocols.
Example 2 - Sequencing of BRCA gene in a FFPE tumour sample
Deparaffinization buffer: 1x CutSmart buffer (Invitrogen) containing 0.02% Igepal 10x
Ligation buffer: 660 mM Tris pH 7.5, 50 mM MgCl2, 50 mM DTT, 10 mM ATP
Starting material
The starting material for the procedure is single 2-10 µm sections (coupes) of regular diagnostic FFPE material, either as a scroll in a 1.5 ml Eppendorf tube or on a diagnostic microscope slide. Material from a slide should be scraped off and transferred to a 1.5 ml Eppendorf tube before starting.
FFPE Deparaffinization
Samples are heated in Deparaffinization buffer for 3 min at 80°C while shaking at 900 RPM, before centrifugation for 2 min at 10,000 x g at 40°C. The paraffin layer is removed using a pipette tip. The steps from heating through paraffin layer removal are then repeated. SDS is added to the supernatant and the tissue is transferred with the buffer to a 130 µl Covaris Screw Cap microTUBE. Samples are sonicated on a Covaris M220 for 300 seconds, Duty factor 20%, Power 75 Watts, 200 cycles/burst at 20°C. Samples are transferred back to the same 1.5 ml Eppendorf tube. CutSmart buffer is added before incubation for 2 hours at 80°C while shaking at 900 RPM. Triton X-100 is then added before incubation for 30 minutes at 37°C while shaking at 900 RPM.
Digestion
100 U NlaIII (New England Biolabs) is added and the sample is incubated for 1 hour at 37°C while shaking at 900 RPM. NlaIII is then inactivated for 25 minutes at 65°C.
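As a rough guide to the fragment sizes expected from this digestion, the average spacing between recognition sites can be estimated from the length of the site. NlaIII recognises the 4 bp sequence CATG. The Python sketch below assumes that bases occur independently, with frequencies set only by the GC content. This is a simplification, since real genomes deviate from it and FFPE DNA is already partly fragmented before digestion. The function name and parameters are hypothetical.

```python
def expected_site_spacing(site: str, gc_content: float = 0.5) -> float:
    """Estimate the mean distance (bp) between occurrences of `site`.

    Assumes independent bases with P(G) = P(C) = gc_content / 2 and
    P(A) = P(T) = (1 - gc_content) / 2. Hypothetical helper.
    """
    base_prob = {
        "G": gc_content / 2,
        "C": gc_content / 2,
        "A": (1 - gc_content) / 2,
        "T": (1 - gc_content) / 2,
    }
    site_prob = 1.0
    for base in site.upper():
        site_prob *= base_prob[base]
    return 1.0 / site_prob


# A 4 bp site at 50% GC is expected about once every 4**4 = 256 bp.
print(expected_site_spacing("CATG"))
# A 6 bp site under the same assumption is expected about once every 4**6 = 4096 bp.
print(expected_site_spacing("GAATTC"))
```

Under these assumptions a shorter recognition sequence cuts more frequently and yields smaller fragments than a longer one.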
Ligation
The sample is cooled to room temperature (RT). Ligation buffer, T4 DNA ligase and deionized water are then added and the sample is incubated for 2 hours at RT while tumbling. NaCl, SDS and Proteinase K (Roche) are then added before incubating for 1 hour at 56°C and 16 hours at 80°C.
Purification
DNA is purified using NucleoMag P-beads (Macherey-Nagel).
Non-selective amplification
Non-selective amplification of the ligation product is performed using the RepliG FFPE kit according to the manufacturer's instructions.
Universal primer ligation
Amplicons are fragmented to an average size of 2 kb. Universal adaptors are ligated to the generated fragments according to the manufacturer's instructions.
Physical separation
The product of the non-selective amplification, fragmentation and adaptor ligation is divided into 10 separate tubes.
Amplifying ligated DNA fragments: multiplex linear PCR
Separate amplifications are performed with 10 different primer sets, each consisting of universal primers specific for the adaptors ligated previously and a set of 10 primers specific for different restriction fragments of the BRCA gene resulting from the first fragmentation step. The primers used in each individual multiplex are spaced evenly across the BRCA gene (i.e. of the 100 primers used in all amplifications, the first mix represents the 1st, 11th, 21st, etc. position across the gene of interest, the second mix the 2nd, 12th, 22nd, etc.).
This will result in the amplification of the BRCA gene.
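The assignment of primers to multiplex mixes described above can be sketched in Python. This is a minimal illustration, assuming the 100 target-specific primers are held in a list ordered by their position across the gene. The primer names and the function `interleave_into_mixes` are hypothetical placeholders.

```python
def interleave_into_mixes(primers: list[str], n_mixes: int) -> list[list[str]]:
    """Distribute position-ordered primers so that each mix takes every n-th primer."""
    return [primers[i::n_mixes] for i in range(n_mixes)]


# Placeholder names for 100 primers, ordered by position across the gene of interest.
primers = [f"primer_{i:03d}" for i in range(1, 101)]
mixes = interleave_into_mixes(primers, 10)

# The first mix holds the 1st, 11th, 21st, ... primers; the second the 2nd, 12th, 22nd, ...
print(mixes[0][:3])
print(mixes[1][:3])
```

With this layout, primers targeting neighbouring restriction fragments always fall into different tubes, while each tube still samples the whole gene.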
Sequencing the amplified ligated DNA fragments
The amplified DNA can be library prepped and sequenced according to standard protocols.
All publications mentioned in the above specification are herein incorporated by reference. Various modifications and variations of the described methods and system of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific preferred embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention which are obvious to those skilled in molecular biology or related fields are intended to be within the scope of the following claims.

Claims

1. A method for enriching a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample.
2. A method for making a DNA sequencing library of a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample;
g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample; h) optionally, determining at least part of the sequence of the enriched DNA generated from each sample in step g), preferably using high throughput sequencing.
3. A method for determining the sequence of a genomic region of interest comprising a plurality of target nucleotide sequences, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) performing a non-selective amplification of the ligated DNA generated in step d); f) separating the amplified, ligated DNA generated in step e) into at least a first sample and a second sample; g) enriching at least the first and the second samples of amplified, ligated DNA generated in step f) for DNA comprising at least one of the plurality of target nucleotide sequences; wherein the amplified, ligated DNA in the first sample is enriched for at least one target nucleotide sequence which is not enriched for in the second sample and the amplified, ligated DNA in the second sample is enriched for at least one target nucleotide sequence which is not enriched for in the first sample; i) determining at least part of the sequence of the enriched DNA generated from each sample in step g), preferably using high throughput sequencing.
4. The method according to any preceding claim wherein step f) comprises separating the amplified, ligated DNA generated in step e) into at least 3, at least 4, at least 5, at least 8 or at least 10 samples.
5. The method according to any preceding claim wherein step g) comprises performing a singleplex enrichment using a single target nucleotide sequence of the plurality of target nucleotide sequences in each separate sample from step f).
6. The method according to any of claims 1 to 4 wherein step g) comprises performing a multiplex enrichment using at least two target nucleotide sequences of the plurality of target nucleotide sequences in one or more of the separate samples from step f).
7. The method according to claim 6 wherein step g) comprises performing a multiplex enrichment using at least two target nucleotide sequences of the plurality of target nucleotide sequences in each of the separate samples from step f).
8. The method according to claim 6 or 7 wherein step g) comprises performing a multiplex enrichment using at least two, at least three, at least four, at least five, at least eight or at least ten target nucleotide sequences.
9. The method according to any preceding claim, wherein the enrichment in step g) is performed by PCR, capture-based enrichment and/or site-directed nuclease digestion.
10. The method according to any preceding claim wherein the amplified, ligated DNA is circularised after step e) and prior to step f) or step g); and step g) is performed by PCR enrichment using inverse primer pairs.
11. The method according to any of claims 1 to 9 wherein universal adapters are ligated to the amplified, ligated DNA after step e) and prior to step f) or step g); and step g) is performed by PCR enrichment with primer pairs complementary to a target nucleotide sequence and the universal adapter.
12. The method according to any preceding claim, wherein the fragmentation in step b) is performed by sonication, optionally followed by enzymatic DNA end repair.
13. The method according to any of claims 1 to 11, wherein the fragmentation in step b) is non-random fragmentation of the DNA at a recognition sequence.
14. The method according to claim 13, wherein the non-random fragmentation comprises fragmenting with a restriction enzyme.
15. The method according to claim 13 or 14 in which the target nucleotide sequences used in separate enrichments originate from adjacent restriction fragments generated in the nonrandom fragmentation.
16. The method of any preceding claim, wherein the ligation in step c) is performed in the presence of an adaptor, ligating adaptor sequences in between fragments.
17. The method according to any preceding claim, wherein the method further comprises the step of: (i) further fragmenting the ligated DNA generated in step d) prior to step e) and/or (ii) further fragmenting the amplified, ligated DNA generated in step e) prior to step f).
18. The method of claim 17, wherein the fragmentation in step b) and the further fragmentation in step (i) and/or (ii) are each performed as a non-random fragmentation of the DNA at a recognition sequence; wherein the recognition sequence of the first fragmentation step b) is shorter than the recognition sequence of the further fragmentation step.
19. The method of claim 17 or 18, wherein the fragmentation in each step b) and step (i) and/or (ii) are each performed with a restriction enzyme.
20. The method according to any preceding claim wherein the cross-linked sample is a tissue sample.
21. The method according to claim 20 wherein the tissue sample is a tumour sample.
22. The method according to any preceding claim wherein the cross-linked sample is a Formalin-Fixed Paraffin-Embedded (FFPE) sample.
23. The method according to any preceding claim, wherein the sequences of a plurality of genomic regions of interest are determined.
24. The method according to any preceding claim in which target nucleotide sequences used in separate enrichments originate from genomic positions within a 1 kb, 2kb or 10kb physical distance.
25. The method according to any preceding claim in which target nucleotide sequences used in separate enrichments originate from different DNA strands in the genomic region of interest.
PCT/EP2022/071761 2021-08-03 2022-08-02 Method WO2023012195A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB2111194.3A GB202111194D0 (en) 2021-08-03 2021-08-03 Method
GB2111194.3 2021-08-03

Publications (1)

Publication Number Publication Date
WO2023012195A1 (en) 2023-02-09

Family

ID=77651244

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/071761 WO2023012195A1 (en) 2021-08-03 2022-08-02 Method

Country Status (2)

Country Link
GB (1) GB202111194D0 (en)
WO (1) WO2023012195A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007004057A2 (en) 2005-07-04 2007-01-11 Erasmus University Medical Center Chromosome conformation capture-on-chip (4c) assay
WO2008084405A2 (en) * 2007-01-11 2008-07-17 Erasmus University Medical Center Circular chromosome conformation capture (4c)
WO2012005595A2 (en) 2010-07-09 2012-01-12 Wouter Leonard De Laat V3-d genomic region of interest sequencing strategies
WO2012055595A1 (en) 2010-10-29 2012-05-03 Robert Bosch Gmbh Electromechanical flywheel
WO2019109086A1 (en) * 2017-12-01 2019-06-06 Illumina, Inc. Methods and systems for determining somatic mutation clonality
WO2019148119A1 (en) 2018-01-29 2019-08-01 St. Jude Children's Research Hospital, Inc. Method for nucleic acid amplification

Non-Patent Citations (19)

* Cited by examiner, † Cited by third party
Title
ALBERT L. LEHNINGER: "Principles of Biochemistry", 1982, ACADEMIC PRESS, pages: 793 - 800
ALIGN 2: "United States Copyright Office", 10 December 1991, GENENTECH, INC., pages: 20559
ANNETTE DENKER ET AL: "The second decade of 3C technologies: detailed insights into nuclear organization", 1 January 2016 (2016-01-01), pages 1357 - 1382, XP055340371, Retrieved from the Internet <URL:http://genesdev.cshlp.org/content/30/12/1357.full.pdf> [retrieved on 20170130], DOI: 10.1101/gad.281964.116 *
AUSUBEL ET AL.: "Current Protocols in Molecular Biology", 1987, JOHN WILEY & SONS
HARD ET AL., BIORXIV 439527, 2021
JINEK M, SCIENCE, vol. 337, 2012, pages 816 - 821
KIM H, NATURE COMMUNICATIONS, vol. 8, 2017, pages 14406
KITTLER ET AL., ANALYTICAL BIOCHEMISTRY, vol. 300, 2002, pages 237 - 244
KITTLER R ET AL: "A whole genome amplification method to generate long fragments from low quantities of genomic DNA", ANALYTICAL BIOCHEMISTRY, ACADEMIC PRESS, AMSTERDAM, NL, vol. 300, no. 2, 15 January 2002 (2002-01-15), pages 237 - 244, XP002296223, ISSN: 0003-2697, DOI: 10.1006/ABIO.2001.5460 *
LANGMORE, PHARMACOGENOMICS, vol. 3, 2002, pages 557 - 560
LANNONE ET AL., CYTOMETRY, vol. 39, 2000, pages 131 - 140
MATTHEW W. SNYDER ET AL: "Haplotype-resolved genome sequencing: experimental methods and applications", NATURE REVIEWS GENETICS, vol. 16, no. 6, 15 June 2015 (2015-06-15), GB, pages 344 - 358, XP055345555, ISSN: 1471-0056, DOI: 10.1038/nrg3903 *
MAZOYER, HUMAN MUTATION, vol. 25, 2005, pages 415 - 422
SAMBROOK ET AL.: "Molecular Cloning. A Laboratory Manual", 1989, COLD SPRING HARBOR LABORATORY PRESS
SONG ET AL., NUCLEIC ACIDS RESEARCH, vol. 48, no. 4, 2020, pages e19
SUNGALEE ET AL., NATURE GENETICS, vol. 53, 2021, pages 650 - 662
TARASOVA ET AL., BMC MOL. BIOL., vol. 9, 2008, pages 7
TELENIUS ET AL., GENOMICS, vol. 13, no. 3, 1992, pages 718 - 725
VREE, NATURE BIOTECHNOLOGY, vol. 32, 2014, pages 1019 - 1025

Also Published As

Publication number Publication date
GB202111194D0 (en) 2021-09-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22761468

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE