WO2022036977A1 - Double-ended library tag composition and its application in the MGI sequencing platform - Google Patents

Double-ended library tag composition and its application in the MGI sequencing platform

Info

Publication number
WO2022036977A1
Authority
WO
WIPO (PCT)
Prior art keywords
library
tag
tags
double
ended
Prior art date
Application number
PCT/CN2020/139919
Other languages
English (en)
French (fr)
Inventor
汪彪
胡玉刚
吴强
Original Assignee
纳昂达(南京)生物科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 纳昂达(南京)生物科技有限公司
Priority to EP20947818.9A priority Critical patent/EP3998343B1/en
Priority to JP2023511829A priority patent/JP2023538561A/ja
Publication of WO2022036977A1 publication Critical patent/WO2022036977A1/zh

Classifications

    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B70/00 Tags or labels specially adapted for combinatorial chemistry or libraries, e.g. fluorescent tags or bar codes
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1093 General methods of preparing gene libraries, not provided for in other subgroups
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806 Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B50/00 Methods of creating libraries, e.g. combinatorial synthesis
    • C40B50/06 Biochemical methods, e.g. using enzymes or whole viable microorganisms

Definitions

  • the invention relates to the field of plasma DNA library construction, in particular, to a double-ended library tag composition and its application in an MGI sequencing platform.
  • each sample needs to be labeled and sequenced with a different index sequence (Index) and then split.
  • the current MGI sequencing platforms basically use single-end tagged libraries. Due to the natural defects of single-ended tags (Index), it is easy to cause crosstalk between samples. Due to the contamination caused by tag adapters or primers in the synthesis, experimental operation and sequencing, mutual crosstalk is inevitable. Therefore, it is necessary to solve the low-frequency crosstalk between samples. The best way at present is to use double-ended tags. The method of double-ended labeling can effectively remove the crosstalk between samples.
  • the main purpose of the present invention is to provide a double-ended library tag composition and its application in an MGI sequencing platform, so as to solve the problem that labeling libraries with single-end tags on existing MGI sequencing platforms easily produces crosstalk between samples.
  • a double-ended library tag composition comprising a plurality of 5'-end library tags and a plurality of 3'-end library tags, wherein the 5'-end library tags are all of the same length, the 3'-end library tags are all of the same length, and, within the composition, each base occurs the same number of times at any given position.
  • the 5'-end library tags have the same length as the 3'-end library tags, preferably any fixed length between 6 and 10 bp; preferably, any two library tags in the composition differ by at least 3 bases, and no library tag contains more than 3 consecutive identical bases; preferably, the GC content of every library tag is 40-60%; preferably, the composition includes a 4-tag balanced combination or an 8-tag balanced combination of double-ended library tags, wherein the 4-tag balanced combination consists of 4n library tags at the 5' end and 4n library tags at the 3' end, and the 8-tag balanced combination consists of 8n library tags at the 5' end and 8n library tags at the 3' end, where n is a natural number greater than or equal to 1.
  • the library tags at the 5' end are selected from any one or more of the 96 groups shown in Table 1, and the library tags at the 3' end are selected from any one or more of the 96 groups shown in Table 1 that differ from the groups used at the 5' end.
  • the library tags at the 5' end are selected from any one or more of the 48 groups shown in Table 2, and the library tags at the 3' end are selected from any one or more of the 48 groups shown in Table 2 that differ from the groups used at the 5' end.
  • an amplification primer composition with double-ended library tags based on an MGI sequencing platform comprising a plurality of amplification primer pairs with double-ended library tags
  • each amplification primer pair includes a library tag at the 5' end and a library tag at the 3' end; the 5'-end library tags of the amplification primer pairs are all of the same length, the 3'-end library tags of the amplification primer pairs are all of the same length, and each base occurs the same number of times at any given position.
  • the lengths of the library tags at the 5' ends of the multiple amplification primer pairs are the same as the lengths of the library tags at the 3' ends of the multiple amplification primer pairs; preferably, the lengths of the library tags at the 5' ends and the library tags at the 3' ends are the same lengths.
  • n is a natural number greater than or equal to 1.
  • the library tags at the 5' end are selected from any one or more of the 96 groups shown in Table 1, and the library tags at the 3' end are selected from any one or more of the 96 groups shown in Table 1 that differ from the groups used at the 5' end; preferably, in the 8n amplification primer pairs with 8-tag balance, the library tags at the 5' end are selected from any one or more of the 48 groups shown in Table 2, and the library tags at the 3' end are selected from any one or more of the 48 groups shown in Table 2 that differ from the groups used at the 5' end.
  • each amplification primer pair also includes a 5'-end universal amplification sequence and a 3'-end universal amplification sequence
  • the 5'-end universal amplification sequence includes a universal sequence upstream of the 5'-end library tag and a universal sequence downstream of the 5'-end library tag, and the 3'-end universal amplification sequence includes a universal sequence upstream of the 3'-end library tag and a universal sequence downstream of the 3'-end library tag; preferably, the universal sequence upstream of the 5'-end library tag is SEQ ID NO:793 and the universal sequence downstream of the 5'-end library tag is SEQ ID NO:794, while the universal sequence upstream of the 3'-end library tag is SEQ ID NO:795 and the universal sequence downstream of the 3'-end library tag is SEQ ID NO:796; or
  • the universal sequence upstream of the 5'-end library tag is SEQ ID NO:793 and the universal sequence downstream of the 5'-end library tag is SEQ ID NO:797, while the universal sequence upstream of the 3'-end library tag is SEQ ID NO:795 and the universal sequence downstream of the 3'-end library tag is SEQ ID NO:798.
  • a sequencing library construction kit comprising any of the above-mentioned amplification primer compositions.
  • the kit also includes a bubble adapter comprising a first linker sequence and a second linker sequence, wherein the first linker sequence is SEQ ID NO:769 and the second linker sequence is SEQ ID NO:770, or the first linker sequence is SEQ ID NO:773 and the second linker sequence is SEQ ID NO:774.
  • a method for constructing a sequencing library based on an MGI sequencing platform is provided, and the method is constructed by using the above-mentioned kit.
  • a sequencing library comprising the above-mentioned paired-end library tag combination, or any of the above-mentioned amplification primer combinations.
  • the double-ended library tag composition requires that all 5'-end library tags have the same length, that all 3'-end library tags have the same length, and that each base occurs the same number of times at any given position, so that every base of the double-ended tags in the composition occurs with the same probability. Consequently, when adapters or library amplification primers are synthesized with these double-ended tags, multiple libraries whose double-ended tags are well base-balanced can be obtained.
  • when these libraries are pooled and sequenced, the double-ended tags of each library are read with high accuracy, which improves the effective split rate of the libraries.
  • Figure 1A, Figure 1B, and Figure 1C illustrate the advantages of MGI sequencing platforms using double-ended tags over single-ended tags to remove crosstalk
  • FIGS. 2A and 2B illustrate a single-ended tag linker for MGI
  • FIGS. 3A and 3B show double-ended tag linkers for MGI
  • Fig. 4 shows the realization process of two kinds of double-ended tags on MGI platform
  • FIG. 5 shows that the double-end tagging scheme of the present application is compatible with the single-end tagging amplicon scheme
  • Figure 6 shows that the double-ended tag amplification primer of the present application is compatible with a single-ended tag molecular tag adapter
  • FIGS 7A and 7B show 4-balanced and 8-balanced tag sequence base-balanced patterns
  • Figure 8 shows the comparison of base balance between 4-balance and 8-balance when multiple samples are pooled
  • FIG. 9 shows the output comparison of two library construction schemes
  • Figure 10 shows the differences in data splitting between 4-balance and 8-balance when 12 samples are pooled and sequenced.
  • Double-ended tag adapters: for high-throughput sequencing, the end of each fragment needs to be ligated to a universal sequencing adapter. Each of the non-complementary regions of the adapter has a variable sequence region; this sequence is the tag sequence, which is used to split the data during sequencing.
  • a DNA sequence consists of four bases, namely A, T, G and C.
  • a set of tag sequences is combined so that the four bases occur in equal proportions at each position of the tag sequence.
  • crosstalk between samples cannot be eliminated entirely and can only be controlled as far as possible; the best way to address it is to introduce double-ended tag sequences during library construction, as shown in Figure 1B.
  • the double-ended tagging scheme as shown in Figure 1C reduces crosstalk by a factor of 100 (1% to 0.01%) compared to the single-ended scheme.
  • the present application also attempts to solve the problem by changing the existing single-end tags of MGI to double-end tags.
  • the specific development ideas and process are as follows:
  • the library construction scheme of MGI is to use a bubble-shaped linker.
  • the single-end tag of MGI can be fused into the linker (as shown in Figure 2B), or it can be a separate scheme (as shown in Figure 2A);
  • the sequence of the end tag cannot be fused with the front-end sequence (as shown in Figure 3B): if the tag sequence is fused at the front end, the front-end complementary region is only 7 bp, so the bubble-like structure in the middle becomes longer, the structure is extremely unstable, and the efficiency is very poor.
  • the result is not as good as the truncated scheme in which the index sequence primer is separated from the universal linker, so only the structure consisting of a universal linker plus separate amplification primers carrying the double-end tags can be used (as shown in Figure 3A).
  • the inventors attached double-ended tags according to the structure shown in Figure 3A, but found in practice that the bubble in the middle of the bubble-shaped adapter is too large, which destabilizes the annealed secondary structure; poor annealing in turn lowers the adapter ligation efficiency (average ligation efficiency of 20%-40%).
  • MGI's bubble adapters differ from Illumina's Y adapters in that double-ended tags can be fused together.
  • the unpaired region in the middle of the MGI bubble adapter can be 30±5 bp; in this case, a paired region of 20±2 bp forms a stable annealed structure more easily, which improves ligation efficiency, as shown in Scheme 1 in Figure 4.
  • the unpaired region in the middle can also be 45±5 bp; in this case, a paired region of 25±2 bp gives a more stable annealed structure and higher ligation efficiency, as shown in Scheme 2 in Figure 4.
  • the inventors further found that, compared with Scheme 2, Scheme 1 has the following advantages.
  • first, an adapter with a 30±5 bp bubble region anneals stably and requires a shorter complementary region, which favors ligation;
  • second, it is compatible with single-end tag amplicons, and the amplicons can be switched between single-end and double-end tags, as shown in Figure 5; third, it is compatible with single-end molecular tag adapters, as shown in Figure 6.
  • although Scheme 1 has many advantages over Scheme 2, either scheme can be used to obtain a double-end tagged sequencing library for the MGI sequencing platform.
  • when the library constructed with double-end tags is sequenced and the data are split afterwards, the inventors also found that the base-balance requirements for MGI double-end tag adapters are stricter than those for single-end tags: the data can be split only when the tag sequences at both ends are matched, as shown in Figure 1B. That is, although double-ended tags solve the problem of sample crosstalk, they place extremely strict requirements on base balance during sequencing. Poor base balance seriously impairs accurate reading of the sequencing data, which in turn reduces the effective splitting of the data.
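The requirement that both end tags match before a read can be assigned can be illustrated with a minimal Python sketch. The sample sheet, the sample names and the 10-nt tags below are hypothetical and are not taken from Table 1 or Table 2; exact matching is an assumption made for brevity, since the text does not describe the matching tolerance used by the sequencer software.

```python
from typing import Dict, Optional, Tuple

# Hypothetical sample sheet: (5'-end tag, 3'-end tag) -> sample name.
# The tags are made up for illustration only.
SAMPLE_SHEET: Dict[Tuple[str, str], str] = {
    ("ACGTACGTAC", "TGCATGCATG"): "sample_A",
    ("CGTACGTACG", "GCATGCATGC"): "sample_B",
}


def assign_sample(
    tag5: str,
    tag3: str,
    sheet: Dict[Tuple[str, str], str] = SAMPLE_SHEET,
) -> Optional[str]:
    """Return the sample only if BOTH tags match one expected pair exactly.

    A read whose two tags belong to different samples (a possible sign of
    crosstalk) is not assigned to either sample and returns None.
    """
    return sheet.get((tag5.upper(), tag3.upper()))
```

A read carrying the 5'-end tag of one sample and the 3'-end tag of another is left unassigned rather than credited to either sample, which is how a double-ended scheme can filter out mispaired reads.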
  • the inventors optimized the base balance of the double-ended tags according to the following rules.
  • the rules of base screening are as follows: 1) any two tag sequences differ by at least 3 bases; 2) the GC content of each sequence is between 0.4 and 0.6; 3) no more than 3 identical bases occur consecutively.
  • the secondary structure of each selected tag sequence is then evaluated to check whether the tag sequence and the universal primer at the 3' end of the amplification primer form secondary structures such as hairpins, which would reduce the amplification efficiency of the amplification primer.
  • this would also affect the balance of each tag base across the pooled library, further affecting the reading accuracy of the tags and thereby reducing the accuracy of sequencing data splitting.
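The three screening rules stated above (at least 3 base differences between tags, GC content of 0.4 to 0.6, at most 3 consecutive identical bases) can be sketched in Python as follows. The greedy selection order and all function names are assumptions for illustration, not the procedure used by the inventors, and the secondary-structure evaluation is not covered.

```python
from itertools import groupby
from typing import Iterable, List


def gc_fraction(tag: str) -> float:
    """Fraction of G and C bases in the tag."""
    return sum(base in "GC" for base in tag) / len(tag)


def longest_run(tag: str) -> int:
    """Length of the longest stretch of consecutive identical bases."""
    return max(len(list(run)) for _, run in groupby(tag))


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def passes_single_tag_rules(tag: str) -> bool:
    """Rules 2 and 3: GC content within 0.4-0.6 and no run longer than 3."""
    return 0.4 <= gc_fraction(tag) <= 0.6 and longest_run(tag) <= 3


def screen(candidates: Iterable[str], min_diff: int = 3) -> List[str]:
    """Greedily keep tags that satisfy all three rules against the kept set."""
    kept: List[str] = []
    for tag in candidates:
        tag = tag.upper()
        if passes_single_tag_rules(tag) and all(
            hamming(tag, other) >= min_diff for other in kept
        ):
            kept.append(tag)
    return kept
```

Because the selection is greedy, the result depends on the order of the candidates; a different order may keep a different, equally valid subset.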
  • the present application optimizes 384 kinds of 4-label balance and 384 kinds of 8-label balance sequences
  • 4-tag balance refers to tag sequences balanced in groups of 4, as shown in Figure 7A (corresponding to the first tags 1-4 in Table 4).
  • within a group of 4 tag sequences, each of the bases A, T, G and C occurs once at every one of positions 1 to 10 of the tag.
  • 8-tag balance refers to tag sequences balanced in groups of 8, as shown in Figure 7B (corresponding to the first tags 1-8 in Table 5): within a group of 8 tag sequences, each of the bases A, T, G and C occurs twice at every one of positions 1 to 10 of the tag.
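The balance definition (each of A, T, G and C once per position in a group of 4, twice per position in a group of 8) can be checked mechanically. The following Python sketch uses a hypothetical function name, and the tags used to exercise it are invented rather than taken from the patent tables.

```python
from collections import Counter
from typing import Sequence


def is_position_balanced(tags: Sequence[str]) -> bool:
    """True if every base occurs equally often at every tag position.

    For a group of 4 tags this means once per position; for a group of
    8 tags it means twice per position.
    """
    if not tags or len({len(tag) for tag in tags}) != 1:
        return False  # empty group or unequal tag lengths
    expected = len(tags) / 4  # required count of each base per position
    for column in zip(*tags):  # iterate over positions, not over tags
        counts = Counter(base.upper() for base in column)
        if any(counts[base] != expected for base in "ACGT"):
            return False
    return True
```

For example, the invented group `["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT", "TACGTACGTA"]` is balanced because every column is a permutation of the four bases, whereas any three of those tags are not.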
  • the proportion of each base is 0-50%.
  • the proportion of each base after the library tag combination can reach a balance of 25%.
  • the proportion of each base in the library tags of the 8-balanced combination was between 16.7% and 33.3%.
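The percentages quoted above can be reproduced with a small worst-case calculation. The sketch below assumes that samples fill complete balanced groups first and that any remaining samples come from one further, partially used group; this filling rule and the function name are assumptions for illustration, not statements from the patent.

```python
from typing import Tuple


def base_fraction_bounds(n_samples: int, group_size: int) -> Tuple[float, float]:
    """Lowest and highest possible share of any one base at a tag position.

    Assumes tags are taken group by group from balanced groups of
    `group_size` tags (4 or 8), so only the last group may be partial.
    """
    per_base = group_size // 4  # copies of each base per position in a full group
    full_groups, rest = divmod(n_samples, group_size)
    # In the partial group, a base can be missing unless the other three
    # bases cannot account for all `rest` tags; it can appear at most
    # `per_base` times.
    low = full_groups * per_base + max(0, rest - 3 * per_base)
    high = full_groups * per_base + min(rest, per_base)
    return low / n_samples, high / n_samples
```

Under these assumptions, 12 pooled samples give exactly 25% per base with 4-balanced groups, and between 2/12 (about 16.7%) and 4/12 (about 33.3%) with 8-balanced groups; 4 samples drawn from a single 8-balanced group give between 0% and 50%.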
  • the balance of non-integer multiples of 4-balance is also better than the combination of 8-balance, and the application of 4-balance is more conducive to the arrangement of the MGI sequencer.
  • with the 384 4-balanced tags optimized in this application, each consecutive group of four tags is balanced, which makes it easier to arrange sequencing runs (see Table 1 for the 384 4-balanced tag sequences).
  • likewise, with the 384 optimized 8-balanced tags, each consecutive group of eight tags is balanced, which also makes it easier to arrange sequencing runs (see Table 2 for the 384 8-balanced tag sequences).
  • primer 1 takes the 384 tags in forward order and primer 2 takes the 384 tags in reverse order; this is only one preferred arrangement recommended by the present invention.
  • in practical applications, the combination can also be arranged according to actual needs.
  • primer 1 selects any one of the 96 groups, primer 2 can select any one of the remaining 95 groups.
  • primer 1 selects the first 3 groups, and primer 2 can arbitrarily select 3 groups from the remaining 93 groups.
  • the paired-end library tags can be selected according to this rule.
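The recommended forward/reverse arrangement and the group selection rule can be sketched in Python. The sketch assumes tags are numbered 1 to 384 and that consecutive tags form groups of four (96 groups in total); the function names are hypothetical.

```python
from math import comb
from typing import List, Tuple


def reverse_order_pairs(n_tags: int = 384) -> List[Tuple[int, int]]:
    """Pair primer 1 tag numbers (forward order) with primer 2 tag numbers
    (reverse order): 1 with n_tags, 2 with n_tags - 1, and so on."""
    return [(i, n_tags + 1 - i) for i in range(1, n_tags + 1)]


def group_of(tag_number: int, group_size: int = 4) -> int:
    """1-based group index of a 1-based tag number."""
    return (tag_number - 1) // group_size + 1


def count_group_choices(n_groups: int = 96, k: int = 1) -> int:
    """Number of ways to pick k groups for primer 1 and then k different
    groups for primer 2 from the remaining n_groups - k groups."""
    return comb(n_groups, k) * comb(n_groups - k, k)
```

With an even number of tags, no tag is ever paired with itself, and group g on primer 1 always meets group 97 - g on primer 2, so the two ends of a library never draw from the same group. `count_group_choices(96, 1)` evaluates to 96 × 95 = 9120.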
  • the 4-balanced combination has obvious advantages over the 8-balanced combination: besides its advantage at sample counts that are multiples of 4 but not of 8 (4, 12, 20), it is also better balanced at other sample counts, and at 4n+1 and 4n+2 samples its balance is better than that of the 8-balanced combination for the same number of pooled samples.
  • compared with 8-balance, 4-balance has the following advantages: 1) the 4-balanced scheme offers twice as many balanced sample combinations as the 8-balanced scheme; 2) among the three kinds of unbalanced arrangements, the 4n+1 and 4n+2 combinations are also better balanced than the 8-balanced combination; 3) when the amount of sequencing data differs between samples, the 4-balanced combination is easier to arrange close to balance: samples with a large amount of data are placed preferentially in balanced combinations, while samples with a small amount of data can be left unbalanced.
  • the data split rate on the sequencer is higher for a balanced group, because the sequencer reads bases in a balanced composition more accurately, whereas unbalanced bases cause read errors and reduce the data split rate.
  • libraries were built with the 4-balanced and the 8-balanced tag sequences and sequenced. The data splitting results in Figure 10 show that, with 4-balanced tag sequences, the split data of the 12 samples fluctuate little, whereas with 8-balanced tag sequences the split data of some of the 12 samples are significantly reduced.
  • in a typical embodiment of the present application, a double-ended library tag composition is provided, comprising a plurality of 5'-end library tags and a plurality of 3'-end library tags, wherein the 5'-end library tags are all of the same length, the 3'-end library tags are all of the same length, and, within the composition, each base occurs the same number of times at any given position.
  • in the double-ended library tag composition provided by the present application, all 5'-end library tags have the same length, all 3'-end library tags have the same length, and each base occurs the same number of times at any given position. Therefore, when adapters or library amplification primers are synthesized with the double-ended tags in the composition, multiple libraries with well-balanced double-ended tag bases can be obtained; when these libraries are pooled and sequenced, the double-ended tags of each library are read with high accuracy, which improves the effective split rate of the libraries.
  • the 5'-end library tags have the same length as the 3'-end library tags, preferably any fixed length between 6 and 10 bp. Because the tags at both ends have the same length, they contribute the same number of bases when the source of a sample is determined during splitting, so both ends provide support with the same probability. This avoids the situation in which a longer tag at one end provides more support and a shorter tag at the other end provides less, which would bias the split result toward the tag at one end.
  • in the double-ended library tag composition, there are at least 3 base differences between any two library tags, and the number of consecutive identical bases in any one library tag does not exceed 3; preferably, the GC content of any one library tag is 40-60%. When library tags that meet these base optimization principles are used in combination, the base reading balance is better, the reading results are more accurate, and the data split rate is also higher.
  • the double-ended library tag composition comprises a 4-tag balanced combination or an 8-tag balanced combination of double-ended library tags, wherein the 4-tag balanced combination consists of 4n library tags at the 5' end and 4n library tags at the 3' end, and the 8-tag balanced combination consists of 8n library tags at the 5' end and 8n library tags at the 3' end, where n is a natural number greater than or equal to 1.
  • the library tags at the 5' end are selected from any one or more of the 96 groups shown in Table 1, and the library tags at the 3' end are selected from any one or more of the 96 groups shown in Table 1 that differ from the groups used at the 5' end.
  • the library tags at the 5' end are selected from any one or more of the 48 groups shown in Table 2, and the library tags at the 3' end are selected from any one or more of the 48 groups shown in Table 2 that differ from the groups used at the 5' end.
  • an amplification primer composition with double-end library tags based on the MGI sequencing platform comprises a plurality of amplification primer pairs with double-end library tags. Each amplification primer pair includes a library tag at the 5' end and a library tag at the 3' end; the 5'-end library tags of the amplification primer pairs are all of the same length, the 3'-end library tags are all of the same length, and each base occurs the same number of times at any given position.
  • because the 5'-end library tags of all amplification primer pairs in the combination have the same length, the 3'-end library tags also have the same length, and each base occurs the same number of times at any given position, the tag bases are read in a balanced manner when the double-ended tags in the primer composition are used to label multiple pooled samples for sequencing. The reading results are therefore more accurate, the sample data split according to the tags are also more accurate, and the sample split rate is improved.
  • on the basis that the 5'-end library tags of the pooled samples have the same length and the 3'-end library tags have the same length, in order to further improve the base balance and reading accuracy of the library tags, in a preferred embodiment the lengths of the 5'-end library tags of the amplification primer pairs are the same as the lengths of the 3'-end library tags.
  • because the tags at both ends of each amplification primer pair have the same length, they contribute the same number of bases when the source of a sample is determined during splitting, so both ends provide support with the same probability, which avoids a split result that relies more heavily on the tag at one end.
  • the length of the library tag at the 5' end and the library tag at the 3' end is any fixed length between 6 and 10 bp, more preferably 10 bp.
  • the preferred length here is 10 bp, which has a greater degree of discrimination and more beneficial effects of selection and combination than other lengths such as 6 bp or 8 bp.
  • in the above amplification primer composition, there are at least 3 base differences between any two library tags; the number of consecutive identical bases in any library tag does not exceed 3; more preferably, the GC contents of the library tags at the 5' end and the library tags at the 3' end are both 40-60%.
  • the above-mentioned amplification primer composition comprises a combination of 4 sets of tag-balanced 4n amplification primer pairs, or a combination of 8 sets of tag-balanced 8n amplification primer pairs, wherein n is A natural number greater than or equal to 1.
  • the library tags at the 5' end are selected from any one or more of the 96 groups shown in Table 1 above, and the library tags at the 3' end are selected from any one or more of the 96 groups shown in Table 1 that differ from the groups used at the 5' end.
  • the number of groups here is determined according to actual needs.
  • the combination of 96 groups of tag sequences in Table 1 has higher reading accuracy, so the data splitting is more accurate, and the splitting rate is also higher.
  • the library tags at the 5' end are selected from any one or more of the 48 groups shown in Table 2, and the library tags at the 3' end are selected from any one or more of the 48 groups shown in Table 2 above that differ from the groups used at the 5' end.
  • each amplification primer pair also includes a 5'-end universal amplification sequence and a 3'-end universal amplification sequence
  • the 5'-end universal amplification sequence includes a universal sequence located upstream of the library tag at the 5' end and the universal sequence located downstream of the library tag at the 5' end
  • the universal amplified sequence at the 3' end includes the universal sequence located upstream of the library tag at the 3' end and the universal sequence located downstream of the library tag at the 3' end.
  • the specific sequence of the universal amplification sequence in each amplification primer pair above is determined according to the universal sequence of the existing MGI sequencing platform.
  • the amplification primer combination formed by amplification primer pairs comprising the above-mentioned improved library tags of the present application can improve the reading accuracy of the library tags when pooled samples are sequenced, thereby improving both the accuracy and the split rate when the sequencing data of each sample are split.
  • the library construction can use a relatively short bubble adapter (30±5 unpaired bases in the middle region) or a relatively long bubble adapter (45±5 unpaired bases in the middle region).
  • the universal sequence in the amplification primer pair here can also be adjusted to a longer or shorter universal amplification sequence according to the length of the bubble-shaped junction.
  • the universal sequence upstream of the library tag at the 5' end is SEQ ID NO: 793
  • the universal sequence downstream of the library tag at the 5' end is SEQ ID NO: 794
  • the generic sequence upstream of the library tag at the 3' end is SEQ ID NO: 795
  • the generic sequence downstream of the library tag at the 3' end is SEQ ID NO: 796.
  • the universal sequence located upstream of the library tag at the 5' end is SEQ ID NO: 793, and the universal sequence located downstream of the library tag at the 5' end is SEQ ID NO: 797; the generic sequence upstream of the library tag at the 3' end is SEQ ID NO: 795, and the generic sequence downstream of the library tag at the 3' end is SEQ ID NO: 798.
  • a library construction kit based on the MGI sequencing platform, the kit comprising any of the above-mentioned amplification primer compositions.
  • the above-mentioned kit may further include a bubble-shaped adapter for the MGI sequencing platform; the bubble-shaped adapter includes a first linker sequence and a second linker sequence, wherein the first linker sequence is SEQ ID NO:769 and the second linker sequence is SEQ ID NO:770, or the first linker sequence is SEQ ID NO:773 and the second linker sequence is SEQ ID NO:774.
  • the improved short bubble adapters are more stable, more efficient and more compatible than the longer bubble adapters in procedures such as PCR amplification after adapter ligation.
  • a method for constructing a sequencing library based on an MGI sequencing platform is also provided, which is constructed by any of the above-mentioned kits.
  • when libraries constructed with the above kit of the present application are pooled for on-machine sequencing, the library tags are better balanced and read more accurately, so the subsequent splitting of each sample's sequencing data is more accurate and the data splitting rate is higher.
  • a sequencing library is also provided, and the sequencing library comprises any of the above-mentioned amplification primer compositions, or is constructed by any of the above-mentioned methods.
  • the library tags of multiple samples in the sequencing library are more balanced, the reading accuracy of the library tags after on-machine sequencing is higher, and the subsequent library splitting rate is also higher.
  • DNA sample fragmentation --- end repair and A-tailing --- adapter ligation --- fragment size selection --- PCR amplification --- library purification, quantification, and quality control --- sequencing on the MGI platform, or sequencing after targeted capture.
  • SEQ ID NO: 769 (31 bp): /phos/agtcggaggccaagcggtcttaggaagacaa;
  • SEQ ID NO: 770 (40 bp): ttgtcttcctaacaggaacgacatggctacgatccgact*t.
  • SEQ ID NO: 772 (52 bp): gcatggcgaccttatcagnnnnnnnnnnttgtcttcctaagaccgcttggcc, where the sequence before nnnnnnnnnn (gcatggcgaccttatcag) is denoted SEQ ID NO: 795 and the sequence after nnnnnnnnnn (ttgtcttcctaagaccgcttggcc) is denoted SEQ ID NO: 796 (the final two bases, cc, bold and underlined in the original, are the part added relative to Scheme 2).
  • the complementary region of the adapter is 7+13 bp (within the 20±2 bp range), and the middle bubble region is 20+12 bp (within the 30±5 bp range);
  • the amplification primers are relatively long.
  • the annealed structure is stable.
  • the amplification primers are compatible with the single-ended amplicon solution and the molecular tag adapter solution (see the patent for molecular tagging of plasma library construction with the application number of 201910229527.4).
  • SEQ ID NO: 773 (35bp): /phos/agtcggaggccaagcggtcttaggaagacaatcag.
  • the complementary region of the adapter is 7+17 bp (within the 25±2 bp range), and the middle bubble region is 34+12 bp (within the 45±5 bp range);
  • the amplification primer is relatively short, see the amplification primer section.
  • the amplification primer has poor compatibility and is not compatible with any other scheme (due to the relatively short sequence of the amplification primer and the lack of overlapping region with the bubble region of scheme 1, it is difficult to be compatible with the linker sequence of scheme 1).
  • Scheme 1 and Scheme 2 both build libraries successfully, and their library yields are similar, as shown in Figure 9.
  • Scheme 2, however, is not compatible with the amplicons and molecular tag adapters developed for single-end tags on the MGI platform.
  • Example 2: comparison of data splitting for 12 pooled samples with 4-balanced and 8-balanced tags
  • the double-ended tagging scheme effectively removes crosstalk between samples (also called tag hopping), but because valid sequencing data can be split out only when the tags at both ends are correct, the tag balance requirement during sequencing is stricter than for single-end tags.
  • This application optimized two sets of schemes, 4-balance and 8-balance. In this example, each scheme was used to sequence a pool of 12 libraries in order to measure the effective splitting rate of each sample under the two schemes.
  • the specific experimental steps and information are as follows:
  • the 4-balanced double-ended tag sequences used in the experiment are shown in Table 4 below.
  • the adjacent 4 groups are balanced, and each group is distinguished by bold or non-bold fonts.
  • Tag 1 is the 384 sequences in forward order.
  • Tag 2 is the 384 sequences in reverse order.
  • tag 1 of primer 1 and tag 384 of primer 2 form the first double-ended tag primer combination; tag 2 of primer 1 and tag 383 of primer 2 form the second; continuing in this order gives 384 combinations.
  • libraries were constructed from the same human genome standard using 12 4-balanced and 12 8-balanced double-ended tag combinations, respectively. The double-ended tag sequences of the 12 4-balanced libraries follow the order listed in Table 4, and those of the 12 8-balanced libraries follow the order listed in Table 5.
  • the 4-balanced and 8-balanced libraries were sequenced and analyzed with paired-end tags on the MGI sequencing platform, respectively.
  • Two rounds of splitting are performed on the offline data of the two groups of mixed-sample libraries.
  • the first round splits with the maximum error tolerance (a setting that also recovers reads containing sequencing errors), and the second round allows only one mismatch per tag.
  • the results after data splitting are shown in Figure 10.
  • the splitting rate of the 12 pooled 4-balanced libraries is more stable, whereas that of the 12 pooled 8-balanced libraries fluctuates more. This shows that strictly balanced double-ended tags favor effective splitting on MGI sequencers: the 8-balanced design improves the effective splitting rate to some extent, while the 4-balanced design splits data more effectively still.
  • the 48 groups of 8-balanced tag sequences in this application were designed with compatibility in mind, so that they can be used on the sequencer alongside the 12 groups of 8-balanced tag sequences provided by MGI Tech (华大智造); therefore, any sequence from the application's 48 groups and any sequence from MGI Tech's 12 groups differ by at least 3 bases.
  • the base composition of the tag sequences of the present invention is more balanced, with a GC content of 40%-60%, whereas the GC content of MGI Tech's tags is 20%-80%;
  • the tag sequences of the present invention were all checked computationally for compatibility with the adapter sequences of Scheme 1 to ensure balanced amplification efficiency across the amplified libraries, whereas individual MGI Tech sequences do not meet the amplification balance requirement.
  • the yields of the 8-balanced group of the present invention are fairly even, whereas one MGI Tech library yields only about half the typical value, indicating that the screened and optimized tag sequences of the present invention are better balanced and amplify more consistently.
  • the two sets of 384 tags of the present invention meet the throughput needs of multi-sample pooled sequencing better than MGI Tech's 120 tags.
  • the present application introduces double-ended library tags on the MGI sequencing platform and splits data using the tag sequences at both ends of each sample, which eliminates the crosstalk introduced during synthesis, experimental handling, and on-machine sequencing and makes the detection results more accurate.
  • by testing and optimizing the bubble adapter structure, the application found that the effect is best when the unpaired middle region of the bubble adapter is 30±5 bp and the paired region is 20±2 bp; bubble adapters built this way anneal most stably, and the corresponding amplification primers are extended primers that are compatible with single-end-tagged amplicons and molecular tag adapters.
  • when a bubble adapter of this structure is used together with the extended amplification primers (carrying double-ended library tags) for library construction, it is compatible with the modules of the existing single-end tag solution on the MGI platform, which is convenient for sequencing on MGI sequencers.
  • the application optimized 384 tag sequences each for 4-balance and 8-balance, providing an optimal solution for high-throughput sequencing and on-machine data splitting on MGI sequencers.


Abstract

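The present invention provides a double-ended library tag composition and its application on the MGI sequencing platform. The double-ended library tag composition comprises a plurality of 5'-end library tags and a plurality of 3'-end library tags; the 5'-end library tags all have the same length, the 3'-end library tags all have the same length, and within the composition each base occurs the same number of times at each position. Splitting data with the optimized double-ended library tags resolves the crosstalk introduced during synthesis, experimental handling, and on-machine sequencing. Because every 5'-end library tag has the same length, every 3'-end library tag has the same length, and each base occurs equally often at each position, multiple libraries with well-balanced double-ended tag bases can be obtained; when these libraries are pooled for on-machine sequencing, the double-ended tags of each library are read with high accuracy, which raises the effective library splitting rate.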

Description

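Double-ended library tag composition and its application on the MGI sequencing platform
Technical Field
The present invention relates to the field of plasma DNA library construction, and in particular to a double-ended library tag composition and its application on the MGI sequencing platform.
Background
During sequencing on MGI high-throughput sequencers, sequencing more samples per run requires labeling each sample with a distinct tag sequence (index) and splitting the data after sequencing. However, the MGI sequencing platform currently uses almost exclusively single-end-tagged libraries. Single-end tags (indexes) have an inherent defect that readily leads to crosstalk between samples. Because contamination of tag adapters or primers can arise during synthesis, experimental handling, and sequencing, some crosstalk is unavoidable, so low-frequency crosstalk between samples must be addressed. The best approach at present is double-ended tagging, which effectively removes crosstalk between samples.
However, compared with single-end tags, with double-ended tags whether the sequencer can read the tag sequences accurately strongly affects how effectively the sequencing data can be split. If the double-ended tag sequences are read incorrectly, the effective splitting rate of the sequencing data falls, which in turn raises sequencing cost.
Therefore, how to label pooled libraries with double-ended tags so as to reduce sample crosstalk on the one hand, and raise the effective data splitting rate after multi-sample pooled sequencing on the other, remains a problem to be solved.
Summary of the Invention
The main purpose of the present invention is to provide a double-ended library tag composition and its application on the MGI sequencing platform, so as to solve the problem that sample crosstalk readily occurs when libraries on the existing MGI sequencing platform are labeled with single-end tags.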
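To achieve the above purpose, according to one aspect of the present application, a double-ended library tag composition is provided. The composition comprises a plurality of 5'-end library tags and a plurality of 3'-end library tags; the 5'-end library tags all have the same length, the 3'-end library tags all have the same length, and within the composition each base occurs the same number of times at each position.
Further, the 5'-end library tags and the 3'-end library tags have the same length, preferably any fixed length between 6 and 10 bp. Preferably, any two library tags in the composition differ by at least 3 bases, and no library tag contains more than 3 consecutive identical bases. Preferably, the GC content of every library tag is 40-60%. Preferably, the composition comprises a 4-tag-balanced combination or an 8-tag-balanced combination of double-ended library tags, where the 4-tag-balanced combination consists of 4n 5'-end library tags and 4n 3'-end library tags, the 8-tag-balanced combination consists of 8n 5'-end library tags and 8n 3'-end library tags, and n is a natural number greater than or equal to 1.
Further, in the 4-tag-balanced combination, the 5'-end library tags are selected from any one or more of the 96 groups shown in Table 1, and the 3'-end library tag groups are selected from any one or more of the 96 groups shown in Table 1 other than the groups used for the 5'-end library tags.
Further, in the 8-tag-balanced combination, the 5'-end library tags are selected from any one or more of the 48 groups shown in Table 2, and the 3'-end library tag groups are selected from any one or more of the 48 groups shown in Table 2 other than the groups used for the 5'-end library tags.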
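According to a second aspect of the present invention, an amplification primer composition with double-ended library tags for the MGI sequencing platform is provided. The amplification primer composition comprises a combination of a plurality of amplification primer pairs carrying double-ended library tags; each amplification primer pair includes a 5'-end library tag and a 3'-end library tag; the 5'-end library tags of the primer pairs all have the same length, the 3'-end library tags of the primer pairs all have the same length, and each base occurs the same number of times at each position.
Further, the 5'-end library tags and the 3'-end library tags of the amplification primer pairs have the same length. Preferably, both are any fixed length between 6 and 10 bp. Preferably, any two library tags in the amplification primer composition differ by at least 3 bases, and no library tag contains more than 3 consecutive identical bases. Preferably, the GC content of the 5'-end library tags and of the 3'-end library tags is 40-60%. Preferably, the amplification primer composition comprises a 4-tag-balanced combination of 4n amplification primer pairs or an 8-tag-balanced combination of 8n amplification primer pairs, where n is a natural number greater than or equal to 1.
Further, among the 4-tag-balanced 4n amplification primer pairs, the 5'-end library tags are selected from any one or more of the 96 groups shown in Table 1, and the 3'-end library tag groups are selected from any one or more of the 96 groups shown in Table 1 other than the groups used for the 5'-end library tags. Preferably, among the 8-tag-balanced 8n amplification primer pairs, the 5'-end library tags are selected from any one or more of the 48 groups shown in Table 2, and the 3'-end library tag groups are selected from any one or more of the 48 groups shown in Table 2 other than the groups used for the 5'-end library tags.
Further, each amplification primer pair also includes a 5'-end universal amplification sequence and a 3'-end universal amplification sequence. The 5'-end universal amplification sequence comprises a universal sequence upstream of the 5'-end library tag and a universal sequence downstream of it; the 3'-end universal amplification sequence comprises a universal sequence upstream of the 3'-end library tag and a universal sequence downstream of it. Preferably, the universal sequence upstream of the 5'-end library tag is SEQ ID NO: 793 and the one downstream is SEQ ID NO: 794, while the universal sequence upstream of the 3'-end library tag is SEQ ID NO: 795 and the one downstream is SEQ ID NO: 796; or
the universal sequence upstream of the 5'-end library tag is SEQ ID NO: 793 and the one downstream is SEQ ID NO: 797, while the universal sequence upstream of the 3'-end library tag is SEQ ID NO: 795 and the one downstream is SEQ ID NO: 798.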
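According to a third aspect of the present invention, a sequencing library construction kit is provided; the kit includes any of the above amplification primer compositions.
Further, the kit also includes a bubble adapter comprising a first adapter sequence and a second adapter sequence, where the first adapter sequence is SEQ ID NO: 769 and the second is SEQ ID NO: 770, or the first is SEQ ID NO: 773 and the second is SEQ ID NO: 774.
According to a fourth aspect of the present invention, a method for constructing a sequencing library for the MGI sequencing platform is provided; the method uses the above kit.
According to a fifth aspect of the present invention, a sequencing library is provided; the sequencing library includes the above double-ended library tag composition or any of the above amplification primer compositions.
With the technical solution of the present invention, introducing double-ended library tags and an optimized combination of them, and splitting data by the double-ended tags, resolves the crosstalk introduced during synthesis, experimental handling, and on-machine sequencing and makes detection results more accurate. Further, by requiring every 5'-end library tag in the combination to have the same length, every 3'-end library tag to have the same length, and each base to occur the same number of times at each position, the bases of the double-ended tags in the composition occur with equal probability. When adapters or library amplification primers carrying these tags are synthesized, multiple libraries with well-balanced tag bases are obtained; when these libraries are pooled for on-machine sequencing, the double-ended tags of each library are read with high accuracy, which raises the effective library splitting rate.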
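Brief Description of the Drawings
The accompanying drawings, which form part of this application, provide a further understanding of the present invention; the illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Figures 1A, 1B, and 1C show the advantage of double-ended tags over single-end tags in removing crosstalk on the MGI sequencing platform;
Figures 2A and 2B show MGI single-end tag adapters;
Figures 3A and 3B show MGI double-ended tag adapters;
Figure 4 shows how the two double-ended tag library construction schemes are implemented on the MGI platform;
Figure 5 shows that the double-ended tag scheme of the present application is compatible with the single-end tag amplicon scheme;
Figure 6 shows that the double-ended tag amplification primers of the present application are compatible with single-end-tag molecular tag adapters;
Figures 7A and 7B show the base balance of the 4-balanced and 8-balanced tag sequences;
Figure 8 compares the base balance of 4-balance and 8-balance when multiple samples are pooled;
Figure 9 compares the yields of the two library construction schemes;
Figure 10 shows the difference in data splitting between 4-balance and 8-balance when 12 samples are pooled and sequenced.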
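Detailed Description
It should be noted that, where no conflict arises, the embodiments in this application are illustrative special cases rather than exclusive limitations, and the embodiments and the features in them can be combined with one another. The present invention is described in detail below with reference to the embodiments.
Explanation of terms:
Double-ended tag adapter: high-throughput sequencing requires a universal sequencing adapter to be ligated to the end of every fragment. Each non-complementary region of the adapter carries a variable sequence region; this is the tag sequence, which is used to split the data at sequencing time.
Tag sequence base balance: DNA sequences are composed of four bases, A, T, G, and C. For effective reading during sequencing, a group of tag sequences is assembled so that the four bases are present in equal proportions at every position of the tag.
As mentioned in the Background, libraries built with single-end tag sequences for MGI high-throughput sequencers show a certain proportion of crosstalk between samples. (The same phenomenon exists on Illumina sequencing platforms. Although the sequencing process on the MGI platform differs greatly from that on Illumina platforms, tag crosstalk between samples cannot be avoided during adapter synthesis, library construction, and hybrid capture.) As shown in Figure 1A, 1% crosstalk at an experimental step has the same effect whether it arises in adapter synthesis, library construction, hybrid capture, or on-machine sequencing. The best current way to resolve crosstalk between samples is to introduce double-ended tag sequences during library construction; as shown in Figure 1B, crosstalk can only be addressed by controlling each experimental step as tightly as possible while also introducing double-ended tags. As shown in Figure 1C, the double-ended tag scheme reduces crosstalk 100-fold relative to the single-end scheme (from 1% to 0.01%).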
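The 100-fold figure can be read as the product of two per-end error rates. The worked form below is only a sketch: it assumes, which the source does not state explicitly, that a tag at each end is mis-assigned independently with the same probability p, so that a read is wrongly assigned only when both ends are wrong in a matching way.

```latex
\[
P_{\text{single}} = p = 1\%, \qquad
P_{\text{double}} = p \times p = (0.01)^2 = 0.01\%, \qquad
\frac{P_{\text{single}}}{P_{\text{double}}} = 100 .
\]
```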
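Therefore, to solve the sample crosstalk problem on the MGI sequencing platform, this application likewise set out to replace MGI's existing single-end tags with double-ended tags. The development approach and history are as follows:
MGI library construction uses a bubble adapter, unlike Illumina's Y-shaped adapter. An MGI single-end tag can be fused into the adapter (Figure 2B) or kept separate from it (Figure 2A). The double-ended tag sequences, however, cannot be fused with the front-end sequence (Figure 3B): because the front-end complementary region is only 7 bp, fusing a tag sequence there would make the middle bubble even longer; such a structure is very unstable and inefficient, and it performs worse than the truncated scheme in which tagged primers are kept separate from a universal adapter. Only the form with a universal adapter plus separate double-ended-tag amplification primers can be used (Figure 3A). The inventors attached double-ended tags using the structure in Figure 3A, but found in practice that too large a bubble in the middle of the bubble adapter destabilizes the annealed secondary structure, and poor annealing lowers adapter ligation efficiency (average ligation efficiency 20%-40%). In this respect the MGI bubble adapter differs from Illumina's Y-shaped adapter, in which the double-ended tags can be fused into the adapter.
Further study found that the unpaired middle region of the MGI bubble adapter can be 30±5 bp, in which case a paired region of 20±2 bp more readily forms a stable annealed duplex and thus raises ligation efficiency (Scheme 1 in Figure 4). The unpaired middle region can also be 45±5 bp, in which case a paired region of 25±2 bp gives more stable annealing and higher ligation efficiency (Scheme 2 in Figure 4). On further comparison, the inventors found that Scheme 1 has the following advantages over Scheme 2. First, with a 30±5 bp bubble the adapter anneals stably and needs less complementary sequence, and this stability favors ligation. Second, it is compatible with single-end-tag amplicons, so amplicons can be switched between single-end and double-ended tagging (Figure 5). Third, it is compatible with single-end molecular tag adapters (Figure 6).
As the research went deeper, the inventors also found that, although Scheme 1 has many advantages over Scheme 2, either scheme can produce a sequencing library with double-ended tags for the MGI sequencing platform. When libraries built with double-ended tags were then sequenced and the data split, the inventors further found that MGI double-ended tag adapters impose a stricter base balance requirement during sequencing than single-end tags, because data can be split out only when the tag sequences at both ends are both correct (Figure 1B). In other words, although double-ended tags solve the sample crosstalk problem, they place an extremely strict requirement on base balance during on-machine sequencing; poor base balance seriously impairs accurate reading of the sequencing data and hence effective data splitting.
To split data more accurately, and taking double-ended tags of 10 bases each as an example, the inventors optimized the base balance of the double-ended tags according to the following screening rules: 1) any two tag sequences differ by at least 3 bases; 2) the GC content of each sequence is kept between 0.4 and 0.6; 3) no more than 3 identical bases occur consecutively. Each tag sequence that passed these rules was then assessed for secondary structure, to determine whether the tag forms a hairpin or other secondary structure with the universal primer sequence at the 3' end of the amplification primer. Such structures would lower the amplification efficiency of that primer, disturb the balance of tag bases across the pooled library, reduce the reading accuracy of the tags, and so reduce the accuracy of splitting the sequencing data.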
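The three screening rules can be expressed as a small filter. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, a greedy pass is assumed for enforcing the pairwise distance (the source does not say how candidates were selected), and the secondary-structure assessment described above is not modeled.

```python
from itertools import groupby


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def gc_in_range(tag: str, low_pct: int = 40, high_pct: int = 60) -> bool:
    """Rule 2: GC content between 40% and 60% (integer arithmetic avoids
    floating-point edge cases at exactly 40% or 60%)."""
    gc = sum(base in "gc" for base in tag.lower())
    return low_pct * len(tag) <= 100 * gc <= high_pct * len(tag)


def longest_run(tag: str) -> int:
    """Length of the longest stretch of identical consecutive bases."""
    return max(len(list(run)) for _, run in groupby(tag.lower()))


def passes_single_tag_rules(tag: str, max_run: int = 3) -> bool:
    """Rules 2 and 3 applied to one tag."""
    return gc_in_range(tag) and longest_run(tag) <= max_run


def screen(candidates, min_distance: int = 3):
    """Greedy filter (an assumption, not the source's method): keep a tag if
    it passes rules 2-3 and differs from every tag already kept by at least
    min_distance bases (rule 1)."""
    kept = []
    for tag in candidates:
        tag = tag.lower()
        if passes_single_tag_rules(tag) and all(
            hamming(tag, other) >= min_distance for other in kept
        ):
            kept.append(tag)
    return kept
```

Applied to tags 1-4 of Table 4 (tcacattgct, aatggcgctc, gtctcaatga, cggatgcaag), `screen` keeps all four, since each has 40% or 60% GC, no run longer than 2, and the four differ from one another at every position.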
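Following the above screening and optimization rules, this application optimized 384 4-tag-balanced sequences and 384 8-tag-balanced sequences. 4-tag balance means that the tag sequences are balanced in groups of four: as shown in Figure 7A (corresponding to tags 1-4 in Table 4), within a group of four tag sequences, each of the bases A, T, G, and C occurs exactly once at each of positions 1 to 10. Likewise, 8-tag balance means that the tag sequences are balanced in groups of eight: as shown in Figure 7B (corresponding to tags 1-8 in Table 5), within a group of eight tag sequences, each of the bases A, T, G, and C occurs exactly twice at each of positions 1 to 10.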
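The balance definition can be checked mechanically. The following is a minimal sketch with a hypothetical function name; it treats a set of tags as balanced when every position contains each of the four bases the same number of times, and uses the Tag 1 sequences listed in Tables 4 and 5 as inputs.

```python
from collections import Counter


def is_position_balanced(tags) -> bool:
    """True if each of a, c, g, t occurs equally often at every position."""
    tags = [t.lower() for t in tags]
    if not tags or len(tags) % 4 != 0:
        return False  # equal shares of four bases need a multiple of 4 tags
    if len({len(t) for t in tags}) != 1:
        return False  # all tags must have the same length
    expected = len(tags) // 4
    for column in zip(*tags):  # one column per tag position
        counts = Counter(column)
        if any(counts[base] != expected for base in "acgt"):
            return False
    return True


# Tags 1-4 of Table 4 (one 4-balanced group).
FOUR_GROUP = ["tcacattgct", "aatggcgctc", "gtctcaatga", "cggatgcaag"]

# Tags 1-8 of Table 5 (one 8-balanced group).
EIGHT_GROUP = [
    "cgtcgatgac", "atataaggcg", "gatcgtgctc", "cagtcttcgg",
    "agaacgatct", "ttggtgcatt", "gccgtcataa", "tccaaccaga",
]
```

With these inputs, the four-tag group and the eight-tag group each pass, while the first four tags of the eight-tag group alone do not, which is why an 8-balanced set needs pools in multiples of 8.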
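Repeated experiments in this application showed that a group of four is the smallest balanced unit and the optimal combination. 4-balanced groups can be combined into balanced sets of 4, 8, 12, 16, and other multiples of 4, whereas 8-balanced groups must be combined into sets of 8, 16, and other multiples of 8. As shown in Figure 8 (the 4-balanced tag sequences on the left correspond to the library tags carried by the first 4 amplification primer groups in Table 1, and the 8-balanced library tags on the right correspond to those carried by the first 2 amplification primer groups in Table 2), when 4 library tags are pooled for sequencing, every base occurs evenly in the 4-balanced set, so each base accounts for 25%, whereas with the 8-balanced library tags each base accounts for 0-50%. When the pool size is a multiple of 8, for example 8 or 16 samples, the 8-balanced library tags are balanced, with each base at 25%. When 12 samples are pooled, each base in the 8-balanced library tags accounts for between 16.7% and 33.3%.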
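The percentages quoted above can be reproduced from the tag sequences themselves. The sketch below is illustrative only: the function name is hypothetical, equal pooling of all libraries is assumed, and the inputs are the Tag 1 sequences of combinations 1-12 in Tables 4 and 5 of Example 2.

```python
def base_fraction_range(tags):
    """Return (min, max) of the per-position fraction of each base a/c/g/t,
    assuming every tag contributes equally to the pool."""
    tags = [t.lower() for t in tags]
    n = len(tags)
    fractions = [
        column.count(base) / n
        for column in zip(*tags)  # one column per tag position
        for base in "acgt"
    ]
    return min(fractions), max(fractions)


# Tag 1 sequences of XDI001-XDI012 (Table 4, 4-balanced).
FOUR_BALANCED_12 = [
    "tcacattgct", "aatggcgctc", "gtctcaatga", "cggatgcaag",
    "tcgcttaagc", "cgaggcttag", "gtctaaggct", "aatacgccta",
    "aagcctattg", "cgctactgca", "tcaagagcat", "gttgtgcagc",
]

# Tag 1 sequences of MDI001-MDI012 (Table 5, 8-balanced).
EIGHT_BALANCED_12 = [
    "cgtcgatgac", "atataaggcg", "gatcgtgctc", "cagtcttcgg",
    "agaacgatct", "ttggtgcatt", "gccgtcataa", "tccaaccaga",
    "gatagcaaga", "accgtgcttc", "gcagatgtaa", "tgttggagcg",
]
```

For the 12 tags of the 4-balanced set every base sits at exactly 3/12 = 25% at every position; for the 12 tags of the 8-balanced set the range is 2/12 to 4/12, that is 16.7% to 33.3%, and for the first four 8-balanced tags alone it is 0 to 50%.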
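In addition, at pool sizes that are not an exact multiple of the group size, 4-balance is also better balanced than 8-balance, so 4-balance makes it easier to schedule runs on MGI sequencers. As the throughput of MGI sequencers keeps rising, using the 384 optimized 4-balanced tags of this application, with every four adjacent tags forming a group, makes run scheduling easier (see the 384 4-balanced tag sequences in Table 1). The 384 optimized 8-balanced tags, with every eight adjacent tags forming a group, are likewise convenient for run scheduling (see the 384 8-balanced tag sequences in Table 2).
Preferably, when the two kinds of balanced tags in this application are assembled into double-ended amplification primers, the primer 1 sequences take the 384 tag numbers in forward order and the primer 2 sequences take them in reverse order. This is only a preferred, recommended arrangement of the invention; in practice the tags can be combined as needed. For example, in Table 1 below, when primer 1 uses any one of the 96 groups, primer 2 can use any one of the other 95 groups. If more than 4 samples are to be pooled, for example 8 or 12, it suffices that the tag group numbers chosen for primer 1 differ from those chosen for primer 2: if primer 1 uses the first 3 groups, primer 2 can use any 3 of the remaining 93 groups. By extension, whenever the number of pooled samples is a multiple of 4, the double-ended library tags can be chosen by this rule.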
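The recommended forward/reverse arrangement is easy to enumerate. The sketch below uses hypothetical function names and assumes tags are numbered 1 to 384 and grouped consecutively (tags 1-4 form group 1, tags 5-8 group 2, and so on); it also checks the rule that the two tags of a pair never come from the same balanced group.

```python
def reverse_pairing(n_tags: int = 384):
    """Pair tag k on primer 1 with tag n_tags + 1 - k on primer 2."""
    return [(k, n_tags + 1 - k) for k in range(1, n_tags + 1)]


def group_of(tag_number: int, group_size: int) -> int:
    """1-based index of the balanced group that a tag number falls in."""
    return (tag_number - 1) // group_size + 1


def groups_never_coincide(pairs, group_size: int) -> bool:
    """True if no pair draws both of its tags from the same group."""
    return all(
        group_of(tag1, group_size) != group_of(tag2, group_size)
        for tag1, tag2 in pairs
    )
```

With 384 tags, the first combination is (1, 384) and the second is (2, 383), matching the pairing described for Tables 4 and 5, and the same-group condition holds for group sizes of both 4 and 8.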
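When the number of samples to be pooled is not a multiple of 4, the 4 samples with the largest sequencing data volumes are preferentially assigned to one balanced tag group, and the remaining fewer-than-4 samples with smaller data volumes are assigned tags from another balanced group for library construction and sequencing. In this situation the 4-balanced combination has a clear advantage over the 8-balanced one. Besides being better than 8-balance at half of the multiples of 4 (4, 12, 20), 4-balance is also better at non-multiples: at 4n+1 and 4n+2 samples, its balance is better than that of 8-balance for the same pool size and mixing ratio. Compared with 8-balance, 4-balance therefore has the following advantages: 1) twice as many pool sizes are balanced with 4-balance as with 8-balance; 2) among the three kinds of unbalanced arrangement, the 4n+1 and 4n+2 pools are also better balanced than with 8-balance; 3) when sequencing data volumes differ between samples, 4-balance makes it easier to arrange a near-balanced combination, with large-data samples placed in a balanced group first while small-data samples may be left unbalanced.
Table 1: 384 4-balanced tag sequences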
Figure PCTCN2020139919-appb-000001
Figure PCTCN2020139919-appb-000002
Figure PCTCN2020139919-appb-000003
Table 2: 384 8-balanced tag sequences
Figure PCTCN2020139919-appb-000004
Figure PCTCN2020139919-appb-000005
Figure PCTCN2020139919-appb-000006
Figure PCTCN2020139919-appb-000007
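The on-machine data splitting rate is higher with groups of four, because the sequencer reads evenly composed bases more accurately, while base imbalance causes read errors that lower the splitting rate. When 12 samples were pooled in equal proportion and libraries were built with 4-balanced and 8-balanced tag sequences respectively, the splitting results (Figure 10) show that with 4-balanced tags the data splitting of the 12 samples fluctuates little, whereas with 8-balanced tags some of the 12 samples show clearly reduced splitting.
On the basis of these findings, the applicant proposes the technical solution of this application.
In a typical embodiment of this application, a double-ended library tag composition is provided. The composition comprises a plurality of 5'-end library tags and a plurality of 3'-end library tags; the 5'-end library tags all have the same length, the 3'-end library tags all have the same length, and within the composition each base occurs the same number of times at each position.
By requiring every 5'-end library tag in the combination to have the same length, every 3'-end library tag to have the same length, and each base to occur the same number of times at each position, the double-ended library tag composition of this application ensures that the bases of its double-ended tags occur with equal probability. When adapters or library amplification primers carrying these tags are synthesized, multiple libraries with well-balanced tag bases are obtained; when these libraries are pooled for on-machine sequencing, the double-ended tags of each library are read with high accuracy, which raises the effective library splitting rate.
To further improve the base balance and reading accuracy of the library tags, in a preferred embodiment the 5'-end library tags and the 3'-end library tags have the same length, preferably any fixed length between 6 and 10 bp. With tags of equal length at both ends, the same number of bases at each end takes part in assigning a read to its sample, so the two ends contribute equal support. This avoids the situation in which a longer tag at one end contributes more support and a shorter tag at the other end contributes less, which would bias the splitting result toward one end.
Preferably, any two library tags in the composition differ by at least 3 bases, and no library tag contains more than 3 consecutive identical bases; preferably, the GC content of every library tag is 40-60%. Library tags that satisfy these base optimization rules are read with better balance and accuracy when used in combination, and the data splitting rate is higher.
Preferably, the composition comprises a 4-tag-balanced combination or an 8-tag-balanced combination of double-ended library tags, where the 4-tag-balanced combination consists of 4n 5'-end library tags and 4n 3'-end library tags, the 8-tag-balanced combination consists of 8n 5'-end library tags and 8n 3'-end library tags, and n is a natural number greater than or equal to 1.
In a preferred embodiment, in the 4-tag-balanced combination, the 5'-end library tags are selected from any one or more of the 96 groups shown in Table 1, and the 3'-end library tag groups are selected from any one or more of the 96 groups shown in Table 1 other than the groups used for the 5'-end library tags.
In a preferred embodiment, in the 8-tag-balanced combination, the 5'-end library tags are selected from any one or more of the 48 groups shown in Table 2, and the 3'-end library tag groups are selected from any one or more of the 48 groups shown in Table 2 other than the groups used for the 5'-end library tags.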
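In a second typical embodiment of this application, an amplification primer composition with double-ended library tags for the MGI sequencing platform is provided. The amplification primer composition comprises a combination of a plurality of amplification primer pairs carrying double-ended library tags; each amplification primer pair includes a 5'-end library tag and a 3'-end library tag; the 5'-end library tags of the primer pairs all have the same length, the 3'-end library tags of the primer pairs all have the same length, and each base occurs the same number of times at each position.
By requiring the 5'-end library tags of all primer pairs to have the same length, the 3'-end library tags to have the same length, and each base to occur the same number of times at each position, the double-ended tags in the amplification primer composition are read with balanced base composition when they label multiple samples pooled for on-machine sequencing. This makes the reads more accurate, makes the sample data split by these tags more accurate, and raises the sample splitting rate.
Building on the pooled samples having 5'-end library tags of equal length and 3'-end library tags of equal length, and to further improve the base balance and reading accuracy of the library tags, in a preferred embodiment the 5'-end library tags of the amplification primer pairs have the same length as the 3'-end library tags. With tags of equal length at both ends of each primer pair, the same number of bases at each end takes part in assigning a read to its sample, so the two ends contribute equal support. This avoids the situation in which a longer tag at one end contributes more support and a shorter tag at the other end contributes less, which would bias the splitting result toward one end.
More preferably, the 5'-end and 3'-end library tags are both any fixed length between 6 and 10 bp, most preferably 10 bp. Compared with other lengths such as 6 bp or 8 bp, a length of 10 bp gives better discrimination and more combinations to choose from.
To provide library tags with better base balance, in a preferred embodiment any two library tags in the amplification primer composition differ by at least 3 bases and no library tag contains more than 3 consecutive identical bases; more preferably, the GC content of the 5'-end library tags and of the 3'-end library tags is 40-60%. Library tags that satisfy these base optimization rules are read with better balance and accuracy when used in combination, and the data splitting rate is higher.
In a preferred embodiment, the amplification primer composition comprises a 4-tag-balanced combination of 4n amplification primer pairs or an 8-tag-balanced combination of 8n amplification primer pairs, where n is a natural number greater than or equal to 1. More preferably, among the 4-tag-balanced 4n amplification primer pairs, the 5'-end library tags are selected from any one or more of the 96 groups shown in Table 1 above, and the 3'-end library tag groups are selected from any one or more of the 96 groups shown in Table 1 other than the groups used for the 5'-end library tags. The number of groups is chosen according to actual need. Combinations of the 96 groups of tag sequences in Table 1 are read more accurately, so data splitting is more accurate and the splitting rate is higher.
In another preferred embodiment, among the 8-tag-balanced 8n amplification primer pairs, the 5'-end library tags are selected from any one or more of the 48 groups shown in Table 2, and the 3'-end library tag groups are selected from any one or more of the 48 groups shown in Table 2 above other than the groups used for the 5'-end library tags.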
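In the above amplification primer composition, each amplification primer pair also includes a 5'-end universal amplification sequence and a 3'-end universal amplification sequence. The 5'-end universal amplification sequence comprises a universal sequence upstream of the 5'-end library tag and a universal sequence downstream of it; the 3'-end universal amplification sequence comprises a universal sequence upstream of the 3'-end library tag and a universal sequence downstream of it. The specific universal amplification sequences in each primer pair are determined from the universal sequences of the existing MGI sequencing platform. An amplification primer combination formed from primer pairs that contain the improved library tags of this application improves the reading accuracy of the library tags during pooled on-machine sequencing, and thereby improves both the accuracy and the rate of splitting each sample's sequencing data.
As described above, library construction can use a relatively short bubble adapter (30±5 unpaired bases in the middle region) or a relatively long one (45±5 unpaired bases in the middle region). Accordingly, the universal sequences in the amplification primer pairs can be lengthened or shortened to match the length of the bubble adapter.
In a preferred embodiment, corresponding to the shorter bubble adapter, the universal sequence upstream of the 5'-end library tag is SEQ ID NO: 793 and the one downstream is SEQ ID NO: 794, while the universal sequence upstream of the 3'-end library tag is SEQ ID NO: 795 and the one downstream is SEQ ID NO: 796.
In another preferred embodiment, corresponding to the longer bubble adapter, the universal sequence upstream of the 5'-end library tag is SEQ ID NO: 793 and the one downstream is SEQ ID NO: 797, while the universal sequence upstream of the 3'-end library tag is SEQ ID NO: 795 and the one downstream is SEQ ID NO: 798.
In a third typical embodiment of this application, a library construction kit for the MGI sequencing platform is also provided; the kit includes any of the above amplification primer compositions. With the base-balanced double-ended library tags in these amplification primers, the tag sequence of each sample can be read accurately after pooled sequencing, which improves the accuracy and rate of splitting the pooled data by sample.
To make library construction more convenient, the kit may further include a bubble adapter for the MGI sequencing platform, comprising a first adapter sequence and a second adapter sequence, where the first adapter sequence is SEQ ID NO: 769 and the second is SEQ ID NO: 770, or the first is SEQ ID NO: 773 and the second is SEQ ID NO: 774. Besides ligating more stably and efficiently in the adapter ligation step, the improved short bubble adapter is also more compatible than the longer bubble adapter in procedures such as PCR amplification after adapter ligation.
In a fourth typical embodiment of this application, a method for constructing a sequencing library for the MGI sequencing platform is also provided; the method uses any of the above kits. When libraries constructed with the above kit are pooled for on-machine sequencing, the library tags are better balanced and read more accurately, so the subsequent splitting of each sample's sequencing data is more accurate and the data splitting rate is higher.
In a fifth typical embodiment of this application, a sequencing library is also provided; the sequencing library includes any of the above amplification primer compositions, or is constructed by any of the above methods. The library tags of the multiple samples in this sequencing library are better balanced, the tags are read more accurately after on-machine sequencing, and the subsequent library splitting rate is higher.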
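The beneficial effects of this application are further illustrated below with specific examples. It should be noted that the following examples follow the library construction workflow provided in the instruction manuals of these kits from Nanodigmbio (Nanjing) Biotechnology Co., Ltd. (纳昂达(南京)生物科技有限公司): NadPrep™ DNA Library Construction Kit (for MGI), catalog number:
Figure PCTCN2020139919-appb-000008
Plasma Cell-Free DNA Double-Ended Molecular Tag Library Construction Kit (for MGI), catalog number 1003811, instruction manual V1.0. The workflow is briefly as follows:
DNA sample fragmentation --- end repair and A-tailing --- adapter ligation --- fragment size selection --- PCR amplification --- library purification, quantification, and quality control --- sequencing on the MGI platform, or sequencing after targeted capture.
It should also be noted that the following examples are only illustrative and do not restrict the method of this application to the procedures described.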
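Example 1: Library construction Scheme 1 and Scheme 2
Procedure: follow the instruction manual of the NadPrep™ DNA Library Construction Kit (for MGI) (201909 Version 2.0).
The only differences are the bubble adapter sequences and the amplification primer sequences.
(1) Scheme 1:
Bubble adapter sequences:
Adapter sequence 1, shown in SEQ ID NO: 769, and adapter sequence 2, shown in SEQ ID NO: 770:
SEQ ID NO: 769 (31 bp): /phos/agtcggaggccaagcggtcttaggaagacaa;
SEQ ID NO: 770 (40 bp): ttgtcttcctaacaggaacgacatggctacgatccgact*t.
Amplification primer 1, shown in SEQ ID NO: 771, and amplification primer 2, shown in SEQ ID NO: 772:
SEQ ID NO: 771 (64 bp):
/phos/ctctcagtacgtcagcagttnnnnnnnnnncaactccttggctcacagaacgacatggctacga, where the sequence before nnnnnnnnnn (/phos/ctctcagtacgtcagcagtt) is denoted SEQ ID NO: 793 and the sequence after nnnnnnnnnn (caactccttggctcacagaacgacatggctacga) is denoted SEQ ID NO: 794 (the trailing gacatggctacga, bold and underlined in the original, is the part extended relative to Scheme 2).
SEQ ID NO: 772 (52 bp):
gcatggcgaccttatcagnnnnnnnnnnttgtcttcctaagaccgcttggcc, where the sequence before nnnnnnnnnn (gcatggcgaccttatcag) is denoted SEQ ID NO: 795 and the sequence after nnnnnnnnnn (ttgtcttcctaagaccgcttggcc) is denoted SEQ ID NO: 796 (the final two bases, cc, bold and underlined in the original, are the part added relative to Scheme 2).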
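The primer layout, universal sequence + 10-base tag + universal sequence, can be illustrated by assembling primers from the SEQ ID NO fragments quoted in this example. This is a sketch only: the function and dictionary names are hypothetical, the tag is assumed to replace the run of n bases directly (no reverse-complementing), and the 5' phosphate modification is not represented in the strings.

```python
# Universal sequences flanking the tag, as quoted in Example 1.
UNIVERSAL = {
    1: {  # Scheme 1 (extended primers, short bubble adapter)
        "up5": "ctctcagtacgtcagcagtt",                  # SEQ ID NO: 793
        "down5": "caactccttggctcacagaacgacatggctacga",  # SEQ ID NO: 794
        "up3": "gcatggcgaccttatcag",                    # SEQ ID NO: 795
        "down3": "ttgtcttcctaagaccgcttggcc",            # SEQ ID NO: 796
    },
    2: {  # Scheme 2 (shorter primers, long bubble adapter)
        "up5": "ctctcagtacgtcagcagtt",                  # SEQ ID NO: 793
        "down5": "caactccttggctcacagaac",               # SEQ ID NO: 797
        "up3": "gcatggcgaccttatcag",                    # SEQ ID NO: 795
        "down3": "ttgtcttcctaagaccgcttgg",              # SEQ ID NO: 798
    },
}


def build_primer_pair(tag5: str, tag3: str, scheme: int = 1):
    """Return (primer 1, primer 2) with the tags placed where the n's are."""
    u = UNIVERSAL[scheme]
    primer1 = u["up5"] + tag5.lower() + u["down5"]
    primer2 = u["up3"] + tag3.lower() + u["down3"]
    return primer1, primer2
```

With 10-base tags, the Scheme 1 primers come out at 64 and 52 bases and the Scheme 2 primers at 51 and 50 bases, matching the lengths stated for SEQ ID NO: 771, 772, 775, and 776; each Scheme 1 primer is the corresponding Scheme 2 primer plus a 3' extension.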
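Features of Scheme 1:
1. The complementary region of the adapter is 7+13 bp (within the 20±2 bp range), and the middle bubble region is 20+12 bp (within the 30±5 bp range);
2. The amplification primers are relatively long.
This has the following advantages:
1. Because the bubble region is relatively short, the annealed structure is stable.
2. The amplification primers are compatible with the single-end amplicon scheme and the molecular tag adapter scheme (see the patent on molecular tags for plasma library construction, application number 201910229527.4).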
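(2) Scheme 2:
Adapter sequences:
Adapter sequence 1, shown in SEQ ID NO: 773, and adapter sequence 2, shown in SEQ ID NO: 774.
SEQ ID NO: 773 (35 bp): /phos/agtcggaggccaagcggtcttaggaagacaatcag.
SEQ ID NO: 774 (59 bp):
ctgattgtcttcctaagcaactccttggctcacagaacgacatggctacgatccgactt.
Amplification primer 1, shown in SEQ ID NO: 775, and amplification primer 2, shown in SEQ ID NO: 776.
SEQ ID NO: 775 (51 bp):
/phos/ctctcagtacgtcagcagttnnnnnnnnnncaactccttggctcacagaac, where the sequence before nnnnnnnnnn (/phos/ctctcagtacgtcagcagtt) is again SEQ ID NO: 793 and the sequence after nnnnnnnnnn (caactccttggctcacagaac) is denoted SEQ ID NO: 797.
SEQ ID NO: 776 (50 bp):
gcatggcgaccttatcagnnnnnnnnnnttgtcttcctaagaccgcttgg, where the sequence before nnnnnnnnnn (gcatggcgaccttatcag) is again SEQ ID NO: 795 and the sequence after nnnnnnnnnn (ttgtcttcctaagaccgcttgg) is denoted SEQ ID NO: 798.
Features of this scheme:
1. The complementary region of the adapter is 7+17 bp (within the 25±2 bp range), and the middle bubble region is 34+12 bp (within the 45±5 bp range);
2. The amplification primers are relatively short; see the amplification primer section.
Compared with Scheme 1, this scheme has the following disadvantages:
1. Because the bubble region is relatively long, the annealed structure is less stable.
2. The amplification primers have poor compatibility and are not compatible with any other scheme (the primer sequences are relatively short and lack the region that overlaps the bubble region of Scheme 1, so they are difficult to use with the Scheme 1 adapter sequences).
The adapter structures and amplification results of Scheme 1 and Scheme 2 are shown in Figure 4; both ultimately yield double-ended tag libraries that can be sequenced on MGI instruments. Library construction was tested with 25 ng and 100 ng of input; the experimental details are given in the table below.
Table 3: Comparison of library yields for Scheme 1 and Scheme 2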
Figure PCTCN2020139919-appb-000009
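Both Scheme 1 and Scheme 2 build libraries successfully, and their library yields are similar (Figure 9). However, Scheme 2 is not compatible with the amplicons and molecular tag adapters developed for single-end tags on the MGI platform.
Example 2: Comparison of data splitting for 12 pooled samples with 4-balanced and 8-balanced tags
The double-ended tag scheme effectively removes crosstalk between samples (also called tag hopping), but because valid sequencing data can be split out only when the tags at both ends are correct, the tag balance requirement during sequencing is stricter than for single-end tags. This application optimized two sets of schemes, 4-balance and 8-balance. In this example, each scheme was used to sequence a pool of 12 libraries in order to measure the effective splitting rate of each sample under the two schemes. The experimental steps and details are as follows:
Procedure: library construction follows the instruction manual of the NadPrep™ DNA Library Construction Kit (for MGI) (201909 Version 2.0); the only difference is that the single-end tag adapter scheme is replaced by the double-ended tag adapter library construction scheme.
The 4-balanced double-ended tag sequences used in the experiment are listed in Table 4 below. Every 4 adjacent tags form a balanced group, and groups are distinguished by bold and non-bold type in the original. Tag 1 is the 384 sequences in forward order and tag 2 is the 384 sequences in reverse order. Tag 1 of primer 1 and tag 384 of primer 2 form the 1st double-ended tag primer combination; tag 2 of primer 1 and tag 383 of primer 2 form the 2nd; continuing in this order gives 384 combinations.
The 8-balanced tags are arranged in the same way as the 4-balanced tags; the only difference is that they are balanced in groups of 8 (see Table 5). When 12 library tags are pooled, the first 8 are balanced and the last 4 are not, whereas 12 library tags from the 4-balanced set are fully balanced.
Table 4: The 12 4-balanced double-ended tag sequence combinations
Combination No. Tag 1 No. Tag 1 sequence Tag 2 No. Tag 2 sequence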
XDI001 1(SEQ ID NO:1) tcacattgct 384(SEQ ID NO:384) gatagtaacg
XDI002 2(SEQ ID NO:2) aatggcgctc 383(SEQ ID NO:383) tgagtggcta
XDI003 3(SEQ ID NO:3) gtctcaatga 382(SEQ ID NO:382) ccgtcattac
XDI004 4(SEQ ID NO:4) cggatgcaag 381(SEQ ID NO:381) atccaccggt
XDI005 5(SEQ ID NO:5) tcgcttaagc 380(SEQ ID NO:380) gcaactgtga
XDI006 6(SEQ ID NO:6) cgaggcttag 379(SEQ ID NO:379) atccaccacc
XDI007 7(SEQ ID NO:7) gtctaaggct 378(SEQ ID NO:378) cgtgtgacat
XDI008 8(SEQ ID NO:8) aatacgccta 377(SEQ ID NO:377) tagtgatgtg
XDI009 9(SEQ ID NO:9) aagcctattg 376(SEQ ID NO:376) gcttgttcag
XDI010 10(SEQ ID NO:10) cgctactgca 375(SEQ ID NO:375) aacaagcact
XDI011 11(SEQ ID NO:11) tcaagagcat 374(SEQ ID NO:374) ttgccagtga
XDI012 12(SEQ ID NO:12) gttgtgcagc 373(SEQ ID NO:373) cgagtcagtc
Table 5: The 12 8-balanced double-ended tag sequence combinations
Combination No. Tag 1 No. Tag 1 sequence Tag 2 No. Tag 2 sequence
MDI001 1(SEQ ID NO:385) cgtcgatgac 384(SEQ ID NO:768) taacacgacg
MDI002 2(SEQ ID NO:386) atataaggcg 383(SEQ ID NO:767) tgttctcttc
MDI003 3(SEQ ID NO:387) gatcgtgctc 382(SEQ ID NO:766) gagttcacaa
MDI004 4(SEQ ID NO:388) cagtcttcgg 381(SEQ ID NO:765) ctgatgtcct
MDI005 5(SEQ ID NO:389) agaacgatct 380(SEQ ID NO:764) agacagtggc
MDI006 6(SEQ ID NO:390) ttggtgcatt 379(SEQ ID NO:763) ctcacactta
MDI007 7(SEQ ID NO:391) gccgtcataa 378(SEQ ID NO:762) gccggtaagt
MDI008 8(SEQ ID NO:392) tccaaccaga 377(SEQ ID NO:761) actggaggag
MDI009 9(SEQ ID NO:393) gatagcaaga 376(SEQ ID NO:760) caacagtaac
MDI010 10(SEQ ID NO:394) accgtgcttc 375(SEQ ID NO:759) ataacgctca
MDI011 11(SEQ ID NO:395) gcagatgtaa 374(SEQ ID NO:758) gattgcgcct
MDI012 12(SEQ ID NO:396) tgttggagcg 373(SEQ ID NO:757) cggtgttgga
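Libraries were constructed from the same human genome standard using 12 4-balanced and 12 8-balanced double-ended tag combinations, respectively. The double-ended tag sequences of the 12 4-balanced libraries follow the order listed in Table 4, and those of the 12 8-balanced libraries follow the order listed in Table 5. The 4-balanced and 8-balanced libraries were each sequenced with double-ended tags on the MGI sequencing platform and analyzed.
The raw data from the two pooled libraries were split in two rounds: the first round used the maximum error tolerance (a setting that also recovers reads containing sequencing errors), and the second round allowed only one mismatch per tag. The results are shown in Figure 10: the splitting rate of the 12 pooled 4-balanced libraries is more stable, whereas that of the 12 pooled 8-balanced libraries fluctuates more. This shows that strictly balanced double-ended tags favor effective splitting on MGI sequencers: the 8-balanced design improves the effective splitting rate to some extent, while the 4-balanced design splits data more effectively still.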
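The second-round rule, at most one mismatch per tag with both ends required to match, can be sketched as follows. This is an illustration rather than the splitting software actually used, which the source does not name; the function names are hypothetical, and the index reads are assumed to be already oriented the same way as the tag sequences in Table 4.

```python
def hamming(a: str, b: str) -> int:
    """Mismatch count; reads of the wrong length count as fully mismatched."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index1: str, index2: str, combos, max_mismatch: int = 1):
    """Return the sample whose tag 1 and tag 2 BOTH match within
    max_mismatch, or None if no sample or more than one sample qualifies."""
    hits = [
        sample
        for sample, (tag1, tag2) in combos.items()
        if hamming(index1.lower(), tag1) <= max_mismatch
        and hamming(index2.lower(), tag2) <= max_mismatch
    ]
    return hits[0] if len(hits) == 1 else None


# First three combinations of Table 4: (tag 1 sequence, tag 2 sequence).
COMBOS = {
    "XDI001": ("tcacattgct", "gatagtaacg"),
    "XDI002": ("aatggcgctc", "tgagtggcta"),
    "XDI003": ("gtctcaatga", "ccgtcattac"),
}
```

A read pair whose first index matches XDI001 and whose second index matches XDI002 is left unassigned, which is how double-ended tags filter out crosstalk; the cost is that a read error of more than one base at either end also discards the read, which is why read accuracy at both tags matters for the splitting rate.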
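Example 3
To allow the performance of this application's 48 groups of 8-balanced tag sequences to be compared with that of the 12 groups of 8-balanced tag sequences provided by MGI Tech (华大智造), the 48 groups were designed with compatibility in mind, so that they can be used on the sequencer alongside MGI Tech's 12 groups. As a result, any sequence from the application's 48 groups and any sequence from MGI Tech's 12 groups differ by at least 3 bases.
In addition, the other main differences are:
1. The base composition of the tag sequences of the present invention is more balanced, with a GC content of 40%-60%, whereas the GC content of MGI Tech's tags is 20%-80%;
2. The tag sequences of the present invention were all checked computationally for compatibility with the adapter sequences of Scheme 1 to ensure balanced amplification efficiency across the amplified libraries, whereas individual MGI Tech sequences do not meet the amplification balance requirement.
To further verify this difference in amplification balance, one group of 8-balanced tag sequences of the present invention (MDI001-MDI008) and MGI Tech's 8-balanced tag sequences MGI001-MGI008 (Table 6) were each used to build libraries according to Scheme 1 of the present invention. In every case 100 ng of input DNA was used, and library yield was measured after 5 cycles of amplification and recovery; the results are given in Table 7.
As Table 7 shows, the yields of the 8-balanced group of the present invention are fairly even, whereas one MGI Tech library (MGI005) yields only about half the typical value, indicating that the screened and optimized tag sequences of the present invention are better balanced and amplify more consistently. In addition, because current MGI sequencers have high throughput, the two sets of 384 tags of the present invention meet the throughput needs of multi-sample pooled sequencing better than MGI Tech's 120 tags.
Table 6: The 8 8-balanced double-ended tag sequence combinations from MGI Tech
Combination No. Tag 1 No. Tag 1 sequence Tag 2 No. Tag 2 sequence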
MGI001 1(SEQ ID NO:777) atgcatctaa 120(SEQ ID NO:785) tagaggacaa
MGI002 2(SEQ ID NO:778) agctctggac 119(SEQ ID NO:786) cctagcgaat
MGI003 3(SEQ ID NO:779) ctatcacgtg 118(SEQ ID NO:787) gtagtcatcg
MGI004 4(SEQ ID NO:780) ggactagtgg 117(SEQ ID NO:788) gctgagctgt
MGI005 5(SEQ ID NO:781) gccaagtcca 116(SEQ ID NO:789) aacctagata
MGI006 6(SEQ ID NO:782) cctgtcaagc 115(SEQ ID NO:790) ttgccatctc
MGI007 7(SEQ ID NO:783) tagaggtctt 114(SEQ ID NO:791) agatcttgcg
MGI008 8(SEQ ID NO:784) tatggcaact 113(SEQ ID NO:792) cgctatcggc
Table 7
Library No. Library yield Library No. Library yield
MGI001 1328 MDI001 1386
MGI002 1251 MDI002 1255
MGI003 1196 MDI003 1229
MGI004 1267 MDI004 1311
MGI005 667 MDI005 1307
MGI006 1345 MDI006 1238
MGI007 1257 MDI007 1233
MGI008 1344 MDI008 1274
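The evenness of the yields in Table 7 can be quantified with a coefficient of variation. The sketch below is illustrative: the function names are hypothetical, and the 60%-of-median cutoff used to flag a low library is an arbitrary choice made here, not a threshold from the source. Yields are taken from Table 7 as given, without units.

```python
from statistics import mean, median, stdev

# Library yields from Table 7.
MGI_YIELDS = {
    "MGI001": 1328, "MGI002": 1251, "MGI003": 1196, "MGI004": 1267,
    "MGI005": 667, "MGI006": 1345, "MGI007": 1257, "MGI008": 1344,
}
MDI_YIELDS = {
    "MDI001": 1386, "MDI002": 1255, "MDI003": 1229, "MDI004": 1311,
    "MDI005": 1307, "MDI006": 1238, "MDI007": 1233, "MDI008": 1274,
}


def coefficient_of_variation(yields) -> float:
    """Sample standard deviation divided by the mean of the yields."""
    values = list(yields.values())
    return stdev(values) / mean(values)


def low_yield_libraries(yields, fraction: float = 0.6):
    """Libraries whose yield is below `fraction` of the group median
    (the 0.6 default is an illustrative assumption)."""
    typical = median(yields.values())
    return [name for name, y in yields.items() if y < fraction * typical]
```

On these numbers the MDI group varies by roughly 4% around its mean while the MGI group varies by close to 19%, and MGI005 is the only library flagged as low.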
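As the above examples show, this application introduces double-ended library tags on the MGI sequencing platform and splits data using the tag sequences at both ends of each sample, which eliminates the crosstalk introduced during synthesis, experimental handling, and on-machine sequencing and makes detection results more accurate. In addition, by testing and optimizing the bubble adapter structure of the MGI sequencing platform, this application found that the effect is best when the unpaired middle region of the bubble adapter is 30±5 bp and the paired region is 20±2 bp; bubble adapters built this way anneal most stably, and the corresponding amplification primers are extended primers that are compatible with single-end-tagged amplicons and molecular tag adapters. When a bubble adapter of this structure is used together with the extended amplification primers (carrying double-ended library tags) for library construction, it is compatible with the modules of the existing single-end tag solution on the MGI platform, which is convenient for sequencing on MGI sequencers.
On this basis, to make data splitting after sequencing easier to arrange, this application optimized 384 tag sequences each for 4-balance and 8-balance, providing an optimal solution for high-throughput sequencing and on-machine data splitting on MGI sequencers.
The above are only preferred embodiments of the present invention and do not limit it; those skilled in the art may make various modifications and changes to the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within its scope of protection.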

Claims (12)

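  1. A double-ended library tag composition, characterized in that the double-ended library tag composition comprises a plurality of 5'-end library tags and a plurality of 3'-end library tags, the plurality of 5'-end library tags all have the same length, the plurality of 3'-end library tags all have the same length, and in the double-ended library tag composition each base occurs the same number of times at the same position.
  2. The double-ended library tag composition according to claim 1, characterized in that the length of the plurality of 5'-end library tags is the same as the length of the plurality of 3'-end library tags, preferably any fixed length between 6 and 10 bp;
    preferably, in the double-ended library tag composition, any two of the library tags differ by at least 3 bases, and no library tag contains more than 3 consecutive identical bases;
    preferably, the GC content of any of the library tags is 40-60%;
    preferably, the double-ended library tag composition comprises a 4-tag-balanced combination of double-ended library tags or an 8-tag-balanced combination of double-ended library tags, wherein the 4-tag-balanced combination is a combination of 4n of the 5'-end library tags and 4n of the 3'-end library tags, the 8-tag-balanced combination is a combination of 8n of the 5'-end library tags and 8n of the 3'-end library tags, and n is a natural number greater than or equal to 1.
  3. The double-ended library tag composition according to claim 2, characterized in that, in the 4-tag-balanced combination of double-ended library tags, the 5'-end library tags are selected from any one or more of the 96 groups shown in Table 1, and the 3'-end library tag groups are selected from any one or more of the 96 groups shown in Table 1 other than the groups of the 5'-end library tags.
  4. The double-ended library tag composition according to claim 2, characterized in that, in the 8-tag-balanced combination of double-ended library tags, the 5'-end library tags are selected from any one or more of the 48 groups shown in Table 2, and the 3'-end library tag groups are selected from any one or more of the 48 groups shown in Table 2 other than the groups of the 5'-end library tags.
  5. An amplification primer composition with double-ended library tags for the MGI sequencing platform, characterized in that the amplification primer composition comprises a combination of a plurality of amplification primer pairs carrying double-ended library tags, each of the amplification primer pairs including a 5'-end library tag and a 3'-end library tag,
    the 5'-end library tags of the plurality of amplification primer pairs all have the same length, the 3'-end library tags of the plurality of amplification primer pairs all have the same length, and each base occurs the same number of times at the same position.
  6. The amplification primer composition according to claim 5, characterized in that the length of the 5'-end library tags of the plurality of amplification primer pairs is the same as the length of the 3'-end library tags of the plurality of amplification primer pairs;
    preferably, the 5'-end library tags and the 3'-end library tags are each any fixed length between 6 and 10 bp;
    preferably, in the amplification primer composition, any two library tags differ by at least 3 bases, and no library tag contains more than 3 consecutive identical bases;
    preferably, the GC content of the plurality of 5'-end library tags and of the plurality of 3'-end library tags is 40-60%;
    preferably, the amplification primer composition comprises a 4-tag-balanced combination of 4n amplification primer pairs or an 8-tag-balanced combination of 8n amplification primer pairs, n being a natural number greater than or equal to 1.
  7. The amplification primer composition according to claim 6, characterized in that, among the 4-tag-balanced 4n amplification primer pairs, the 5'-end library tags are selected from any one or more of the 96 groups shown in Table 1, and the 3'-end library tag groups are selected from any one or more of the 96 groups shown in Table 1 other than the groups of the 5'-end library tags;
    preferably, among the 8-tag-balanced 8n amplification primer pairs, the 5'-end library tags are selected from any one or more of the 48 groups shown in Table 2, and the 3'-end library tag groups are selected from any one or more of the 48 groups shown in Table 2 other than the groups of the 5'-end library tags.
  8. The amplification primer composition according to any one of claims 5 to 7, characterized in that each of the amplification primer pairs further includes a 5'-end universal amplification sequence and a 3'-end universal amplification sequence, the 5'-end universal amplification sequence comprising a universal sequence upstream of the 5'-end library tag and a universal sequence downstream of the 5'-end library tag, and the 3'-end universal amplification sequence comprising a universal sequence upstream of the 3'-end library tag and a universal sequence downstream of the 3'-end library tag;
    preferably, the universal sequence upstream of the 5'-end library tag is SEQ ID NO: 793 and the universal sequence downstream of the 5'-end library tag is SEQ ID NO: 794; the universal sequence upstream of the 3'-end library tag is SEQ ID NO: 795 and the universal sequence downstream of the 3'-end library tag is SEQ ID NO: 796; or
    the universal sequence upstream of the 5'-end library tag is SEQ ID NO: 793 and the universal sequence downstream of the 5'-end library tag is SEQ ID NO: 797; the universal sequence upstream of the 3'-end library tag is SEQ ID NO: 795 and the universal sequence downstream of the 3'-end library tag is SEQ ID NO: 798.
  9. A sequencing library construction kit, characterized in that the kit includes the amplification primer composition according to any one of claims 5 to 8.
  10. The kit according to claim 9, characterized in that the kit further includes a bubble adapter, the bubble adapter comprising a first adapter sequence and a second adapter sequence, wherein the first adapter sequence is SEQ ID NO: 769 and the second adapter sequence is SEQ ID NO: 770, or the first adapter sequence is SEQ ID NO: 773 and the second adapter sequence is SEQ ID NO: 774.
  11. A method for constructing a sequencing library for the MGI sequencing platform, characterized in that the method uses the kit according to claim 9 or 10.
  12. A sequencing library, characterized in that the sequencing library includes the double-ended library tag composition according to any one of claims 1 to 4, or the amplification primer composition according to any one of claims 5 to 8.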
PCT/CN2020/139919 2020-08-19 2020-12-28 Double-ended library tag composition and application thereof on the MGI sequencing platform WO2022036977A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20947818.9A EP3998343B1 (en) 2020-08-19 2020-12-28 Double-ended library label composition and application thereof in mgi sequencing platform
JP2023511829A JP2023538561A (ja) 2020-08-19 2020-12-28 ペアエンドライブラリーラベル組成物及びそれのmgiシーケンシングプラットフォームにおける使用

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010838955.X 2020-08-19
CN202010838955.XA CN111910258B (zh) 2020-08-19 2020-08-19 Double-ended library tag composition and application thereof on the MGI sequencing platform

Publications (1)

Publication Number Publication Date
WO2022036977A1 (zh) 2022-02-24

Family

ID=73279440

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139919 WO2022036977A1 (zh) 2020-08-19 2020-12-28 Double-ended library tag composition and application thereof on the MGI sequencing platform

Country Status (4)

Country Link
EP (1) EP3998343B1 (zh)
JP (1) JP2023538561A (zh)
CN (1) CN111910258B (zh)
WO (1) WO2022036977A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116004763A (zh) * 2022-07-19 2023-04-25 纳昂达(南京)生物科技有限公司 Method for selection, verification and quality control of combined adapters

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
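CN112359101B (zh) * 2020-11-13 2023-10-03 苏州金唯智生物科技有限公司 Method for quality inspection of oligonucleotide cross-contamination
CN112708619B (zh) * 2020-12-30 2022-05-17 纳昂达(南京)生物科技有限公司 Adapter, kit and library construction method for library construction on the MGI platform
CN112708622A (zh) * 2021-02-01 2021-04-27 深圳裕康医学检验实验室 Adapter-primer combination for library construction and kit thereof
CN113005121B (zh) * 2021-04-25 2022-12-06 纳昂达(南京)生物科技有限公司 Adapter element, kit and related applications thereof
CN113999893B (zh) * 2021-11-09 2022-11-01 纳昂达(南京)生物科技有限公司 Library construction element compatible with two sequencing platforms, kit and library construction method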

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
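CN109971827A (zh) * 2019-03-25 2019-07-05 纳昂达(南京)生物科技有限公司 Library construction method and library construction kit for plasma DNA
CN110628882A (zh) * 2019-10-18 2019-12-31 纳昂达(南京)生物科技有限公司 Primers for PCR amplification, kit, method for detecting MSI status, and applications
CN111534518A (zh) * 2020-05-18 2020-08-14 纳昂达(南京)生物科技有限公司 Universal blocking sequence and application thereof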

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
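CN107124888B (zh) * 2014-11-21 2021-08-06 深圳华大智造科技股份有限公司 Bubble-shaped adapter element and method for constructing a sequencing library using the same
CN109486811B (zh) * 2018-09-25 2021-07-27 华大数极生物科技(深圳)有限公司 Double-ended molecular tag adapter, use thereof, and sequencing library carrying the adapter
CN113957123A (zh) * 2018-11-09 2022-01-21 广州燃石医学检验所有限公司 Method for constructing and detecting a gDNA library containing unique double-ended library tag combinations
CN111286527B (zh) * 2018-12-10 2024-03-19 深圳华大智造科技股份有限公司 Method for synthesizing the complementary strand of DNA nanoballs and sequencing method
WO2020118596A1 (zh) * 2018-12-13 2020-06-18 深圳华大生命科学研究院 Method for detecting tag sequences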

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
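CN116004763A (zh) * 2022-07-19 2023-04-25 纳昂达(南京)生物科技有限公司 Method for selection, verification and quality control of combined adapters
CN116004763B (zh) * 2022-07-19 2024-02-09 纳昂达(南京)生物科技有限公司 Method for selection, verification and quality control of combined adapters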

Also Published As

Publication number Publication date
EP3998343A4 (en) 2022-11-02
JP2023538561A (ja) 2023-09-08
EP3998343A1 (en) 2022-05-18
EP3998343B1 (en) 2024-03-20
CN111910258A (zh) 2020-11-10
EP3998343C0 (en) 2024-03-20
CN111910258B (zh) 2021-06-15

Similar Documents

Publication Publication Date Title
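WO2022036977A1 (zh) Double-ended library tag composition and application thereof in MGI sequencing platform
CN108893466A (zh) Sequencing adapter, sequencing adapter set, and method for detecting ultra-low-frequency mutations
CN105506125B (zh) DNA sequencing method and next-generation sequencing library
CN109971827B (zh) Library construction method and library construction kit for plasma DNA
JP2019523638A (ja) Multi-positioning double-tag adapter set for detecting gene mutations, and preparation method and application thereof
CN108998508B (zh) Method for constructing an amplicon sequencing library, and primer set and kit
CN113005121B (zh) Adapter element, kit and related applications thereof
CN108517567B (zh) Adapter, primer set, kit and library construction method for cfDNA library construction
WO2021227129A1 (zh) Universal high-throughput sequencing adapter and application thereof
WO2014023167A1 (zh) Method and *** for detecting α-globin gene copy number
CN109486811A (zh) Double-ended molecular tag adapter, use thereof, and sequencing library carrying the adapter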
US20230348958A1 (en) Genome-scale imaging of the 3d organization and transcriptional activity of chromatin
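CN108203847A (zh) Library, reagents and application for next-generation sequencing quality assessment
CN109706219A (zh) Method for constructing a sequencing library, kit, machine-loading method, and method for demultiplexing sequencing data
WO2020007089A1 (zh) ctDNA library construction and sequencing data analysis method for simultaneously detecting multiple common liver cancer mutations
CN113373524B (zh) ctDNA sequencing tag adapter, library, detection method and kit
CN110219054B (zh) Nucleic acid sequencing library and construction method thereof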
US20210102246A1 (en) Genetic test for detecting congenital adrenal hyperplasia
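WO2012037875A1 (zh) DNA tags and application thereof
CN113337501A (zh) Hairpin adapter and application thereof in dual-index library construction
TW201321520A (zh) Method and system for virus detection
WO2024037449A1 (zh) Method and kit for high-throughput construction of RNA sequencing libraries
WO2023082305A1 (zh) Library construction element compatible with two sequencing platforms, kit and library construction method
CN108753922A (zh) Method for constructing a transcriptome sequencing library, and corresponding adapter sequences and kit
WO2019010776A1 (zh) Combined tags, adapters, and method for determining nucleic acid sequences containing low-frequency mutations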

Legal Events

Date Code Title Description
ENP Entry into the national phase Ref document number: 2020947818; Country of ref document: EP; Effective date: 20220211
121 EP: the EPO has been informed by WIPO that EP was designated in this application Ref document number: 20947818; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase Ref document number: 2023511829; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase Ref country code: DE