CN105925671B

CN105925671B - A method for enriching target sequence nucleic acids from a nucleic acid sample

Info

Publication number: CN105925671B
Application number: CN201610250133.3A
Authority: CN
Inventors: 蔡万世; 王瑞超; 屈武斌; 杭兴宜
Original assignee: Aiji Taikang (jiaxing) Biotechnology Co Ltd
Current assignee: Aiji Taikang (Jiaxing) Biotechnology Co., Ltd.
Priority date: 2016-04-22
Filing date: 2016-04-22
Publication date: 2019-07-23
Anticipated expiration: 2036-04-22
Also published as: AU2016403554A1; WO2017181670A1; CN105925671A; AU2016102398A4

Abstract

The present invention provides a method for enriching target sequence nucleic acids from a nucleic acid sample. The method comprises: providing a nucleic acid sample comprising a target nucleic acid sequence, and bait sequences that are consistent with the target nucleic acid sequence or specific for the target sequence; performing in vitro transcription using the bait sequences as templates to prepare a nucleic acid analog, the nucleic acid analog carrying a binding moiety; fragmenting the nucleic acid sample; hybridizing the nucleic acid analog with the nucleic acid sample, so that the nucleic acid analog and the target sequence nucleic acid form a nucleic acid analog/DNA hybridization complex; and, by means of the binding moiety, separating the nucleic acid analog/DNA hybridization complex from non-specifically hybridized nucleic acid, thereby removing non-target sequence nucleic acid. In preferred embodiments, the method further comprises amplifying the nucleic acid analog/DNA hybridization complex, thereby enriching the target sequence nucleic acid.

Description

A method for enriching target sequence nucleic acids from a nucleic acid sample

Technical field

The present invention relates to the capture, enrichment and analysis of nucleic acid sequences. More specifically, the present invention relates to a target sequence enrichment method based on liquid-phase capture.

Background art

Whole-genome sequencing can detect mutations, insertions, deletions and structural variations across the entire genome. However, because the genome is large, sequencing at 30× generates close to 100G of data. Sequencing of low-frequency mutations, such as those associated with tumors, requires at least 1000× coverage; if performed as whole-genome sequencing, this would generate up to 3000G of data. Data on this scale not only makes analysis extremely difficult but also makes sequencing cost enormous. Target region capture technology emerged in response.
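The data volumes quoted above follow from multiplying mean coverage by the size of the sequenced region. A minimal sketch, assuming a human genome of roughly 3.1 gigabases (a value not stated in the text) and reading "G" as gigabases of sequence data; the function name and the 100 kb target are illustrative only:

```python
# Assumed approximate human genome size in gigabases; not given in the text.
HUMAN_GENOME_GB = 3.1


def data_volume_gb(coverage: float, region_size_gb: float = HUMAN_GENOME_GB) -> float:
    """Approximate sequencing data volume (gigabases) = mean coverage x region size."""
    return coverage * region_size_gb


# Whole genome at 30x and at 1000x, then a hypothetical 100 kb target at 1000x.
whole_genome_30x = data_volume_gb(30)
whole_genome_1000x = data_volume_gb(1000)
target_100kb_1000x = data_volume_gb(1000, 100_000 / 1e9)
```

Under these assumptions the whole-genome figures come out near the 100G and 3000G quoted above, while the same depth on a small target region needs several orders of magnitude less data.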

Target region capture technology refers to capturing the nucleic acid sequences of target regions in a directed manner by specific technical means and then constructing a library for sequencing, so that the target regions are sequenced deeply while sequencing cost is greatly reduced. PCR is a technique commonly used to enrich target regions; more commonly, multiplex PCR is used to capture multiple target regions at once. Multiplex PCR is better suited to capturing hotspot regions or short target regions; for long target regions, for example target regions longer than 100K, multiplex PCR is no longer suitable in terms of either cost or technical complexity.

Therefore, there is a need in the art for new methods suited to capturing long target regions.

Summary of the invention

To solve the above problem, the present invention provides a target sequence enrichment method based on liquid-phase capture.

In a first aspect, the present invention provides a method for enriching target sequence nucleic acids from a nucleic acid sample, the method comprising:

a) providing a nucleic acid sample comprising a target nucleic acid sequence, and bait sequences that are consistent with the target nucleic acid sequence or specific for the target sequence;

b) performing in vitro transcription using the bait sequences as templates to prepare a nucleic acid analog, the nucleic acid analog carrying a binding moiety;

c) fragmenting the nucleic acid sample, preferably preparing a library;

d) hybridizing the nucleic acid analog with the nucleic acid sample, so that the nucleic acid analog and the target sequence nucleic acid form a nucleic acid analog/DNA hybridization complex;

e) by means of the binding moiety, separating the nucleic acid analog/DNA hybridization complex from non-specifically hybridized nucleic acid, thereby removing non-target sequence nucleic acid.

In one embodiment, adapter sequences are ligated to both ends of the nucleic acid sample fragments during library preparation in step c), and the method further comprises, after step e), a step f) of amplifying the nucleic acid analog/DNA hybridization complex according to the adapter sequences, thereby enriching the target sequence nucleic acid.

In one embodiment, the bait sequences have characteristics selected from the following: i) they do not form hairpin structures themselves and do not form dimers with one another; ii) their copy number is compensated according to the GC content and/or spatial structure of the target nucleic acid sequence; iii) when the target region is a region of extremely high or extremely low GC content, or a low-complexity region, the regions flanking the target region are used as replacement regions for designing baits, with the same design method as for the target region; and iv) they do not bind specifically to sequences in the nucleic acid sample other than the target nucleic acid sequence.

In one embodiment, the copy number of the bait sequences is also compensated according to the degree of interest in the target nucleic acid sequence.

In one embodiment, the nucleic acid sample is genomic DNA, RNA, cDNA or mRNA; where the nucleic acid sample is RNA or mRNA, a step of reverse transcribing the RNA or mRNA into DNA precedes step c).

In one embodiment, the bait sequences are on a solid support, for example on a microarray slide.

In one embodiment, the solid support is alternatively a plurality of beads or a microarray.

In one embodiment, some or all of the nucleic acid analogs carry a binding moiety.

In one embodiment, in step b) the in vitro transcription is performed using the nucleic acid analog GNA, LNA, PNA, TNA or morpholino nucleic acid to prepare the nucleic acid analog; preferably the nucleic acid analog carries a binding moiety.

In one embodiment, the binding moiety is a biotin binding moiety.

In one embodiment, the copy number of the bait sequences is compensated according to the GC content of the target sequence: the lower or higher the GC content, the more the copy number of the corresponding bait sequences is increased.

In one embodiment, compensating the copy number according to the GC content of the target nucleic acid sequence means: taking the copy number coefficient of bait sequences with a GC content of 50% as the baseline of 1, for every 1% that the GC content deviates from 50% within the range 10%-90%, the bait sequence copy number coefficient increases by 0.08-0.12.
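The linear rule above can be sketched as a small function. This is only an illustration: the function name and the default step of 0.10 (the midpoint of the stated 0.08-0.12 range) are assumptions, not part of the described method.

```python
def copy_number_coefficient(gc_percent: float, step_per_percent: float = 0.10) -> float:
    """Bait copy-number coefficient relative to a bait at 50% GC (coefficient 1).

    step_per_percent is the increase per 1% deviation from 50% GC. The text
    gives a range of 0.08-0.12; the default of 0.10 is an assumed midpoint.
    """
    if not 10.0 <= gc_percent <= 90.0:
        raise ValueError("the rule is stated only for GC content between 10% and 90%")
    if not 0.08 <= step_per_percent <= 0.12:
        raise ValueError("step outside the 0.08-0.12 range given in the text")
    return 1.0 + abs(gc_percent - 50.0) * step_per_percent


# A 68% GC target deviates by 18%, giving the range bounded by the two step values.
low = copy_number_coefficient(68, 0.08)
high = copy_number_coefficient(68, 0.12)
```

Because the rule depends only on the absolute deviation from 50%, a 32% GC target and a 68% GC target receive the same coefficient.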

In a specific embodiment, the bait sequence copy number compensation method is as follows: target sequences are divided by GC content into 6 tiers, where tier 1 is 10%-30%; tier 2 is 30%-40%; tier 3 is 40%-60%; tier 4 is 60%-70%; tier 5 is 70%-90%; and tier 6 is less than 10% or greater than 90%. The copy number of tier 3 bait sequences is the baseline copy number; the copy number of tier 2 and tier 4 bait sequences is higher than that of tier 3, for example 2.2-2.8 times that of tier 3; and the copy number of tier 1 and tier 5 bait sequences is higher still, for example 3-4 times that of tier 3. For tier 6, where the GC content is less than 10% or greater than 90% and the target region is a low-complexity sequence, the bait sequence design method is: probes are designed using the regions flanking the target region as replacement regions, generally choosing the region within 300bp on either side of the target region as the replacement region, preferably the region within 150bp.
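The tiering above can be sketched as a lookup. The text does not say which tier a boundary value such as exactly 30% or 60% belongs to, so the boundary handling below is an assumption, as are the function and variable names.

```python
from typing import Optional, Tuple

# (low, high) copy-number multipliers relative to tier 3, from the text's examples.
TIER_MULTIPLIER = {
    1: (3.0, 4.0),
    2: (2.2, 2.8),
    3: (1.0, 1.0),
    4: (2.2, 2.8),
    5: (3.0, 4.0),
}


def gc_percent(seq: str) -> float:
    """GC content of a DNA sequence, in percent."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def gc_tier(gc: float) -> int:
    """Tier 1-6 for a GC percentage; boundary assignment is an assumption."""
    if gc < 10.0 or gc > 90.0:
        return 6
    if gc < 30.0:
        return 1
    if gc < 40.0:
        return 2
    if gc <= 60.0:
        return 3
    if gc <= 70.0:
        return 4
    return 5


def copy_multiplier(gc: float) -> Optional[Tuple[float, float]]:
    """Multiplier range for the tier, or None for tier 6 (flanking regions are used instead)."""
    return TIER_MULTIPLIER.get(gc_tier(gc))
```

Returning `None` for tier 6 reflects that such targets are handled by designing baits in the flanking regions rather than by raising the copy number.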

In one embodiment, the bait sequences are 60-150bp long, preferably 80-120bp.

In one embodiment, being consistent with the target nucleic acid sequence or specific for the target sequence means that the thermodynamic stability of bait binding at non-target regions is significantly lower than the thermodynamic stability of binding at the target region: preferably the difference between the target-region Tm and the non-specific-region Tm is ≥5 °C, more preferably ≥10 °C. The Tm value is preferably calculated by the nearest-neighbor method based on the SantaLucia 2007 thermodynamic parameter table.
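As a rough sketch, the specificity criterion reduces to a comparison of melting temperatures. The Tm values are taken as inputs here (they would come from a nearest-neighbor calculation or any other method); the function name is hypothetical.

```python
from typing import Iterable


def is_specific(tm_target: float, off_target_tms: Iterable[float], min_delta: float = 5.0) -> bool:
    """True if the on-target Tm exceeds every off-target Tm by at least min_delta (deg C).

    min_delta is 5.0 for the preferred criterion and 10.0 for the more preferred one.
    """
    return all(tm_target - tm_off >= min_delta for tm_off in off_target_tms)
```

For example, a bait with an on-target Tm of 75 °C and off-target Tm values of 60 °C and 69 °C passes the 5 °C criterion but fails the 10 °C criterion.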

In one embodiment, the absence of dimer formation means that the dimer formed between any two bait sequences has a Tm ≤47 °C, preferably ≤37 °C. The Tm value is preferably calculated by the nearest-neighbor method based on the SantaLucia 2007 thermodynamic parameter table.

In one embodiment, the absence of hairpin formation means that the hairpin structure formed by any bait sequence itself has a Tm ≤47 °C, preferably ≤37 °C. The Tm value is preferably calculated by the nearest-neighbor method based on the SantaLucia 2007 thermodynamic parameter table.

In one embodiment, for each target region, the bait sequences are the one or more bait sequences with the best composite score in terms of specificity, dimers, hairpin structure and position relative to the target region. The composite score is given by the following scoring function: $S = a \times S_{\text{specificity}} + b \times S_{\text{dimer}} + c \times S_{\text{hairpin}} + d \times S_{\text{distance}}$, where a = 0.26-0.34, b = 0.08-0.12, c = 0.17-0.23 and d = 0.35-0.45. The individual scores are calculated as follows:

$S_{\text{specificity}}$ (specificity): each newly designed bait sequence is aligned against the genome, and for each sequence it aligns to, the Tm between the bait sequence and the aligned sequence is calculated. The difference between the Tm of the bait sequence with the target region and its Tm with any aligned sequence is ≥5 °C, preferably ≥10 °C. The average Tm between the bait sequence and all aligned sequences is then calculated, and $S_{\text{specificity}} = 1 - T_{m,\text{avg}} / (T_{m,\text{target}} - 5)$, preferably $S_{\text{specificity}} = 1 - T_{m,\text{avg}} / (T_{m,\text{target}} - 10)$, where $T_{m,\text{avg}}$ is the average Tm of the bait sequence over all non-specific-region alignment results and $T_{m,\text{target}}$ is the Tm of the bait sequence with the target region;

$S_{\text{dimer}}$ (dimer): each newly designed bait sequence is compared for dimer formation against every bait sequence already designed, and for each sequence it aligns to, the Tm between the bait sequence and the aligned bait sequence is calculated. The Tm is <47 °C; the average Tm between the bait sequence and all aligned bait sequences is calculated, and $S_{\text{dimer}} = (47 - T_{m,\text{avg}})/47$. Preferably the Tm is <37 °C; the average Tm between the bait sequence and all aligned bait sequences is calculated, and $S_{\text{dimer}} = (37 - T_{m,\text{avg}})/37$;

$S_{\text{hairpin}}$ (hairpin structure): for any bait sequence, its optimal self-alignment structure is computed and the Tm of that structure is calculated. The Tm is <47 °C, and $S_{\text{hairpin}} = (47 - T_m)/47$; preferably the Tm is <37 °C, and $S_{\text{hairpin}} = (37 - T_{m,\text{avg}})/37$;

$S_{\text{distance}}$ (relative distance): given the target region coordinates, the difference $\delta_{\text{distance}}$ between the coordinates of each newly designed bait sequence and the target region coordinates is calculated; $\delta_{\text{distance}}$ is less than 150, and $S_{\text{distance}} = (150 - \delta_{\text{distance}})/150$.
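The four sub-scores and the weighted sum can be sketched as follows. This is an illustration under stated assumptions: the Tm values are supplied as inputs, the weights are set to the midpoints of the stated ranges (0.30, 0.10, 0.20, 0.40, which happen to sum to 1), and a bait with no off-target or dimer alignments is assumed to score 1.0. All function names are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence

# Assumed choice: midpoints of the weight ranges a, b, c, d given in the text.
WEIGHTS: Dict[str, float] = {"specificity": 0.30, "dimer": 0.10, "hairpin": 0.20, "distance": 0.40}


def specificity_score(tm_target: float, off_target_tms: Sequence[float], delta: float = 5.0) -> float:
    """1 - Tm_avg / (Tm_target - delta); 1.0 if nothing aligned (assumption)."""
    if not off_target_tms:
        return 1.0
    return 1.0 - mean(off_target_tms) / (tm_target - delta)


def dimer_score(dimer_tms: Sequence[float], limit: float = 47.0) -> float:
    """(limit - Tm_avg) / limit; 1.0 if no dimer alignment was found (assumption)."""
    if not dimer_tms:
        return 1.0
    return (limit - mean(dimer_tms)) / limit


def hairpin_score(hairpin_tm: float, limit: float = 47.0) -> float:
    """(limit - Tm) / limit for the bait's best self-alignment structure."""
    return (limit - hairpin_tm) / limit


def distance_score(delta_distance: float, limit: float = 150.0) -> float:
    """(limit - delta) / limit for the coordinate difference to the target region."""
    return (limit - delta_distance) / limit


def composite_score(s_spec: float, s_dimer: float, s_hairpin: float, s_dist: float,
                    weights: Dict[str, float] = WEIGHTS) -> float:
    """S = a*S_specificity + b*S_dimer + c*S_hairpin + d*S_distance."""
    return (weights["specificity"] * s_spec + weights["dimer"] * s_dimer
            + weights["hairpin"] * s_hairpin + weights["distance"] * s_dist)
```

With these weights, relative distance and specificity dominate the ranking, while dimer formation contributes least.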

In a second aspect, the present invention also provides specific bait sequences for carrying out the method of the invention, the specific bait sequences being the bait sequences of the first aspect of the invention.

In one embodiment, the specific bait sequences are consistent with the target nucleic acid sequence or specific for the target sequence, and: i) they do not form hairpin structures themselves and do not form dimers with one another; ii) their copy number is compensated according to the GC content and/or spatial structure of the target nucleic acid sequence; iii) when the target region is a region of extremely high or extremely low GC content, or a low-complexity region, the regions flanking the target region are used as replacement regions for designing probes, with the same design method as for the target region; and iv) they do not bind specifically to sequences in the nucleic acid sample other than the target nucleic acid sequence.

In a third aspect, the present invention also provides a kit comprising the bait sequences of the second aspect of the invention; the kit further includes, but is not limited to, double-stranded adapter molecules and a plurality of different oligonucleotide probes.

In one embodiment, the kit comprises compositions and reagents for carrying out the method of the first aspect of the invention. The kit includes, but is not limited to, double-stranded adapter molecules, a plurality of different oligonucleotide probes, and bait sequences that are consistent with the target nucleic acid sequence or specific for the target sequence, wherein the bait sequences: i) do not form hairpin structures themselves and do not form dimers with one another; ii) have a copy number compensated according to the GC content, spatial structure and/or degree of interest of the target nucleic acid sequence; iii) when the target region is a region of extremely high or extremely low GC content, or a low-complexity region, are designed as probes using the regions flanking the target region as replacement regions, with the same design method as for the target region; and iv) do not bind specifically to sequences in the nucleic acid sample other than the target nucleic acid sequence. In certain embodiments, the kit comprises two different double-stranded adapter molecules. The kit may further comprise one or more other components selected from DNA polymerase, T4 polynucleotide kinase, T4 DNA ligase, hybridization solution, wash solution and/or elution solution. In certain embodiments, the kit comprises a magnet. In certain embodiments, the kit comprises one or more enzymes together with the corresponding reagents, buffers and the like, for example a restriction enzyme such as MlyI and the buffers/reagents for performing a restriction digest with MlyI.

Specific embodiment

The present invention provides a target sequence enrichment method based on liquid-phase capture, the method comprising: bait sequence design; nucleic acid synthesis of the bait sequences (by custom primer synthesis or solid-phase synthesis); preparation of a nucleic acid analog by in vitro transcription, the nucleic acid analog comprising a binding moiety; pre-treatment of the nucleic acid sample (by library preparation), where the sample may be genomic DNA, RNA, cDNA, mRNA and so on; formation of a nucleic acid analog/DNA hybridization complex between the nucleic acid analog and the target sequence nucleic acid by complementary base pairing; elution to remove nucleic acid analog/DNA hybrids with low complementarity, thereby removing non-target sequence nucleic acid; and specific amplification of the complementary-paired nucleic acid analog/DNA according to the adapter sequences added during sample pre-treatment, thereby enriching the target sequence nucleic acid.

In the present invention, the term "sample" is used in its broadest sense and is intended to include a sample or culture obtained from any source, preferably from a biological source. Biological samples may be obtained from animals (including humans) and include liquids, solids, tissues and gases. Biological samples include blood products such as plasma, serum and the like. Accordingly, a "nucleic acid sample" includes nucleic acid from any source (such as DNA, RNA, cDNA, mRNA, tRNA, miRNA and so on). Where the nucleic acid sample is RNA or mRNA, a step of reverse transcribing the RNA or mRNA into DNA precedes step c). In this application, the nucleic acid sample preferably originates from a biological source, such as human or non-human cells, tissues and the like. The term "non-human" means all non-human animals and entities, including but not limited to vertebrates such as rodents, non-human primates, sheep, cattle, ruminants, lagomorphs, pigs, goats, horses, dogs, cats, birds and so on. Non-human also includes invertebrates and prokaryotes, such as bacteria, plants, yeast, viruses and so on. Accordingly, the nucleic acid sample used in the methods and systems of the invention is a nucleic acid sample from any organism, whether eukaryotic or prokaryotic.

In the present invention, the inventors found that the GC content of a target sequence has a large effect on the capture efficiency of liquid-phase-based target sequence capture. To capture multiple target sequences effectively, the bait sequence copy number is preferably compensated according to the GC content of the target sequence: the lower or higher the GC content, the more the copy number of the corresponding bait sequences is increased.

The inventors found that target sequences with a GC content of around 50%, for example ±10%, achieve good target sequence capture efficiency; for target sequences of other GC content, bait sequence copy number compensation is needed to obtain good capture efficiency. Through comprehensive testing with human genome sequences, the inventors found that, to achieve better target sequence capture efficiency, the copy number coefficient of bait sequences with a GC content of 50% is taken as the baseline of 1, and for every 1% that the GC content deviates from 50% within the range 10%-90%, the bait sequence copy number coefficient increases by 0.08-0.12. For example, when the GC content is 68%, the deviation is 18% and the bait sequence copy number coefficient is 2.44-3.16.
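Written out as a formula, the rule and the 68% example read as follows. The symbols $k$ (copy number coefficient) and $s$ (increase per 1% of deviation) are notation introduced here for illustration and do not appear in the original text.

```latex
k(\mathrm{GC}) = 1 + s \cdot \lvert \mathrm{GC} - 50 \rvert,
\qquad 10 \le \mathrm{GC} \le 90, \quad 0.08 \le s \le 0.12

\mathrm{GC} = 68: \quad \lvert 68 - 50 \rvert = 18, \quad
k_{\min} = 1 + 0.08 \times 18 = 2.44, \quad
k_{\max} = 1 + 0.12 \times 18 = 3.16
```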

Cases where the GC content is less than 10% or greater than 90% are treated as low-complexity sequences. In this case the corresponding bait sequence design method is: when the target region is a region of extremely high or extremely low GC content, or a low-complexity region, probes are designed using the regions flanking the target region as replacement regions, generally choosing the region within 300bp on either side of the target region as the replacement region, preferably the region within 150bp.

In the present invention, a low-complexity region means a region composed of very few types of elements (such as oligonucleotides), for example simple repeat sequences such as microsatellites.

In the present invention, library construction is preferably performed on the sample DNA fragments after fragmentation.

In one embodiment, the bait sequence copy number compensation method can be expressed simply as follows: target sequences are divided by GC content into 6 tiers, where tier 1 is 10%-30%; tier 2 is 30%-40%; tier 3 is 40%-60%; tier 4 is 60%-70%; tier 5 is 70%-90%; and tier 6 is less than 10% or greater than 90%. The copy number of tier 3 bait sequences is the baseline copy number; the copy number of the bait sequences for tier 2 and tier 4 needs to be increased, for example to 2.2-2.8 times that of tier 3; and the copy number of tier 1 and tier 5 bait sequences needs to be increased further, for example to 3-4 times that of tier 3. In one embodiment, for tier 6, where the GC content is less than 10% or greater than 90%, or the sequence is a low-complexity sequence, the bait sequence design method is: probes are designed using the regions flanking the target region as replacement regions, generally choosing the region within 300bp on either side of the target region as the replacement region, preferably the region within 150bp.

In one embodiment, for each target region, the bait sequences are the one or more bait sequences with the best composite score in terms of specificity, dimers, hairpin structure and position relative to the target region. The composite score is given by the following scoring function: $S = a \times S_{\text{specificity}} + b \times S_{\text{dimer}} + c \times S_{\text{hairpin}} + d \times S_{\text{distance}}$, where a = 0.26-0.34, b = 0.08-0.12, c = 0.17-0.23 and d = 0.35-0.45. Each individual score, such as $S_{\text{specificity}}$, is a value between 0 and 1. The individual scores are calculated as follows:

$S_{\text{specificity}}$ scoring rule: each newly designed bait sequence is aligned against the genome using BLAT software with default parameters, and the thermodynamic Tm is calculated for each alignment result. If for any alignment the difference between the target-region Tm and the non-specific-region Tm is <5 °C, preferably <10 °C, the bait sequence is discarded and redesigned; otherwise the average Tm over all non-specific-region alignment results is calculated, and finally $S_{\text{specificity}} = 1 - T_{m,\text{avg}} / (T_{m,\text{target}} - 5)$, preferably $S_{\text{specificity}} = 1 - T_{m,\text{avg}} / (T_{m,\text{target}} - 10)$, where $T_{m,\text{avg}}$ is the average Tm of the bait sequence over all non-specific-region alignment results and $T_{m,\text{target}}$ is the Tm of the bait sequence with the target region;

$S_{\text{dimer}}$ scoring rule: each newly designed bait sequence is compared for dimer formation against every bait sequence already designed, using BLAT software with default parameters, and the thermodynamic Tm is calculated for each alignment result. If any Tm is ≥47 °C, the bait sequence is discarded and redesigned; otherwise the average Tm over all alignment results is calculated, and finally $S_{\text{dimer}} = (47 - T_{m,\text{avg}})/47$. Preferably, if any Tm is ≥37 °C, the bait sequence is discarded and redesigned; otherwise the average Tm over all alignment results is calculated, and $S_{\text{dimer}} = (37 - T_{m,\text{avg}})/37$;

$S_{\text{hairpin}}$ scoring rule: for any bait sequence, its optimal self-alignment structure is computed using the Smith-Waterman algorithm, and the thermodynamic Tm is calculated from that structure. If the Tm is ≥47 °C, the bait sequence is discarded and redesigned; otherwise $S_{\text{hairpin}} = (47 - T_m)/47$. Preferably, if the Tm is ≥37 °C, the bait sequence is discarded and redesigned; otherwise $S_{\text{hairpin}} = (37 - T_{m,\text{avg}})/37$;

$S_{\text{distance}}$ scoring rule: given the coordinates of the target region to be designed for, the difference $\delta_{\text{distance}}$ between the coordinates of any bait sequence and the target region coordinates is calculated. The acceptable difference is set to 150, which is an empirical value. If the difference is greater than 150, the bait sequence is discarded and redesigned; otherwise $S_{\text{distance}} = (150 - \delta_{\text{distance}})/150$. If no suitable bait sequence can be designed within a coordinate difference of 150 from the target region, the acceptable difference can instead be set to 300, with $S_{\text{distance}} = (300 - \delta_{\text{distance}})/300$.
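The discard-or-score rules above, including the fallback from 150 to 300, might be combined into a screening step like the one below. This is a sketch only: Tm values and coordinate differences are assumed to have been computed already (for example from BLAT and Smith-Waterman alignments), the weights are assumed midpoints of the stated ranges, and the `Candidate` class and function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    tm_target: float             # Tm of the bait with its target region
    off_target_tms: List[float]  # Tm of each non-specific genome alignment
    dimer_tms: List[float]       # Tm of each alignment to an already designed bait
    hairpin_tm: float            # Tm of the best self-alignment structure
    delta_distance: float        # coordinate difference to the target region


def score_or_discard(c: Candidate, tm_limit: float = 47.0, delta_tm: float = 5.0,
                     max_distance: float = 150.0) -> Optional[float]:
    """Return the composite score, or None if any rule says discard and redesign."""
    if any(c.tm_target - t < delta_tm for t in c.off_target_tms):
        return None
    if any(t >= tm_limit for t in c.dimer_tms):
        return None
    if c.hairpin_tm >= tm_limit:
        return None
    if c.delta_distance > max_distance:
        return None
    # Empty alignment lists score 1.0 (an assumption; the text does not cover this case).
    s_spec = 1.0 - mean(c.off_target_tms) / (c.tm_target - delta_tm) if c.off_target_tms else 1.0
    s_dimer = (tm_limit - mean(c.dimer_tms)) / tm_limit if c.dimer_tms else 1.0
    s_hairpin = (tm_limit - c.hairpin_tm) / tm_limit
    s_dist = (max_distance - c.delta_distance) / max_distance
    # Assumed weights: midpoints of the ranges for a, b, c and d.
    return 0.30 * s_spec + 0.10 * s_dimer + 0.20 * s_hairpin + 0.40 * s_dist


def best_bait(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick the top-scoring bait within 150; if none survives, retry with 300."""
    for max_distance in (150.0, 300.0):
        scored = [(score_or_discard(c, max_distance=max_distance), c) for c in candidates]
        kept = [(s, c) for s, c in scored if s is not None]
        if kept:
            return max(kept, key=lambda pair: pair[0])[1]
    return None
```

Passing `tm_limit=37.0` and `delta_tm=10.0` to `score_or_discard` gives the stricter preferred variant of the same rules.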

In the present invention, the T of sequence_mCalculating be not limited to specific method, the Tm value that various methods calculate can be with For the present invention, the Tm value that various methods obtain cannot reverse effect of the invention substantially, and only the degree of effect can be variant. Although the nearest neighbor algorithm of 2007 thermodynamic parameter table of SantaLucia can calculate Tm, the Tm value that other methods calculate can be with It corresponds, the Tm that those skilled in the art can be calculated by the simple more various methods of test, thus to each The Tm value that kind method calculates makes appropriate selection.

In the inventors' experience, for the coding regions of the human genome, bait sequences suitable for the invention can be designed for more than 99% of target regions, which shows that the GC-region tiering and the Tm filtering described above are both reasonable.

In certain embodiments, hybridization between the nucleic acid analog and the target nucleic acid is preferably carried out under stringent conditions sufficient to support hybridization between the nucleic acid analog and DNA, where the nucleic acid analog comprises a linking compound and a region complementary to the target nucleic acid sample, so as to provide the nucleic acid analog/DNA hybridization complex. The complex is then captured via the linking compound and washed under conditions sufficient to remove non-specifically bound nucleic acid, after which the hybridized target nucleic acid sequence is eluted from the captured nucleic acid analog/DNA complex.

In certain embodiments, the nucleic acid analog comprises a chemical group or linking compound, for example a binding moiety such as biotin, digoxigenin and the like, which can bind to a solid support. The solid support may comprise a corresponding capture compound, such as streptavidin for biotin or an anti-digoxigenin antibody for digoxigenin. The present invention is not limited to the linking compounds used, and alternative linking compounds are equally applicable to the methods, bait sequences and kits of the invention.

In the present invention, the chemical group or linking compound, for example a binding moiety such as biotin, digoxigenin and the like, may be attached to any base in the nucleic acid analog (glycerol nucleic acid GNA, locked nucleic acid LNA, peptide nucleic acid PNA, threose nucleic acid TNA or morpholino nucleic acid). Preferably, the nucleic acid analog chain may contain ribose and/or deoxyribose, and the chemical group or linking compound, for example a binding moiety such as biotin, digoxigenin and the like, may be attached to a base on the ribose and/or deoxyribose. For example, labeled ATP, CTP, GTP and/or UTP are used in the synthesis of the nucleic acid analog. Methods of labeling nucleotides with Cydye, DIG, biotin, rhodamine, fluorescein and the like are known in the art. For example, biotin can be used as a nucleic acid probe label: it can be attached to the C atom at the 5 position of UTP or dUTP of a nucleic acid molecule, and can be bound by avidin and thereby detected. However, the present invention is not limited to known labels and labeling methods; labels and labeling methods discovered in the future are also within the contemplation of the invention.

In embodiments of the invention, the plurality of target nucleic acid molecules preferably comprises the whole genome or at least one chromosome of an organism, or a nucleic acid molecule of any molecular size. Preferably, the size of the nucleic acid molecule is at least about 200kb, at least about 500kb, at least about 1Mb, at least about 2Mb or at least about 5Mb, more preferably about 100kb to about 5Mb, about 200kb to about 5Mb, about 500kb to about 5Mb, about 1Mb to about 2Mb or about 2Mb to about 5Mb.

In certain embodiments, the target nucleic acid is from an animal, plant or microorganism; in preferred embodiments, the target nucleic acid molecules are from a human. If the amount of nucleic acid sample is small (as with human nucleic acid samples obtained in some circumstances, for example the genome of a developing fetus), the nucleic acid can be amplified before carrying out the method of the invention, for example by whole-genome amplification. Prior amplification may be necessary to carry out the method of the invention, for example in forensic applications (such as genetic identification in forensic medicine).

In certain embodiments, the plurality of target nucleic acid molecules is a set of genomic DNA molecules. The bait sequences may be selected from, for example: a plurality of bait sequences defining a plurality of exons, introns or regulatory sequences from a plurality of genetic loci; a plurality of bait sequences defining the complete sequence of at least one single genetic locus, the locus being of any size, preferably at least 1Mb or at least one of the sizes specified above; a plurality of bait sequences defining single nucleotide polymorphisms (SNPs); or a plurality of bait sequences defining an array, for example a tiling array designed to capture the complete sequence of at least one complete chromosome.

As used herein, the term "hybridization" means the pairing of complementary nucleic acids. Hybridization and the strength of hybridization (that is, the strength of association between nucleic acids) are affected by many factors, such as the degree of complementarity between the nucleic acids, the stringency of the hybridization conditions used, the melting temperature (Tm) of the hybrid formed and the GC content of the nucleic acids. Although the invention is not limited to particular hybridization conditions, stringent hybridization conditions are preferred. Stringent hybridization conditions are sequence-dependent and vary with hybridization parameters (such as salt concentration, the presence of organics and so on). In general, "stringent" conditions are selected to be about 5 °C to about 20 °C lower than the Tm of the specific nucleic acid sequence at a defined ionic strength and pH. Preferably, stringent conditions are about 5 °C to 10 °C lower than the melting temperature of the specific nucleic acid bound to its complementary nucleic acid. The Tm is the temperature (at a defined ionic strength and pH) at which 50% of the nucleic acid (for example the target nucleic acid) hybridizes to a perfectly matched probe.
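The temperature windows described above can be sketched as a small helper. The Tm is an input computed elsewhere; the function name is hypothetical and the offsets are the approximate values given in the text.

```python
from typing import Tuple


def stringent_temperature_window(tm_celsius: float, preferred: bool = False) -> Tuple[float, float]:
    """Return (lowest, highest) hybridization temperature in deg C.

    General window: about 5-20 deg C below Tm.
    Preferred window: about 5-10 deg C below Tm.
    """
    max_offset = 10.0 if preferred else 20.0
    return (tm_celsius - max_offset, tm_celsius - 5.0)
```

For a hybrid with a Tm of 75 °C, the general window is 55-70 °C and the preferred window is 65-70 °C.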

As used herein, "stringent conditions" may be, for example, hybridization at 42 °C in 50% formamide, 5× SSC (0.75M NaCl, 0.075M sodium citrate), 50mM sodium phosphate (pH 6.8), 0.1% sodium pyrophosphate, 5× Denhardt's solution, sonicated salmon sperm DNA (50mg/ml), 0.1% SDS and 10% dextran sulfate, with washing at 42 °C in 0.2× SSC (sodium chloride/sodium citrate) and at 55 °C in 50% formamide, followed by washing at 55 °C in 0.1× SSC containing EDTA. As a further example, a buffer comprising 35% formamide, 5× SSC and 0.1% (w/v) sodium dodecyl sulfate (SDS) is expected to be suitable for hybridization at 45 °C for 16-72 hours under moderately non-stringent conditions.

As used herein, the term "primer" means an oligonucleotide, whether occurring naturally as in a purified restriction digest or produced synthetically, that is capable of acting as a point of initiation of synthesis when placed under conditions in which synthesis of a primer extension product complementary to a nucleic acid strand is induced (that is, in the presence of nucleotides and an inducing agent such as DNA polymerase, and at a suitable temperature and pH). The primer is preferably single-stranded for maximum amplification efficiency. Preferably, the primer is an oligodeoxyribonucleotide. The primer must be long enough to prime the synthesis of extension products in the presence of the inducing agent. The exact length of the primer depends on many factors, including temperature, the source of the primer and the method used.

As used herein, the term "bait" or "bait sequence" means an oligonucleotide (that is, a nucleotide sequence), whether occurring naturally as in a purified restriction digest or produced by synthesis, recombination or PCR amplification, that is capable of hybridizing to at least a portion of another oligonucleotide of interest, such as a target nucleic acid sequence. Probes may be single-stranded or double-stranded. Probes are useful for the detection, identification and isolation of particular gene sequences.

Herein, the term "target nucleic acid molecule" refers to a molecule or sequence from a genomic region of interest. The pre-selected probes determine the range of target nucleic acid molecules. The "target" is therefore to be distinguished from other nucleic acid sequences. A "segment" is defined as a nucleic acid region within the target sequence, e.g., a "segment" or "portion" of a nucleic acid sequence.

Herein, the term "isolated", when used in relation to a nucleic acid, as in "an isolated nucleic acid", means a nucleic acid sequence that has been identified and separated from at least one other component or contaminant with which it is ordinarily associated in its natural source. An isolated nucleic acid is present in a form different from that in which it occurs in nature. In contrast, non-isolated nucleic acids such as DNA and RNA are found in the state in which they occur in nature. The isolated nucleic acid, oligonucleotide or polynucleotide may be present in single-stranded or double-stranded form.

Herein, the term "bait sequence consistent with the target nucleic acid sequence" refers to a sequence whose complementary sequence can hybridize to the target nucleic acid sequence, preferably under stringent conditions. When the target region has an extremely high or extremely low GC content, or is a low-complexity region, no bait sequence can be designed in the region itself, i.e., the bait sequence coverage is zero; in that case a suitable region for bait design can be sought on either side of the target region. Bait sequences are generally designed within 300 bp on either side, preferably within 150 bp.

In embodiments of the invention, the primers used to transcribe the bait sequences in the capture methods and kits described herein comprise a linking compound, such as a binding moiety. A binding moiety includes any moiety attached to or introduced at the 5' end of an amplification primer for subsequent capture of the nucleic acid analog/target nucleic acid hybridization complex. A binding moiety may be any sequence introduced at the 5' end of the primer sequence, such as a capturable 6-histidine (6HIS) sequence. For example, a primer comprising a 6HIS sequence can be captured by nickel, e.g., in nickel-coated tubes or microwells, or in purification columns containing nickel-coated beads, particles and the like, where the beads are packed into a column and the sample is loaded onto and passed through the column to capture the reduced-complexity complexes (e.g., with subsequent elution of the target). Another example of a binding moiety for embodiments of the invention is a hapten, such as digoxin, attached, for example, to the 5' end of an amplification primer. Digoxin can be captured with an anti-digoxin antibody, e.g., a matrix coated with or comprising the anti-digoxin antibody.

In certain embodiments, the binding moiety is biotin, and a capture matrix coated with streptavidin, e.g., beads such as paramagnetic particles, is used to separate the target nucleic acid/transcription product complexes from non-specifically hybridized nucleic acids. For example, when biotin is the binding moiety, a streptavidin (SA)-coated matrix, such as SA-coated beads (e.g., magnetic beads/particles), is used to capture the biotin-labeled nucleic acid analog/target complexes. The SA-bound complexes are washed, and the hybridized target nucleic acids are eluted from the complexes for sequencing.

Bait sequences corresponding to at least one region of the genome can be provided in parallel on a solid support using maskless array synthesis technology. Alternatively, the probes can be obtained serially using a standard DNA synthesizer and applied to the solid support, or can be obtained from an organism and immobilized on the solid support. After hybridization, nucleic acids that did not hybridize, or that hybridized non-specifically, to the nucleic acid analogs are separated from the support-bound nucleic acid analogs by washing. The remaining nucleic acids, which are specifically bound to the nucleic acid analogs, are eluted from the solid support, e.g., in hot water or in a nucleic acid elution buffer comprising, e.g., TRIS buffer and/or EDTA, to produce an eluate enriched in target nucleic acid molecules.

Alternatively, the bait sequences for the target molecules can be synthesized on a solid support as described above, released from the solid support as a pool of bait sequences, and amplified. The released, transcribed pool of nucleic acid analogs can be immobilized covalently or non-covalently on a support, such as glass, metal, ceramic or polymer beads, or another solid support. The nucleic acid analogs can be designed for convenient release from the solid support, for example by providing an acid-labile or base-labile nucleic acid sequence at or near the end of the nucleic acid analog closest to the support, so that the analogs are released under low or high pH conditions, respectively. A variety of cleavable linking compounds are known in the art. The support can be provided, for example, as a column with a liquid inlet and outlet. Methods for loading nucleic acids onto a support are familiar in the art, for example incorporating biotin-labeled nucleotides into the nucleic acid analogs and coating the support with streptavidin, so that the coated support non-covalently attracts and immobilizes the nucleic acid analogs in the pool. The sample is passed over the support comprising the nucleic acid analogs under hybridization conditions, and the target nucleic acid molecules that hybridize to the immobilized support can then be eluted for later analysis or other purposes.

The term "nucleic acid" may include, for example but not limited to: deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and artificial nucleic acids such as peptide nucleic acid (PNA), morpholino nucleic acid, locked nucleic acid (LNA), glycerol nucleic acid (glycol nucleic acid, GNA) and threose nucleic acid (TNA). Herein, the terms "nucleic acid", "nucleic acid sequence" and "nucleic acid molecule" should be interpreted broadly, and may be, for example, an oligomer or polymer of ribonucleic acid (RNA) or deoxyribonucleic acid (DNA) or a mimic thereof. The terms include molecules composed of naturally occurring nucleobases, sugars and covalent internucleoside (backbone) linkages, as well as molecules with similar functions composed of non-naturally occurring nucleobases, sugars and covalent internucleoside (backbone) linkages, or combinations thereof. Because of desirable properties, such as enhanced affinity for nucleic acid target molecules and increased stability in the presence of nucleases and other enzymes, such modified or substituted nucleic acids may be preferred over the native forms, and are described herein by the terms "nucleic acid analog" or "nucleic acid mimic". Preferred examples of nucleic acid mimics are molecules comprising peptide nucleic acid (PNA), locked nucleic acid (LNA), xylo-locked nucleic acid (xylo-LNA), phosphorothioate, 2'-methoxy, 2'-methoxyethoxy, morpholino nucleic acid and phosphoramidate, or functionally similar nucleic acid derivatives.

Embodiment

Embodiment 1: the design of bait sequences

1000 sites were randomly selected from exons and introns of the human genome (their distribution is shown in Table 1) to test the method of the invention. Bait sequences were designed for these 1000 random target sequences and used in the subsequent tests.

Table 1: the chromosome distribution in randomly selected 1000 sites

Chromosome	Number	Chromosome	Number
chr1	92	chr12	73
chr2	67	chr13	23
chr3	53	chr14	15
chr4	43	chr15	29
chr5	45	chr16	41
chr6	124	chr17	36
chr7	42	chr18	14
chr8	46	chr19	31
chr9	34	chr20	21
chr10	61	chr21	9
chr11	80	chr22	21

Bait sequence design comprises the following steps:

1. First, the characteristics of the target sequences are analyzed, comprising the following steps:

A) The target sequences are divided into 5 grades according to GC content, where grade 1: 10%-30%; grade 2: 30%-40%; grade 3: 40%-60%; grade 4: 60%-70%; grade 5: 70%-90%;

B) The spatial structure of the target sequences is analyzed, and target sequences that can form a stable spatial structure are marked;
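The grading in step A) can be sketched as follows. This is a minimal illustration, not the inventors' software: the function names are hypothetical, and assigning a boundary value such as exactly 30% to the higher grade is an assumption, since the text lists overlapping endpoints.

```python
from typing import Optional


def gc_content(seq: str) -> float:
    """Return the GC fraction (0-1) of a DNA sequence, ignoring case."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Exclusive upper bounds of grades 1-4; grade 5 covers the rest up to 90%.
# Putting a boundary value (e.g. exactly 30%) in the higher grade is an assumption.
_GRADE_UPPER_BOUNDS = ((0.30, 1), (0.40, 2), (0.60, 3), (0.70, 4))


def gc_grade(seq: str) -> Optional[int]:
    """Return grade 1-5 as in step A), or None outside the 10%-90% range."""
    gc = gc_content(seq)
    if gc < 0.10 or gc > 0.90:
        return None
    for upper, grade in _GRADE_UPPER_BOUNDS:
        if gc < upper:
            return grade
    return 5
```

A sequence outside 10%-90% GC returns None here, which a caller could route to the flanking-region fallback of step 4.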

2. Second, criteria and scoring are established for the bait sequences:

A) Target sequence length is in the range of 60-150 bp;

B) Specificity is maintained. The principle of specificity is that the thermodynamic stability of a bait sequence bound to a non-target region is significantly lower than that of the bait sequence bound to its target region. The index generally analyzed is T_m(target region) - T_m(non-specific region) ≥ 5 °C; for part of the data, T_m(target region) - T_m(non-specific region) ≥ 10 °C is used for comparison (strong specificity limit). Different thermodynamic calculation methods strongly affect the result; here the calculation uses the nearest-neighbor method based on the SantaLucia 2007 thermodynamic parameter table;

C) No secondary structure is formed. Secondary structure includes dimers and hairpin structures, i.e., the designed bait sequences are not allowed to form dimers or hairpin structures. For the dimer formed between any two bait sequences, T_m ≤ 47 °C, and for part of the data ≤ 37 °C is used for comparison (strict dimer limit); for the hairpin structure formed by any bait sequence itself, T_m ≤ 47 °C, and for part of the data ≤ 37 °C is used for comparison (strict hairpin structure limit). Different thermodynamic calculation methods strongly affect the result; here the calculation uses the nearest-neighbor method based on the SantaLucia 2007 thermodynamic parameter table;

D) For each target region, the candidate bait sequences are analyzed. A composite score is built from each candidate's specificity, dimer, hairpin structure and position relative to the target region, and the optimal one or more bait sequences (i.e., those with the largest scoring function value) are selected according to the score: S = a × S_Specificity + b × S_Dimer + c × S_{Hairpin structure} + d × S_{Relative distance}, where a = 0.26-0.34, b = 0.08-0.12, c = 0.17-0.23 and d = 0.35-0.45. The scores are calculated by in-house software according to the following rules:

S_Specificity scoring rule: any newly designed bait sequence is aligned against the genome using BLAT with default parameters, and the thermodynamic Tm is calculated for each alignment result. If for any result the difference between the target region T_m and the non-specific region T_m is < 5 °C, the bait sequence is discarded and redesigned (< 10 °C for part of the data, as a comparison); otherwise the average Tm of all alignment results is calculated, and finally S_Specificity = 1 - Tm_{Average value}/(Tm_Target - 5) (for part of the data, S_Specificity = 1 - Tm_{Average value}/(Tm_Target - 10), as a comparison), where Tm_{Average value} is the average Tm of the alignments between the bait sequence and all non-specific regions, and Tm_Target is the T_m of the bait sequence with its target region;

S_Dimer scoring rule: any newly designed bait sequence is compared for dimer formation against every bait sequence already designed, using BLAT with default parameters, and the thermodynamic Tm is calculated for each comparison result. If any T_m ≥ 47 °C, the bait sequence is discarded and redesigned; otherwise the average Tm of all comparison results is calculated, and finally S_Dimer = (47 - Tm_{Average value})/47. For part of the data, as a comparison, the bait sequence is discarded and redesigned if any T_m ≥ 37 °C; otherwise the average Tm of all comparison results is calculated and S_Dimer = (37 - Tm_{Average value})/37;

S_{Hairpin structure} scoring rule: for any bait sequence, its optimal self-alignment structure is calculated with the Smith-Waterman algorithm, and the thermodynamic Tm is calculated from this structure. If T_m ≥ 47 °C, the bait sequence is discarded and redesigned; otherwise S_{Hairpin structure} = (47 - Tm)/47. For part of the data, as a comparison, the bait sequence is discarded and redesigned if T_m ≥ 37 °C; otherwise S_{Hairpin structure} = (37 - Tm_{Average value})/37;

S_{Relative distance} scoring rule: given the coordinates of the target region to be designed, the difference δ_Distance between the coordinates of any bait sequence and those of the target region is calculated. The acceptable difference is set to 150, which is an empirical value; if the difference is greater than 150, the bait sequence is discarded and redesigned; otherwise S_{Relative distance} = (150 - δ_Distance)/150. Where no suitable bait sequence can be designed within 150 of the target region coordinates, the acceptable difference is also set to 300 for part of the data, as a comparison, with S_{Relative distance} = (300 - δ_Distance)/300.
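The scoring rules of step 2 can be sketched as follows. This is an illustrative sketch only: the Tm values are taken as inputs (the text obtains them from BLAT alignments and a nearest-neighbor model, which are not reproduced here), the function names are hypothetical, the weights are the midpoints of the stated ranges, and returning a perfect sub-score when there are no off-target or dimer hits is an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class Weights:
    # Midpoints of the ranges in step D); choosing midpoints is an assumption.
    a: float = 0.30  # specificity
    b: float = 0.10  # dimer
    c: float = 0.20  # hairpin structure
    d: float = 0.40  # relative distance


def s_specificity(tm_target: float, off_target_tms: Sequence[float],
                  min_gap: float = 5.0) -> Optional[float]:
    """1 - mean(off-target Tm)/(Tm_target - min_gap); None means discard."""
    if any(tm_target - tm < min_gap for tm in off_target_tms):
        return None
    if not off_target_tms:
        return 1.0  # assumption: no off-target hit at all
    return 1.0 - mean(off_target_tms) / (tm_target - min_gap)


def s_dimer(dimer_tms: Sequence[float], limit: float = 47.0) -> Optional[float]:
    """(limit - mean dimer Tm)/limit; None if any dimer Tm reaches the limit."""
    if any(tm >= limit for tm in dimer_tms):
        return None
    if not dimer_tms:
        return 1.0  # assumption: nothing to dimerize with yet
    return (limit - mean(dimer_tms)) / limit


def s_hairpin(hairpin_tm: float, limit: float = 47.0) -> Optional[float]:
    """(limit - hairpin Tm)/limit; None if the hairpin Tm reaches the limit."""
    if hairpin_tm >= limit:
        return None
    return (limit - hairpin_tm) / limit


def s_distance(delta: float, max_delta: float = 150.0) -> Optional[float]:
    """(max_delta - delta)/max_delta; None if delta exceeds max_delta."""
    if delta > max_delta:
        return None
    return (max_delta - delta) / max_delta


def composite_score(tm_target: float, off_target_tms: Sequence[float],
                    dimer_tms: Sequence[float], hairpin_tm: float,
                    delta: float, w: Weights = Weights()) -> Optional[float]:
    """Weighted sum S; None if any rule says the bait must be redesigned."""
    parts = (s_specificity(tm_target, off_target_tms), s_dimer(dimer_tms),
             s_hairpin(hairpin_tm), s_distance(delta))
    if any(p is None for p in parts):
        return None
    spec, dim, hp, dist = parts
    return w.a * spec + w.b * dim + w.c * hp + w.d * dist
```

The stricter comparison settings correspond to calling the helpers with `min_gap=10.0`, `limit=37.0` or `max_delta=300.0`.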

3. Third, the bait sequence copy number is compensated according to the characteristics of each target region:

A) According to the grade of the target sequence, the bait sequence copy number of grade 3 is taken as the reference copy number (i.e., baseline 1). The bait sequences corresponding to grades 1 and 5 need a larger increase in copy number, 2.5 times that of grade 3; next come grades 2 and 4, whose bait sequences also need a somewhat higher copy number, 3.5 times that of grade 3;

B) For target sequences that form a stable spatial structure, the bait sequence copy number is doubled;

C) When the target region is a region of particular interest, for example a region where a fusion event may occur, the bait sequence copy number is doubled;

D) In addition, a parallel test without bait sequence copy number compensation is carried out under the same conditions as a control.
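The compensation rules of step 3 can be sketched as a lookup plus two optional doublings. The multipliers below are copied as stated in step A); treating the doublings of B) and C) as multiplicative with each other and with the grade factor is an assumption, and the function name is hypothetical.

```python
from typing import Mapping

# Copy-number multipliers relative to grade 3, as stated in step A).
GRADE_MULTIPLIER: Mapping[int, float] = {1: 2.5, 2: 3.5, 3: 1.0, 4: 3.5, 5: 2.5}


def compensated_copy_number(grade: int,
                            stable_structure: bool = False,
                            region_of_interest: bool = False,
                            multipliers: Mapping[int, float] = GRADE_MULTIPLIER) -> float:
    """Relative bait copy number (grade 3 without extras = 1.0)."""
    if grade not in multipliers:
        raise ValueError(f"unknown GC grade: {grade}")
    factor = multipliers[grade]
    if stable_structure:      # step B): stable spatial structure
        factor *= 2
    if region_of_interest:    # step C): e.g. a possible fusion region
        factor *= 2
    return factor
```

Passing a different `multipliers` mapping makes it easy to run the uncompensated control of step D), for example with every grade set to 1.0.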

4. Finally, when no probe can be designed for a target sequence, for example when the target region has an extremely high or extremely low GC content, or is a low-complexity region (a region composed of very few kinds of elements such as oligonucleotides, e.g., simple repeats such as microsatellites), no bait sequence can be designed in the region itself, i.e., the bait sequence coverage is zero; a suitable region for bait design can then be sought on either side of the target region. Bait sequences are generally designed within 300 bp on either side; if a suitable bait sequence can be designed within 150 bp, this is recorded as a comparison. Of the randomly selected target sequences in this embodiment, 138 fell into this category: for 68, bait sequences were successfully designed within 150 bp on either side; for a further 22, bait sequences were successfully designed at 150-300 bp on either side; and for the remaining 48, no probe could be designed in any of these regions.
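The flanking-region fallback of step 4 can be sketched as a two-tier search. This is a hedged illustration: the candidate representation (offset in bp from the target region plus a composite score), the tier labels and the function name are all hypothetical, and preferring any bait within 150 bp over a higher-scoring bait at 150-300 bp is an assumption drawn from the stated preference for the nearer window.

```python
from typing import Iterable, Optional, Tuple

# (offset from the target region in bp, composite score); hypothetical layout.
Candidate = Tuple[int, float]


def pick_flanking_bait(candidates: Iterable[Candidate]) -> Optional[Tuple[str, Candidate]]:
    """Pick the best-scoring bait within 150 bp, else within 150-300 bp, else None."""
    pool = list(candidates)
    near = [c for c in pool if 0 <= c[0] <= 150]
    if near:
        return "within_150bp", max(near, key=lambda c: c[1])
    far = [c for c in pool if 150 < c[0] <= 300]
    if far:
        return "150_300bp", max(far, key=lambda c: c[1])
    return None  # no bait can be designed for this target
```

Tallying the returned tiers over all problem targets would give counts of the kind reported above (68 within 150 bp, 22 at 150-300 bp, 48 with no design, 138 in total).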

5. The final bait sequence design results are shown in Table 2.

Table 2: Bait sequence design results

The conditions of the strict scoring function limit are: difference between target region T_m and non-specific region T_m ≥ 10 °C, with S_Specificity = 1 - Tm_{Average value}/(Tm_Target - 10); T_m < 37 °C, with S_Dimer = (37 - Tm_{Average value})/37; T_m < 37 °C, with S_{Hairpin structure} = (37 - Tm_{Average value})/37.

Embodiment 2: the preparation of bait sequences

The bait sequences designed in Embodiment 1 were prepared as follows:

1. A specific sequence 20 bases long is added to each of the 5' and 3' ends of the bait sequence. The design principles for the specific sequences are: 1) they produce no non-specific amplification product on the target (to-be-captured) genome; 2) their GC content is between 30% and 70%, preferably between 40% and 60%; 3) the two do not form a dimer, or any dimer formed has a T_m ≤ 47 °C, preferably ≤ 37 °C. This gives the sequence to be synthesized. All bait sequences share one pair of specific sequences, exemplified as follows:

5'-end specific sequence - bait sequence (60-150 bp, etc.) - 3'-end specific sequence (SEQ ID NO.1):

ATATAGATGCCGTCCTAGCG-NNNNNNNNNN......NNNNNNNNNN-TGGGCACAGGAAAGATACTT, where "NNNNNNNNNN......NNNNNNNNNN" denotes the bait sequence.
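Assembling the sequence to be synthesized from the two specific sequences above can be sketched as follows. The two flank sequences are taken from SEQ ID NO.1; the function names are hypothetical, and enforcing the 60-150 bp bait length as a hard check is an assumption (the text says "60-150 bp, etc.").

```python
FLANK_5 = "ATATAGATGCCGTCCTAGCG"  # 5'-end specific sequence from SEQ ID NO.1
FLANK_3 = "TGGGCACAGGAAAGATACTT"  # 3'-end specific sequence from SEQ ID NO.1


def gc_fraction(seq: str) -> float:
    """GC fraction (0-1) of a non-empty DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def build_synthesis_oligo(bait: str, flank5: str = FLANK_5, flank3: str = FLANK_3) -> str:
    """Return 5' flank + bait + 3' flank for chip synthesis."""
    bait = bait.upper()
    if set(bait) - set("ACGT"):
        raise ValueError("bait must contain only A, C, G, T")
    if not 60 <= len(bait) <= 150:  # hard limit is an assumption
        raise ValueError("bait length outside 60-150 bp")
    return flank5 + bait + flank3
```

Both example flanks fall inside the preferred 40%-60% GC window (50% and 45%, respectively), which `gc_fraction` can confirm.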

2. The specific sequences are generated by solution-hybridization capture sequencing probe design software developed independently by the inventors.

3. The sequences to be synthesized are synthesized as oligonucleotides on a large scale by chip-based methods well known in the art; the oligonucleotides are then eluted from the chip with ammonium hydroxide, purified and dissolved in distilled water to form an oligonucleotide pool.

4. Using the oligonucleotide pool as template, and a 5'-end primer and a 3'-end primer complementary to the 5'-end and 3'-end specific sequences as primers, polymerase chain reaction amplification is carried out with Taq polymerase (JumpStart Taq DNA Polymerase, purchased from Sigma, Catalog No. D6558) to obtain a large double-stranded DNA pool. The specific steps are as follows:

1) reaction system is as follows:

2) reaction condition is as follows:

3) The PCR product is purified with the QIAGEN PCR purification kit (QIAGEN, Cat No./ID 28104) according to its manual;

4) Using the 5'-end primer carrying a T7 sequence (TAATACGACTCACTATAGGG) at its 5' end as forward primer and the 3'-end primer as reverse primer, polymerase chain reaction amplification is carried out with Taq polymerase (JumpStart Taq DNA Polymerase, purchased from Sigma, Catalog No. D6558) to form a double-stranded DNA pool carrying the T7 sequence at the 5' end. The procedure is as follows:

5) reaction system:

Reagent name	Volume
Water	37 μl
10× PCR Buffer	5 μl
10 mM dATP	1 μl
10 mM dCTP	1 μl
10 mM dGTP	1 μl
10 mM TTP	1 μl
BAITS_5_PRIMER_N-T7 (10 μM)	1 μl
BAITS_3_PRIMER_N (10 μM)	1 μl
JumpStart Taq DNA Polymerase	1 μl
Oligonucleotide pool	1 μl

6) reaction condition is as follows:

The PCR product from the previous step is separated by gel electrophoresis to remove non-specific bands; fragments in the 120-210 bp range are recovered and purified with the Qiagen gel extraction kit (QIAquick Gel Extraction Kit, Cat No./ID 28704);

7) Using the T7 High Yield RNA Transcription Kit (Vazyme, TR101-01/02), with NTPs of a nucleic acid analog (glycerol nucleic acid GNA, locked nucleic acid LNA, peptide nucleic acid PNA, threose nucleic acid TNA or morpholino nucleic acid) and biotin-labeled UTP as substrates, the gel-purified product from the previous step is transcribed in vitro to prepare a biotin-labeled nucleic acid analog pool:

Incubate at 37 °C for 8-12 hours to obtain the maximum yield of the nucleic acid analog pool; after purification, dilute to 500 ng/μl and store in a -80 °C freezer.

In addition, a parallel test using the standard nucleotides ATP, CTP, GTP, UTP and Biotin-UTP under the same conditions serves as a control.
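The T7 PCR set-up in step 5) and the gel recovery window in step 6) can be cross-checked with a little arithmetic. The sketch below is illustrative: the volumes are copied from the reaction table in step 5), the T7 sequence from step 4), and the helper names are hypothetical.

```python
T7_PROMOTER = "TAATACGACTCACTATAGGG"  # T7 sequence from step 4)
FLANK_LEN = 20                        # length of each specific sequence

# Volumes in microlitres, copied from the reaction table in step 5).
REACTION_UL = {
    "Water": 37.0,
    "10x PCR Buffer": 5.0,
    "10 mM dATP": 1.0,
    "10 mM dCTP": 1.0,
    "10 mM dGTP": 1.0,
    "10 mM TTP": 1.0,
    "BAITS_5_PRIMER_N-T7 (10 uM)": 1.0,
    "BAITS_3_PRIMER_N (10 uM)": 1.0,
    "JumpStart Taq DNA Polymerase": 1.0,
    "Oligonucleotide pool": 1.0,
}


def total_volume(reaction: dict) -> float:
    """Total reaction volume in microlitres."""
    return sum(reaction.values())


def final_concentration(stock: float, volume_ul: float, total_ul: float) -> float:
    """Final concentration, in the same unit as the stock."""
    return stock * volume_ul / total_ul


def t7_amplicon_length(bait_len: int) -> int:
    """Expected amplicon length: T7 sequence + two flanks + bait."""
    return len(T7_PROMOTER) + 2 * FLANK_LEN + bait_len
```

With these numbers the reaction totals 50 μl, each nucleotide ends up at 0.2 mM and each primer at 0.2 μM, and baits of 60-150 bp give amplicons of 120-210 bp, which matches the fragment range recovered from the gel.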

Embodiment 3: target region library capture

1. Preparation of the DNA library for high-throughput capture sequencing:

1) Take 1 μg of genomic DNA of the species under test and randomly fragment it into 150-250 bp fragments using a Bioruptor Pico sonicator;

2) Prepare the pre-capture small-fragment library using the Illumina TruSeq DNA library preparation kit.

2. Target region library hybridization capture is carried out using the prepared nucleic acid analog pool and the small-fragment library of the target species:

1) Blocking primer preparation:

The primers are synthesized according to the above primer sequences, 100 OD of each; each primer is diluted to 1000 μM and the primers are mixed in equal volumes, named Block 1;

2) Cot-1 DNA and salmon sperm DNA are each diluted to 100 ng/μl and mixed in equal volumes, labeled Block 2;

3) Mix 6 μl of Block 1 with 5 μl of Block 2, labeled Block Mix;

4) Mix 1 μg of the small-fragment genomic library with 11 μl of Block Mix, concentrate to 9 μl in a vacuum freeze-drying centrifuge, label as reagent S1, and keep on ice until use;

6) Pre-heat 20 μl of hybridization solution (20× SSPE, 2× Denhardt's, 1 mM EDTA, 1% SDS) on a 65 °C metal bath, labeled S2;

7) Take 5 μl of pure water, add 2 μl of the 500 ng/μl nucleic acid analog pool, mix by gently pipetting several times, label as S3, and keep on ice until use;

8) Set the PCR instrument to: 95 °C, 5 min; 65 °C, 16 h; 65 °C, hold; heated lid 105 °C;

9) Place S1 on the PCR block and start the program; once the program has run at 65 °C for 5 min, place S2 on the PCR block; after a further 5 min of incubation, place S3 on the PCR block and incubate for a further 2 min;

10) Set a pipette to 13 μl, transfer 13 μl of S2 into S3, then transfer 9 μl of S1 into S3, mix thoroughly by gently pipetting several times, seal the tube, close the heated lid, and incubate for 16 hours to hybridize the probes with the library;

11) Place 50 μl of Dynabeads MyOne Streptavidin T1 (Invitrogen, Cat. No. 65601) in a 1.5 ml low-binding centrifuge tube, add 200 μl of binding buffer [0.5 M NaCl (Ambion, Cat. No. AM9760G), 2 mM Tris-HCl, pH 8.0 (Ambion, Cat. No. AM9855G), 0.2 mM EDTA (Ambion, Cat. No. AM9260G)], mix by pipetting, place on a magnetic rack for 1 min, and remove the supernatant;

12) Remove the tube from the magnetic rack, add 200 μl of binding buffer, mix by pipetting, place on the magnetic rack for 1 min, and remove the supernatant;

13) Repeat step 11 twice, for 3 bead washes in total, and finally resuspend the beads in 200 μl of binding buffer;

14) Transfer the probe-library hybridization mixture (product of step 9) into the resuspended beads, seal the tube, and mix on a rotary mixer for 30 min to allow binding;

15) Place the tube on the magnetic rack for 2 min and remove the supernatant;

16) Remove the tube from the magnetic rack, add 200 μl of wash solution 1 [10× SSC (Ambion, Cat. No. AM9763), 1% SDS (Invitrogen, Cat. No. 24730020)] to resuspend the beads, seal the tube, and wash on a rotary mixer for 10 min;

17) Place the tube on the magnetic rack for 2 min and remove the supernatant;

18) Remove the tube from the magnetic rack, add 200 μl of wash solution 2 pre-heated to 65 °C [1× SSC (Ambion, Cat. No. AM9763), 5% SDS (Invitrogen, Cat. No. 24730020)] to resuspend the beads, and incubate at 65 °C on the PCR block for 10 min;

19) Place the tube on the magnetic rack for 2 min and remove the supernatant;

20) Repeat steps 17-18 twice, for 3 washes in total;

21) Add 200 μl of 80% ethanol to the tube, let stand for 30 s, remove all the ethanol, air-dry at room temperature for 2 min, then add 20 μl of pure water and resuspend the beads by gently pipetting several times;
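The volumes used in the hybridization set-up of steps 3) to 10) can be tallied as a quick consistency check. This is a sketch under stated assumptions: step 3) is read as 6 μl of Block 1 plus 5 μl of Block 2 (which matches the 11 μl of Block Mix used in step 4), evaporation during pre-heating is ignored, and the function name is hypothetical.

```python
def hybridization_mix() -> dict:
    """Volumes (ul) and probe mass (ng) implied by steps 3) to 10)."""
    block_mix_ul = 6 + 5                 # Block 1 + Block 2, step 3)
    s1_ul = 9                            # library + Block Mix after concentration, step 4)
    s2_used_ul = 13                      # taken from the 20 ul pre-heated in step 6)
    probe_ul, probe_ng_per_ul = 2, 500   # nucleic acid analog pool, step 7)
    s3_ul = 5 + probe_ul                 # water + probe pool
    return {
        "block_mix_ul": block_mix_ul,
        "probe_ng": probe_ul * probe_ng_per_ul,
        "final_volume_ul": s3_ul + s2_used_ul + s1_ul,
    }
```

Under these assumptions each hybridization contains 1000 ng of the nucleic acid analog pool and 1 μg of library in a final volume of 29 μl.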

3. PCR enrichment of the target region capture product, using the NEB high-fidelity PCR kit (High-Fidelity PCR Kit, New England Biolabs, Catalog #E0553S):

1) reaction system:

Reagent name	Volume
5× Phusion HF	10 μl
10 mM dNTPs	1 μl
Post Primer Mix (10 μM each)	1 μl
Resuspended magnetic beads (step 20)	20 μl
Phusion DNA polymerase	0.5 μl
H₂O	17.5 μl

2) reaction condition is as follows:

3) The PCR product is purified using the Beckman Agencourt AMPure XP Kit [Beckman (p/n A63880)];

4) The target region capture library is subjected to high-throughput sequencing on an Illumina sequencing platform; PE150 mode is recommended for the read length.

3. Results

1) The sequencing library is sequenced on an Illumina HiSeq 4000 high-throughput sequencer to obtain sequencing data for the 1000 sites;

2) Using BWA MEM, the sequencing data are aligned to the human reference genome hg19 with the parameters: bwa mem -M -k 40 -t 8 -R "@RG\tID:Hiseq\tPL:Illumina\tSM:sample", in order to obtain the single nucleotide polymorphisms, insertions or deletions that differ from the reference genome, i.e., the detected gene mutations.

3) The samtools stats tool in samtools-1.2 is used to compute the data size, mapping rate, duplication rate and quality values; the samtools depth tool in the same software is then used to calculate the sequencing depth at each position in the target region;

4) From the sequencing depth at each position in the target region, the numbers of bases with depth ≥ 1, ≥ 4, ≥ 10 and ≥ 20 are counted, and each count is divided by the total number of bases in the target region to obtain the 1× coverage, 4× coverage, 10× coverage and 20× coverage.
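The alignment call of step 2) and the coverage statistics of step 4) can be sketched as follows. This is an illustration under stated assumptions: file paths and function names are hypothetical, paired-end FASTQ input is assumed from the PE150 recommendation, the command is only built (not executed), and the depth input is assumed to be the three tab-separated columns (chromosome, position, depth) that `samtools depth` prints for a single BAM file.

```python
from typing import Dict, Iterable, List, Sequence


def bwa_mem_command(reference: str, fastq1: str, fastq2: str,
                    sample: str = "sample", threads: int = 8) -> List[str]:
    """Build the argument list for the bwa mem call described in step 2)."""
    # bwa expects the literal two characters backslash + t in the -R string
    # and converts them to tabs in the SAM header.
    read_group = "@RG\\tID:Hiseq\\tPL:Illumina\\tSM:" + sample
    return ["bwa", "mem", "-M", "-k", "40", "-t", str(threads),
            "-R", read_group, reference, fastq1, fastq2]


def coverage_fractions(depth_lines: Iterable[str], target_bases: int,
                       thresholds: Sequence[int] = (1, 4, 10, 20)) -> Dict[int, float]:
    """Fraction of target bases with depth >= each threshold.

    Positions missing from the depth output count as uncovered, because the
    denominator is the total number of bases in the target region.
    """
    if target_bases <= 0:
        raise ValueError("target_bases must be positive")
    counts = {t: 0 for t in thresholds}
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        depth = int(fields[2])
        for t in thresholds:
            if depth >= t:
                counts[t] += 1
    return {t: counts[t] / target_bases for t in thresholds}
```

The list returned by `bwa_mem_command` can be passed to `subprocess.run`, with standard output redirected to a SAM file, on a machine where bwa is installed.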

Table 3: Capture sequencing results for the 1000 sites

Table 3 shows that, taking LNA as an example, the mean depth is 451.53×; the 4× coverage is 94.35% and the 20× coverage is 93.64%, indicating good coverage and uniformity, while the total data volume is only 8.52 Mb reads. The benefits of such a result are: 1) the sequencing volume is small, which effectively reduces cost; 2) the mean sequencing depth is high, i.e., each target site is sequenced many times, so data accuracy is high; 3) the coverage is high, so few sites are missed; 4) the uniformity is good, i.e., most sites have similar coverage depth.

According to the analysis of the comparison data subsets and control data, relative to LNA: without bait sequence copy number compensation, coverage and uniformity decrease by 4.5 and 5.1 percentage points, respectively; with the strong specificity limit, strict dimer limit, strict hairpin structure limit and strict scoring function limit, coverage and uniformity increase by 6.3 and 7.8 percentage points, respectively; coverage and uniformity for regions within 150 bp are 2.3 and 3.8 percentage points higher, respectively, than for regions at 150-300 bp; and in the parallel test with the standard nucleotides ATP, CTP, GTP, UTP and Biotin-UTP at the same ratio, coverage and uniformity decrease by 5.3 and 4.8 percentage points, respectively.

Although the invention has been described with reference to preferred embodiments, it should be understood that the scope of protection of the invention is not limited to the embodiments described herein. Other embodiments of the invention will be readily apparent to those skilled in the art from the description and practice of the invention disclosed herein. The description and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention being defined by the claims.

Claims

1. A method of enriching target sequence nucleic acids from a nucleic acid sample, the method comprising:

a) providing a nucleic acid sample comprising target nucleic acid sequences, and bait sequences that are consistent with the target nucleic acid sequences or characteristic of the target sequences, wherein, for each target region, the bait sequences are the one or more bait sequences with the best composite score in terms of specificity, dimer, hairpin structure and position relative to the target region, the composite score being calculated by the following scoring function: S = a × S_Specificity + b × S_Dimer + c × S_{Hairpin structure} + d × S_{Relative distance}, where a = 0.26-0.34, b = 0.08-0.12, c = 0.17-0.23 and d = 0.35-0.45, and the individual scores are calculated as follows:

S_Specificity score calculation: any newly designed bait sequence is aligned against the genome; for each aligned sequence, the Tm between the bait sequence and the aligned sequence is calculated; the difference between the T_m of the bait sequence with the target region and the T_m with any aligned sequence is ≥ 5 °C; the average Tm between the bait sequence and all aligned sequences is calculated; S_Specificity = 1 - Tm_{Average value}/(Tm_Target - 5), where Tm_{Average value} is the average Tm of the alignments between the bait sequence and all non-specific regions, and Tm_Target is the T_m of the bait sequence with the target region;

S_Dimer score calculation: any newly designed bait sequence is compared for dimer formation against every bait sequence already designed; for each aligned sequence, the Tm between the bait sequence and the aligned bait sequence is calculated; the T_m < 47 °C; the average Tm between the bait sequence and all aligned bait sequences is calculated; S_Dimer = (47 - Tm_{Average value})/47;

S_{Hairpin structure} score calculation: for any bait sequence, its optimal self-alignment structure is calculated, and the Tm of that structure is calculated; the T_m < 47 °C, and S_{Hairpin structure} = (47 - Tm)/47, Tm < 47 °C, and S_{Hairpin structure} = (37 - Tm_{Average value})/37;

S_{Relative distance} score calculation: given the target region coordinates, the difference δ_Distance between the coordinates of any newly designed bait sequence and those of the target region is calculated; δ_Distance is less than 150; S_{Relative distance} = (150 - δ_Distance)/150;

b) carrying out in vitro transcription with the bait sequences as template, using the nucleic acid analog GNA, LNA, PNA, TNA or morpholino nucleic acid, to prepare nucleic acid analogs, the nucleic acid analogs having a binding moiety;

c) fragmenting the nucleic acid sample;

d) hybridizing the nucleic acid analogs with the nucleic acid sample, so that the nucleic acid analogs and the target sequence nucleic acids form nucleic acid analog/DNA hybridization complexes;

e) separating, by means of the binding moiety, the nucleic acid analog/DNA hybridization complexes from non-specifically hybridized nucleic acids, and removing non-target nucleic acids.

2. The method according to claim 1, wherein the S_Specificity score is calculated as follows: any newly designed bait sequence is aligned against the genome; for each aligned sequence, the Tm between the bait sequence and the aligned sequence is calculated; the difference between the T_m of the bait sequence with the target region and the T_m with any aligned sequence is ≥ 10 °C; the average Tm between the bait sequence and all aligned sequences is calculated; S_Specificity = 1 - Tm_{Average value}/(Tm_Target - 10), where Tm_{Average value} is the average Tm of the alignments between the bait sequence and all non-specific regions, and Tm_Target is the T_m of the bait sequence with the target region.

3. The method according to claim 1, wherein the S_Dimer score is calculated as follows: any newly designed bait sequence is compared for dimer formation against every bait sequence already designed; for each aligned sequence, the Tm between the bait sequence and the aligned bait sequence is calculated; T_m < 37 °C; the average Tm between the bait sequence and all aligned bait sequences is calculated; S_Dimer = (37 - Tm_{Average value})/37.

4. The method according to any one of claims 1-3, further comprising step f): amplifying the nucleic acid analog/DNA hybridization complexes, so as to enrich the target sequence nucleic acids.

5. The method according to any one of claims 1-3, wherein in step b) the binding moiety is a biotin binding moiety.

6. The method according to any one of claims 1-3, wherein the nucleic acid sample is genomic DNA, RNA, cDNA or mRNA, and where the nucleic acid sample is RNA or mRNA, the method comprises, before step c), a step of reverse transcribing the RNA or mRNA into DNA.

7. The method according to any one of claims 1-3, wherein in step c) the nucleic acid sample is fragmented and a library is prepared.

8. The method according to any one of claims 1-3, wherein the bait sequences are on a solid support.

9. The method according to claim 8, wherein the solid support is a microarray slide.

10. The method according to any one of claims 1-3, wherein the bait sequences have characteristics selected from the following: i) they form no hairpin structure themselves and no dimers with one another; ii) their copy number is compensated according to the GC content and/or spatial structure of the target nucleic acid sequence; iii) when the target region has an extremely high or extremely low GC content, or is a low-complexity region, the regions flanking the target region are used as replacement regions for bait design, the design method being the same as for the target region; iv) they do not bind specifically to sequences in the nucleic acid sample other than the target nucleic acid sequence.