CN108203847A

CN108203847A - Library, reagent and application for quality assessment of second-generation sequencing

Info

Publication number: CN108203847A
Application number: CN201711330411.7A
Authority: CN
Inventors: 廖莎; 闫东东; 章文蔚; 徐崇钧; 陈奥; 陈莹; 赵杰; 许军强; 傅德丰; 何琳
Original assignee: Shenzhen Hua Made Dazhi Technology Co Ltd
Current assignee: Shenzhen Hua Made Dazhi Technology Co Ltd; MGI Tech Co Ltd
Priority date: 2016-12-16
Filing date: 2017-12-13
Publication date: 2018-06-26
Anticipated expiration: 2037-12-13
Also published as: CN108203847B; HK1250759A1

Abstract

This application discloses a library, a reagent, and applications for quality assessment of second-generation sequencing. The library is a single-stranded DNA library of known sequences with different base features, with an adapter sequence and an index sequence incorporated into the library; the single-stranded DNA library includes at least one of high-AT-content single-stranded DNA, high-GC-content single-stranded DNA, single-stranded DNA with poly structures, and single-stranded DNA with hairpin structures. Sequencing this library of known sequences makes it possible to assess the influence of, and the bias introduced by, different base features, so that sequencing quality can be assessed, the biases corrected, and sequencing optimized. The method of this application for improving nucleic acid sequencing accuracy compares the sequencing results with the known sequences to obtain the sequencing bias associated with each base feature, which guides improvement of the sequencing software algorithm and thereby improves sequencing accuracy; the method can effectively reduce sequencing bias and provides a simple and effective way to improve sequencing accuracy.

Description

Library, reagent and application for quality assessment of second-generation sequencing

Technical field

This application relates to the field of nucleic acid sequencing quality evaluation, and in particular to a library, a reagent, and applications for quality assessment of second-generation sequencing.

Background art

High-throughput sequencing is a very important technology that plays a vital role in biological research and clinical practice, and its role in precision medicine is growing steadily. As second-generation sequencing becomes more important in precision medicine, the accuracy demanded of it rises accordingly. The mainstream second-generation sequencing platforms, such as Illumina and Proton, can reach 99.9% accuracy, but longer read lengths and more complex base composition can both reduce sequencing accuracy. For second-generation sequencing to better serve precision medicine, sequencing technology must be improved continuously.

The basic workflow of current second-generation sequencing comprises construction of the sequencing library, amplification of the library signal, conversion of the base signal by a sequencing enzyme into an optical signal that the sequencer can recognize, and finally conversion of the optical signal back into base information by computer software.

Several steps of this basic workflow readily introduce sequencing errors and thus lower sequencing accuracy: (1) during library construction, fragmentation may cause base mutations or deletions, and PCR amplification may introduce mispaired bases; (2) signal amplification of the library is usually also carried out by PCR, so the fidelity of the enzyme may likewise introduce sequencing errors; (3) when the base signal is converted into an electrical signal, the dNTPs used are usually modified dNTPs, and the sequencing enzyme must be engineered to accept them, which can compromise its fidelity to some extent and thereby reduce sequencing accuracy; (4) when the data analysis software finally converts the optical signal into base information, factors such as fluorescence background, impurities, and weak signal may cause signal-processing errors.

Normally, first-generation Sanger sequencing can be used to verify the accuracy of second-generation sequencing. However, that method is cumbersome and is poorly suited to sequencing technology development, where the error rate introduced at each step of the sequencing workflow needs to be reduced in a targeted manner.

Summary of the invention

The purpose of this application is to provide a new library, reagent, and applications for quality assessment of second-generation sequencing.

One aspect of this application discloses a library for quality assessment of second-generation sequencing. The library is a single-stranded DNA library of known sequences with different base features, and an adapter sequence and an index sequence are incorporated into the library. The single-stranded DNA library of known sequences with different base features includes at least one of high-AT-content single-stranded DNA, high-GC-content single-stranded DNA, single-stranded DNA with poly structures, and single-stranded DNA with hairpin structures.

It should be noted that, when unknown sequences are sequenced in practice, various base features may affect sequencing accuracy, such as high AT content, high GC content, poly structures, and hairpin structures. This application inventively uses artificial synthesis to produce a single-stranded DNA library of known sequences carrying these base features. By comparing the sequencing results with the known sequences, one can determine what specific sequencing biases occur on the sequencing platform in use, and thereby assess second-generation sequencing quality. These sequencing biases can then be corrected in a targeted manner to improve sequencing accuracy.

It will be appreciated that, in addition to assessing second-generation sequencing quality, the library of this application can, as stated above, also be used to correct and optimize second-generation sequencing so as to improve sequencing accuracy or sequencing quality.

It should also be noted that, for ease of use, and to shorten the library construction workflow and reduce the base errors that it introduces, this application preferably incorporates the adapter sequence and the index sequence into the library in advance. That is, the adapter sequence and the index sequence are synthesized directly together with the library sequence, which avoids a separate reaction step for adding them to the library. The specific adapter and index sequences may follow those of existing sequencing platforms and are not limited here.

Preferably, both ends of the library also carry universal primer binding sequences.

It should be noted that the universal primer binding sequences allow libraries of different sequences to be amplified with the same primer pair. For example, if the six libraries of this application share the same universal primer binding sequences, a single primer pair suffices to amplify all six, and no separate primer pair is needed for each library.

Preferably, the library of this application consists of at least one of the sequences shown in SEQ ID NO.7, SEQ ID NO.8, SEQ ID NO.9, SEQ ID NO.10, SEQ ID NO.11, and SEQ ID NO.12.

It should be noted that the libraries of SEQ ID NO.7 to 12 are one implementation of this application; the six libraries have been verified to be effective for assessing and optimizing second-generation sequencing quality. Guided by this application, those skilled in the art can synthesize further libraries on this basis for quality assessment or optimization of second-generation sequencing.

Another aspect of this application discloses a cloning vector comprising a plasmid and an insert, wherein the insert comprises the library of this application.

Preferably, the plasmid is pMD18-T or pMD19-T.

It should be noted that, in a preferred embodiment of this application, the artificially synthesized library sequence is inserted into a plasmid. The library then needs to be synthesized only once, after which the library sequence can be obtained without limit through plasmid replication.

Another aspect of this application discloses an engineered bacterium comprising a host bacterium and the cloning vector of this application introduced into and maintained in the host bacterium.

Preferably, the host bacterium used in this application is Escherichia coli.

It should be noted that, once the library has been cloned into the plasmid, the single-stranded DNA library needs to be synthesized only once and can then be used indefinitely without resynthesis, which reduces synthesis costs. For later use, it suffices to culture the engineered bacterium and extract the plasmid to obtain the required library. Moreover, because the library sequence already contains the sequencing adapter of the corresponding sequencing platform, sequencing can proceed after simple library preparation. The whole process is simple, convenient, and highly stable.

Another aspect of this application discloses a reagent for quality assessment of second-generation sequencing, the reagent comprising the library, the cloning vector, or the engineered bacterium of this application.

The library, cloning vector, and engineered bacterium of this application can each be used to assess second-generation sequencing quality, or to correct and optimize second-generation sequencing so as to improve sequencing accuracy or quality; any of them can therefore be made into a kit for convenient use.

Preferably, the reagent of this application further comprises universal primers, the forward primer having the sequence shown in SEQ ID NO.13 and the reverse primer having the sequence shown in SEQ ID NO.14.

It should be noted that the universal primers are designed against the universal primer binding sequences at both ends of the library and can amplify the library or the cloning vector to obtain the library sequence. For ease of use, this application places the universal primers in the kit as a separately packaged component.

It should also be noted that a cloning vector such as pMD18-T or pMD19-T has its own plasmid amplification primers. Primers can be designed against the plasmid so that different inserts are amplified with the same primers; in that case no universal primer binding sequences need to be designed at the ends of the library, and library amplification or sequencing can use the plasmid amplification primers directly, without separate universal primers of SEQ ID NO.13 and SEQ ID NO.14. Which approach is used is not limited here.

More preferably, the reagent of this application further comprises a splint oligo having the sequence shown in SEQ ID NO.15.

It should be noted that the splint oligo serves to circularize the library DNA. One implementation of this application sequences with DNA nanoball technology, which requires a circularized library. It will be appreciated that, if DNA nanoball technology is not used, the splint oligo may be omitted; this is not specifically limited here.

A further aspect of this application discloses use of the library, the cloning vector, the engineered bacterium, or the reagent of this application in: assessing the relationship between bases and sequencing quality; assessing the base preference and accuracy of an amplification enzyme; assessing the accuracy of a sequencing enzyme; assessing or improving base signal extraction; testing second-generation sequencing accuracy; detecting the error rate of each step from library construction to sequencing; or optimizing the protocol of each step from library construction to sequencing.

It should be noted that the library of this application, and the cloning vector, engineered bacterium, and reagent based on it, can all be used for quality assessment of second-generation sequencing. The principle is to analyze the deviation between the sequencing results and the known library sequences; this deviation can be used to assess sequencing quality, to assess the accuracy of the amplification enzyme or sequencing enzyme, or as a basis for optimization. It will be appreciated that, on this principle, the library, cloning vector, engineered bacterium, and reagent of this application can be used to assess, test, or optimize every step of the second-generation sequencing workflow; this is not specifically limited here.

A further aspect of this application discloses a method for improving nucleic acid sequencing accuracy, comprising: sequencing a single-stranded DNA library of known sequences with different base features; comparing the sequencing results with the known sequences; statistically analyzing the sequencing bias associated with each base feature; and correcting the sequencing software algorithm according to that bias, thereby improving nucleic acid sequencing accuracy. The single-stranded DNA library of known sequences with different base features includes at least one of high-AT-content single-stranded DNA, high-GC-content single-stranded DNA, single-stranded DNA with poly structures, and single-stranded DNA with hairpin structures.
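The comparison step of this method can be illustrated with a minimal Python sketch. It is not the applicant's software: the function names, the toy known sequence, and the toy reads are all hypothetical, and the comparison is assumed to be ungapped (substitutions only), whereas a real pipeline would align reads and also handle insertions and deletions.

```python
from collections import Counter

def mismatch_profile(known, read):
    """List (position, expected, observed) for each mismatch, 0-based, ungapped."""
    n = min(len(known), len(read))
    return [(i, known[i], read[i]) for i in range(n) if known[i] != read[i]]

def substitution_counts(known, reads):
    """Tally how often each expected base was read as each other base."""
    counts = Counter()
    for read in reads:
        for _, expected, observed in mismatch_profile(known, read):
            counts[(expected, observed)] += 1
    return counts

# Hypothetical known sequence and reads, for illustration only.
known = "GGCGCCGGAA"
reads = ["GGCGCCGGAA", "GGCGCCGGTA", "GGCGCCGCAA"]
counts = substitution_counts(known, reads)
```

Tallies of this kind, grouped by base feature, are one possible form of the "sequencing bias" statistics that the method feeds back into the base-calling software.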

Preferably, the single-stranded DNA with poly structures includes at least one of single-stranded DNA with poly A, poly T, poly G, and poly C structures.

Preferably, the single-stranded DNA library is the library of this application.

It should be noted that the method of this application for improving nucleic acid sequencing accuracy is in fact based on the library of this application: following the principle of this application, it assesses the quality of second-generation sequencing and then optimizes and improves sequencing accuracy. It will be appreciated that, on the same principle and on the basis of this method, this application can also provide a method for assessing nucleic acid sequencing quality, a method for assessing the relationship between bases and sequencing quality, a method for assessing the base preference and accuracy of an amplification enzyme, a method for assessing the accuracy of a sequencing enzyme, a method for assessing or improving base signal extraction, a method for testing second-generation sequencing accuracy, a method for detecting the error rate of each step from library construction to sequencing, a method for optimizing the protocol of each such step, and so on; these are not specifically limited here.

It should also be noted that the method of this application can improve the accuracy of nucleic acid sequencing and can likewise be used to assess the base preference and accuracy of an amplification enzyme. For example, comparing the sequencing results of the single-stranded DNA library before and after amplification with an amplification enzyme reveals the effect of that enzyme on sequencing bias, which serves to assess its accuracy; analyzing the specific types of sequencing bias reveals its base preference. The accuracy of a sequencing enzyme is assessed on a similar principle. In addition, a key reason the method can improve nucleic acid sequencing accuracy is that the sequencing software algorithm is corrected after the sequencing results have been compared with the known sequences, and this correction includes the processing of base signal extraction. The method can therefore also be applied to improve or assess base signal extraction.

By adopting the above technical solution, this application achieves the following beneficial effects:

The library of this application incorporates a variety of base features whose structures are controlled by design. Sequencing these sequences of known base features makes it possible to evaluate the influence of different base features on second-generation sequencing and the bias they cause, so that sequencing quality can be assessed, the biases corrected in a targeted manner, and second-generation sequencing optimized. The method of this application for improving nucleic acid sequencing accuracy inventively exploits the base features of the library: by comparing the sequencing results with the known library sequences, it obtains the sequencing bias associated with each base feature, which guides improvement of the sequencing software algorithm and thereby improves sequencing accuracy. The method can effectively reduce sequencing bias and provides a simple and effective way to improve sequencing accuracy.

Description of the drawings

Fig. 1 shows the sequencing result for the first 50 bp of the library of SEQ ID NO.7 in the example of this application;

Fig. 2 shows the sequencing result for the first 50 bp of the library of SEQ ID NO.8 in the example of this application;

Fig. 3 shows the sequencing result for the first 50 bp of the library of SEQ ID NO.9 in the example of this application;

Fig. 4 shows the sequencing result for the first 50 bp of the library of SEQ ID NO.10 in the example of this application;

Fig. 5 shows the sequencing result for the first 50 bp of the library of SEQ ID NO.11 in the example of this application;

Fig. 6 shows the sequencing result for the first 50 bp of the library of SEQ ID NO.12 in the example of this application;

Fig. 7 shows the Q30 distribution of the high-GC library in the example of this application.

Detailed description

Through extensive experiments and research, the applicant found that, across the wide variety of samples encountered in practical sequencing, the complexity of base composition is currently an important factor affecting second-generation sequencing quality. For example, for a sequence with evenly distributed AT and GC content and few poly structures or hairpin structures, both Illumina and Proton can reach 99.9% accuracy; but for sequences with high AT content, high GC content, or many poly structures and hairpin structures, sequencing accuracy drops substantially and may even fail to meet the demand for accurate sequencing in precision medicine.

For this reason, this application inventively proposes and develops a single-stranded DNA library of known sequences with different base features. The sequences contain a variety of specially designed base features, including high AT content, high GC content, poly structures, and hairpin structures; in one implementation of this application, the library is the six single-stranded DNAs shown in SEQ ID NO.7 to SEQ ID NO.12. The library designed in this application is sequenced by second-generation sequencing, and the deviation between the sequencing results and the known library sequences is analyzed for each known base feature. This reveals the accuracy or quality of second-generation sequencing under each base feature, and the deviations identified are corrected to optimize second-generation sequencing.

It should be noted that, before constructing the library, this application first designed a set of standard nucleic acids containing the base features required for the library, and then selected part or all of each standard nucleic acid sequence for library construction. In one implementation of this application, the standard nucleic acids consist of at least one of six single-stranded DNAs, whose sequences are, in order, those shown in SEQ ID NO.1, SEQ ID NO.2, SEQ ID NO.3, SEQ ID NO.4, SEQ ID NO.5, and SEQ ID NO.6. The libraries of SEQ ID NO.7 to 12 correspond, in order, to the standard nucleic acids of SEQ ID NO.1 to 6.

This application is described in further detail below by way of a specific example. The example only illustrates this application further and should not be construed as limiting it.

Example

This example first designed a set of standard nucleic acid sequences containing base features such as high AT content, high GC content, poly structures, and hairpin structures, then designed libraries from these standard nucleic acid sequences and added the BGISEQ adapter sequence, an index sequence, and universal primer binding sequences to each library sequence. The designed library sequences were artificially synthesized and inserted into the pMD19-T plasmid, and the plasmid was introduced into E. coli to produce the engineered bacterium. Plasmid extracted from the engineered bacterium provides the library sequence, which is used for second-generation sequencing to assess sequencing quality. The details are as follows:

I. Design of the standard nucleic acids

Based on base features that are relatively common in practical sequencing, such as high AT content, high GC content, poly structures, and hairpin structures, this example designed six standard nucleic acid sequences, each paired with a different index sequence. Details are shown in Table 1.

Table 1. Sequences of the standard nucleic acids

The six standard nucleic acid sequences of this example comprise two high-GC sequences, two high-AT sequences, and two random sequences. The two random sequences are ordinary sequences with similar proportions of A, C, G, and T and serve as controls for comparative analysis. Each standard nucleic acid sequence corresponds to one index sequence, that is, a barcode sequence, used to distinguish the different sequences. The two high-GC sequences and the two high-AT sequences contain hairpin structures and poly structures.
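As a rough illustration of how two of these base features could be quantified for a candidate sequence, the following Python sketch computes GC fraction and locates poly (homopolymer) runs. The helper names and the `min_len=4` threshold are assumptions made for illustration, not values from this application; the test fragment is a 20-nt stretch taken from SEQ ID NO.3 in the sequence listing. Hairpin detection is not covered.

```python
import re

def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

def homopolymer_runs(seq, min_len=4):
    """Return (0-based start, run) for each run of one base at least min_len long.

    min_len=4 is an illustrative threshold, not a value from the application.
    """
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(0)) for m in pattern.finditer(seq.upper())]

# 20-nt fragment of SEQ ID NO.3 (a high-GC standard sequence containing a poly A run).
fragment = "cagcggtccgaaaaacacgg"
gc = gc_fraction(fragment)
runs = homopolymer_runs(fragment)
```

Applied to the full standard sequences, the same two functions give a quick check that a design actually carries the intended high-AT, high-GC, or poly features.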

II. Library sequence design and construction

For each of the six standard nucleic acid sequences designed in this example, most of the sequence was selected to build a library; an adapter sequence suitable for BGISEQ was inserted into the library, and identical universal primer binding sequences were attached to both ends of the library sequences of the six standard nucleic acid sequences. The library sequences designed for the six single-stranded DNA standard nucleic acid sequences of SEQ ID NO.1 to SEQ ID NO.6 are, in order, the sequences shown in SEQ ID NO.7 to SEQ ID NO.12.

SEQ ID NO.7：

5’-GATATCTGCAGGCATAGAATGAATATTATTGAATCAATAATTAAAGTCGGAGGCCAAGCGGTCTTA GGAAGACAAACTAGTACGTCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTTACAACTACAGATAAT GGGCTGGATACATGGAATGATTATAGATATATTAAGGAATAATGTTAATTAATGCCTAAATTAATTAATCTAAGGGG GTTAATACTTCAGCCTGTGATATC-3’；

SEQ ID NO.8：

5’-GATATCTGCAGGCATGAATAATAATGGAATAGCAATAATTAAAGTCGGAGGCCAAGCGGTCTTAGG AAGACAACGATCAGTACCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTATATAATGTAATACATAA TATTAATATATTAATTATTGTATGATTGTTATCTATTACAGTCTAGTACTGACCCGTAGACATATATGCCCCCGATT AATTACTTATCAGCCTGTGATATC-3’；

SEQ ID NO.9：

5’-GATATCTGCAGGCATCGGCCGCGGCGTCCAGTGCGCGGCGCTAGAGCCGGCAAGTCGGAGGCCAAG CGGTCTTAGGAAGACAACGCTATGTACCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTCCGCCGCG GTCGCTTGTCCGGCCGCCGGTCCGGCGCCGGCGGCGCAAAGTGCCAGGCCGAGCCGGCGAACCAGCGGTCCGAAAAA CACGGACACTCAGCCTGTGATATC-3’；

SEQ ID NO.10：

5’-GATATCTGCAGGCATCACCGCCGAGGCCGCGGCGGAGACCGCCGGCGCAGGAAGTCGGAGGCCAAG CGGTCTTAGGAAGACAACAGAGTGTACCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTCAAACTAC CGGCGCGGCGCTCCTCCGGCCGTCCGCCGCCGACCGGCGGCGGCGTTCCGGTGTGGCACTCCAGGTGGCCGGTTCTC TGCCAAGCGTCAGCCTGTGATATC-3’；

SEQ ID NO.11：

5’-GATATCTGCAGGCATGAAGAACAACCCCGCACGACGCCTACCAACCAAGTCGGAGGCCAAGCGGTC TTAGGAAGACAACTGTATCGTACAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTGCTGTTCGCGGCC GATGTTCGTATAAGATATAAGTTTGGGTATATTCCAGTTTATCGATCGTATCGAAATGTATGAGTTTATACAGGTCC TACTTCAACTCAGCCTGTGATATC-3’；

SEQ ID NO.12：

5’-GATATCTGCAGGCATACTAGACCAGTTCATTATTATAGTGCTAGCCAAAGTCGGAGGCCAAGCGGT CTTAGGAAGACAAACATCAACGTCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTGACGGATTCCCT CGCTTTCTATTGGCTGACAGTACAAGTAACATAGGTTGGGTCGGTTAACCCTGCCGTCACAAGTGGAACGATGTTAA TAGTTGCGGTCAGCCTGTGATATC-3’；

In the six library sequences above, "GATATCTGCAGGCAT" is the universal primer binding sequence at the 5' end and "TCAGCCTGTGATATC" is the universal primer binding sequence at the 3' end; the universal primers are designed against these two segments. "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAANNNNNNNNNNCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT" is the adapter sequence containing the index sequence, where "NNNNNNNNNN" is the 10 bp index sequence. The index sequence of the library of SEQ ID NO.7 is "ACTAGTACGT", that of SEQ ID NO.8 is "CGATCAGTAC", that of SEQ ID NO.9 is "CGCTATGTAC", that of SEQ ID NO.10 is "CAGAGTGTAC", that of SEQ ID NO.11 is "CTGTATCGTA", and that of SEQ ID NO.12 is "ACATCAACGT".
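The layout described above (5' primer site, insert segment, adapter containing a 10 bp index, insert segment, 3' primer site) can be checked mechanically. The Python sketch below is an illustration only: `parse_library` and the regular expression are hypothetical helpers, and the toy sequence uses shortened 12-nt insert segments taken from the start of the two insert regions of SEQ ID NO.7 rather than the full library.

```python
import re

PRIMER_SITE_5 = "GATATCTGCAGGCAT"
PRIMER_SITE_3 = "TCAGCCTGTGATATC"
ADAPTER_LEFT = "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"
ADAPTER_RIGHT = "CAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTT"

LAYOUT = re.compile(
    "^" + PRIMER_SITE_5
    + "(?P<insert_left>[ACGT]*?)"
    + ADAPTER_LEFT + "(?P<index>[ACGT]{10})" + ADAPTER_RIGHT
    + "(?P<insert_right>[ACGT]*)"
    + PRIMER_SITE_3 + "$"
)

def parse_library(seq):
    """Split a library sequence into its insert segments and index; whitespace is ignored."""
    cleaned = re.sub(r"\s+", "", seq).upper()
    match = LAYOUT.match(cleaned)
    if match is None:
        raise ValueError("sequence does not follow the expected library layout")
    return match.groupdict()

# Toy library: real primer sites, adapter, and SEQ ID NO.7 index, but shortened inserts.
toy = (PRIMER_SITE_5 + "AGAATGAATATT" + ADAPTER_LEFT + "ACTAGTACGT"
       + ADAPTER_RIGHT + "TACAACTACAGA" + PRIMER_SITE_3)
parts = parse_library(toy)
```

Because whitespace is stripped before matching, the same function can be pointed at a full library sequence copied from the listing even if it contains line-wrap spaces.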

For the universal primers of this example, the forward primer is the sequence shown in SEQ ID NO.13 and the reverse primer is the sequence shown in SEQ ID NO.14:

SEQ ID NO.13：5’-GATATCTGCAGGCAT-3’；

SEQ ID NO.14：5’-GATATCACAGGCTGA-3’.
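The relationship between the two universal primers and the primer binding sequences at the library ends can be verified with a short Python sketch. The `reverse_complement` helper is written here for illustration; the sequences are those given in this example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

PRIMER_SITE_5 = "GATATCTGCAGGCAT"   # 5' universal primer binding sequence of the library
PRIMER_SITE_3 = "TCAGCCTGTGATATC"   # 3' universal primer binding sequence of the library
FORWARD_PRIMER = "GATATCTGCAGGCAT"  # SEQ ID NO.13
REVERSE_PRIMER = "GATATCACAGGCTGA"  # SEQ ID NO.14

# The forward primer is identical to the 5' site; the reverse primer is the
# reverse complement of the 3' site, so it anneals to the listed strand there.
forward_matches = FORWARD_PRIMER == PRIMER_SITE_5
reverse_matches = REVERSE_PRIMER == reverse_complement(PRIMER_SITE_3)
```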

This example sequences with DNA nanoball technology, which requires the library DNA to be circularized. For this purpose a splint oligo was designed, having the sequence shown in SEQ ID NO.15:

SEQ ID NO.15：5’-ATGCCTGCAGATATCGATATCACAGGCTGA-3’.
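How this splint oligo can bridge the two ends of a library strand may be seen from the following Python sketch, which assumes the strand being circularized is the one written in the listing (5' primer site at the start, 3' primer site at the end). The helper function is illustrative, not part of this application.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

STRAND_START_5 = "GATATCTGCAGGCAT"  # first 15 nt of each library strand
STRAND_END_3 = "TCAGCCTGTGATATC"    # last 15 nt of each library strand
SPLINT_OLIGO = "ATGCCTGCAGATATCGATATCACAGGCTGA"  # SEQ ID NO.15

# In the closed circle, the 3' end is joined directly to the 5' start.
junction = STRAND_END_3 + STRAND_START_5

# The splint is the reverse complement of that junction, so it can hold the
# two ends side by side for ligation.
splint_spans_junction = reverse_complement(junction) == SPLINT_OLIGO
```

Each half of the 30-nt splint thus pairs with 15 nt on one side of the ligation point.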

The libraries of SEQ ID NO.7 to SEQ ID NO.12, the universal primers, and the splint oligo of this example were all synthesized by Sangon Biotech (Shanghai).

III. Construction of the cloning vector and engineered bacterium

The synthesized library sequences were cloned, and the cloning vector was introduced into E. coli. The cloning vector and engineered bacterium of this example were constructed by GenScript (Nanjing).

IV. Obtaining the library

The preserved engineered bacterium was cultured overnight at 37 °C in LB medium, and the plasmid was extracted with a Thermo Fisher extraction kit according to the kit's instructions. The extracted plasmid was amplified by PCR with the universal primers, and the PCR product can be used directly for sequencing after circularization.

1. Plasmid extraction

Plasmid extraction in this example used a plasmid extraction kit, and the extraction steps followed the kit's instructions; they are not repeated here.

2. PCR amplification

The 100 μL PCR reaction contained: 20 μL of 5× high-fidelity enzyme reaction buffer, 5 μL of dNTP mix (10 mM each), 1 μL of high-fidelity enzyme (1 U/μL), 6 μL of 20 μM forward primer, 6 μL of 20 μM reverse primer, 1 μL of extracted plasmid template, and 61 μL of ddH₂O, for a total of 100 μL.
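The reaction composition can be sanity-checked with a few lines of Python. The dictionary keys and the `final_concentration` helper are names chosen for this sketch; the volumes and stock concentrations are those stated for this reaction.

```python
# Component volumes in microliters, as stated for the 100 uL reaction.
components_ul = {
    "5x high-fidelity buffer": 20.0,
    "dNTP mix (10 mM each)": 5.0,
    "high-fidelity enzyme (1 U/uL)": 1.0,
    "forward primer (20 uM)": 6.0,
    "reverse primer (20 uM)": 6.0,
    "plasmid template": 1.0,
    "ddH2O": 61.0,
}

total_ul = sum(components_ul.values())

def final_concentration(stock, volume_ul, total_volume_ul):
    """Final concentration after dilution, in the same unit as the stock."""
    return stock * volume_ul / total_volume_ul

primer_final_um = final_concentration(20.0, 6.0, total_ul)  # each primer, in uM
dntp_final_mm = final_concentration(10.0, 5.0, total_ul)    # each dNTP, in mM
buffer_final_x = final_concentration(5.0, 20.0, total_ul)   # buffer strength
```

Under these figures the volumes sum to the stated total, each primer ends up at 1.2 μM, each dNTP at 0.5 mM, and the buffer at 1×.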

PCR cycling conditions: 98 °C for 3 min; then 33 cycles of 98 °C for 20 s, 60 °C for 15 s, and 72 °C for 30 s; after cycling, 72 °C for 5 min, then hold at 4 °C.

3. Circularization of the PCR product

This example first purified the PCR product with magnetic beads and then circularized the purified product using the BGISEQ-500 SE50 circularization library construction kit and workflow. The specific circularization steps follow the kit's instructions and are not repeated here.

V. Library sequencing verification and sequencing accuracy testing

To verify that the synthesized libraries of known sequence are compatible with sequencing on the BGISEQ platform, the six libraries of SEQ ID NO.7 to SEQ ID NO.12 obtained in this example were verified by SE50+10 sequencing with the BGISEQ-500 SE50 kit.

The circularized products of the six libraries were used to prepare DNBs according to the BGISEQ-500 operating procedure. 15 μL of each prepared DNB was then mixed into a 90 μL DNB pool, the chip was prepared according to the standard procedure, and sequencing was run in SE50+10 mode.

The sequencing results show that the reads of the six libraries of SEQ ID NO.7 to SEQ ID NO.12 could be distinguished by index sequence, and that the first 50 bp sequenced from each of the six libraries were identical to the actual standard nucleic acid sequence. The first 50 bp results for the six libraries are shown in Figs. 1 to 6, which correspond in order to the libraries of SEQ ID NO.7 to SEQ ID NO.12. This indicates that the libraries were constructed successfully and that the base-calling algorithm is accurate.
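Distinguishing the pooled reads by index sequence amounts to a lookup on the 10 bp index read. The Python sketch below is a hypothetical illustration: it assumes each read arrives as an (insert read, index read) pair, that the index is read in the orientation listed in this example, and that only exact matches are accepted; the toy insert reads are made up.

```python
from collections import defaultdict

# Index sequences of the six libraries, as given in this example.
INDEX_TO_LIBRARY = {
    "ACTAGTACGT": "SEQ ID NO.7",
    "CGATCAGTAC": "SEQ ID NO.8",
    "CGCTATGTAC": "SEQ ID NO.9",
    "CAGAGTGTAC": "SEQ ID NO.10",
    "CTGTATCGTA": "SEQ ID NO.11",
    "ACATCAACGT": "SEQ ID NO.12",
}

def demultiplex(read_pairs):
    """Group insert reads by library using exact index matches (no error tolerance)."""
    bins = defaultdict(list)
    for insert_read, index_read in read_pairs:
        library = INDEX_TO_LIBRARY.get(index_read.upper(), "unassigned")
        bins[library].append(insert_read)
    return dict(bins)

# Made-up reads, for illustration only.
toy_reads = [
    ("TACAACTACA", "ACTAGTACGT"),
    ("CCGCCGCGGT", "CGCTATGTAC"),
    ("TACAACTACG", "ACTAGTACGT"),
    ("GGGGGGGGGG", "NNNNNNNNNN"),
]
bins = demultiplex(toy_reads)
```

Once binned, each group of insert reads can be compared against the known sequence of its own library.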

VI. Assessment of sequencing quality

To compare the relationship between sequencing quality and bases, this example used the BGISEQ-500 SE100+10 sequencing kit to perform SE100 sequencing on the high-AT-content library of SEQ ID NO.7 (the high-AT library) and the high-GC-content library of SEQ ID NO.9 (the high-GC library).

DNB preparation and chip preparation were the same as in section five above (library sequencing verification and sequencing accuracy testing), except that in this experiment DNBs were prepared only for the libraries of SEQ ID NO.7 and SEQ ID NO.9, which were then sequenced on the instrument in SE100 mode.

Comparing the sequencing quality of the two libraries, as shown in Table 2, the high-GC library of SEQ ID NO.9 has a lower Q30 and a higher error rate than the high-AT library of SEQ ID NO.7. Later improvements to sequencing technology can therefore be targeted at GC-rich libraries.

Table 2. Comparison of sequencing quality between the two libraries

Name	PredQual	GC content (%)	Q10 (%)	Q20 (%)	Q30 (%)	EsErr (%)
High-AT library	33	27.05	99.16	98.02	91.44	0.23
High-GC library	33	75.47	98.16	94.18	85.05	0.68
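For readers unfamiliar with Q values, the Python sketch below shows the standard Phred relationship between a quality score Q and the base-call error probability p, namely p = 10^(-Q/10), together with a helper for the share of bases at or above a threshold such as Q30. The function names and the toy quality list are assumptions for illustration; the code does not reproduce how the platform software derived the figures in Table 2.

```python
def phred_to_error_probability(q):
    """Error probability implied by a Phred quality score: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def fraction_at_least(qualities, threshold):
    """Share of base calls whose quality score is at or above the threshold."""
    return sum(q >= threshold for q in qualities) / len(qualities)

# Toy per-base quality scores, for illustration only.
toy_qualities = [35, 32, 28, 30]
q30_share = fraction_at_least(toy_qualities, 30)
p_q30 = phred_to_error_probability(30)
```

A Q30 base call therefore corresponds to an estimated error probability of about 1 in 1000, which is why a lower Q30 percentage goes together with a higher estimated error rate.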

In addition, the relationship between bases and quality values was analyzed further. Fig. 7 shows the Q30 distribution of the high-GC library: Q30 drops markedly at 60 bp, 68 bp, 81 bp, 91 bp, and 97 bp. The corresponding positions in the sequence share a common feature, namely that when base G is followed by A, the sequencing quality of the A deteriorates. This provides a direction for later optimization of sequencing technology.
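A sequence-context observation of this kind can be screened for in any known sequence by listing the positions where one base immediately follows another. The Python sketch below is illustrative only: the function name and the toy sequence are hypothetical, and it does not attempt to reproduce the positions reported for Fig. 7, since the mapping between read cycle and library coordinate is not given here.

```python
def positions_following(seq, first="G", second="A"):
    """1-based positions of `second` wherever it immediately follows `first`."""
    seq = seq.upper()
    return [i + 2 for i in range(len(seq) - 1)
            if seq[i] == first and seq[i + 1] == second]

# Toy sequence, for illustration only.
toy = "CCGACGGAAT"
ga_positions = positions_following(toy)
```

Positions found this way could then be lined up against a per-cycle quality profile to see whether quality dips coincide with the G-then-A context.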

Thus, the standard nucleic acids of this example and the libraries based on them can be used in second-generation sequencing to assess the base preference and accuracy of sequencing, to test second-generation sequencing accuracy, and to assess second-generation sequencing quality. Furthermore, analyzing the sequencing results against base features allows targeted optimization that improves the accuracy of nucleic acid sequencing.

The foregoing is a further detailed description of the present application in conjunction with specific embodiments, and the specific implementation of the present application should not be regarded as limited to these descriptions. A person of ordinary skill in the art to which this application belongs may make several simple deductions or substitutions without departing from the concept of the present application.

SEQUENCE LISTING

<110>Shenzhen Hua Da life science institute

<120>For library, reagent and the application of the assessment of two generation sequencing qualities

<130> 17I25566-A23542

<160> 15

<170> PatentIn version 3.3

<210> 1

<211> 150

<212> DNA

<213>Artificial sequence

<400> 1

tacaactaca gataatgggc tggatacatg gaatgattat agatatatta aggaataatg 60

ttaattaatg cctaaattaa ttaatctaag ggggttaata ctatgtgtta attaatctta 120

ttagaatgaa tattattgaa tcaataatta 150

<210> 2

<211> 150

<212> DNA

<213>Artificial sequence

<400> 2

atataatgta atacataata ttaatatatt aattattgta tgattgatat ctattacagt 60

ctagtactga cccgtagaca tatatgcccc cgattaatta cttaggctta ttaataatat 120

ataggaataa taatggaata gcaataatta 150

<210> 3

<211> 150

<212> DNA

<213>Artificial sequence

<400> 3

ccgccgcggt cgcttgtccg gccgccggtc cggcgccggc ggcgcaaagt gccaggccga 60

gccggcgaac cagcggtccg aaaaacacgg acacggtaac ctcaccacga tggccggccg 120

cggcgtccag tgcgcggcgc tagagccggc 150

<210> 4

<211> 150

<212> DNA

<213>Artificial sequence

<400> 4

caaactaccg gcgcggcgct cctccggccg tccgccgccg accggcggcg gcgttccggt 60

gtggcactcc aggtggccgg ttctctgcca agcggcaggc gaaaaatcga cggccaccgc 120

cgaggccgcg gcggagaccg ccggcgcagg 150

<210> 5

<211> 150

<212> DNA

<213>Artificial sequence

<400> 5

gctgttcgcg gccgatgttc gtataagata taagtttggg tatattccag tttatcgatc 60

gtatcgaaat gtatgagttt atacaggtcc tacttcaaca agcggcactt tactaccgtg 120

aagaacaacc ccgcacgacg cctaccaacc 150

<210> 6

<211> 150

<212> DNA

<213>Artificial sequence

<400> 6

gacggattcc ctcgctttct attggctgac agtacaagta acataggttg ggtcggttaa 60

ccctgccgtc acaagtggaa cgatgttaat agttgcggaa ccctatgttc ggcggaatac 120

tagaccagtt cattattata gtgctagcca 150

<210> 7

<211> 244

<212> DNA

<213>Artificial sequence

<400> 7

gatatctgca ggcatagaat gaatattatt gaatcaataa ttaaagtcgg aggccaagcg 60

gtcttaggaa gacaaactag tacgtcaact ccttggctca cagaacgaca tggctacgat 120

ccgactttac aactacagat aatgggctgg atacatggaa tgattataga tatattaagg 180

aataatgtta attaatgcct aaattaatta atctaagggg gttaatactt cagcctgtga 240

tatc 244

<210> 8

<211> 244

<212> DNA

<213>Artificial sequence

<400> 8

gatatctgca ggcatgaata ataatggaat agcaataatt aaagtcggag gccaagcggt 60

cttaggaaga caacgatcag taccaactcc ttggctcaca gaacgacatg gctacgatcc 120

gacttatata atgtaataca taatattaat atattaatta ttgtatgatt gttatctatt 180

acagtctagt actgacccgt agacatatat gcccccgatt aattacttat cagcctgtga 240

tatc 244

<210> 9

<211> 244

<212> DNA

<213>Artificial sequence

<400> 9

gatatctgca ggcatcggcc gcggcgtcca gtgcgcggcg ctagagccgg caagtcggag 60

gccaagcggt cttaggaaga caacgctatg taccaactcc ttggctcaca gaacgacatg 120

gctacgatcc gacttccgcc gcggtcgctt gtccggccgc cggtccggcg ccggcggcgc 180

aaagtgccag gccgagccgg cgaaccagcg gtccgaaaaa cacggacact cagcctgtga 240

tatc 244

<210> 10

<211> 244

<212> DNA

<213>Artificial sequence

<400> 10

gatatctgca ggcatcaccg ccgaggccgc ggcggagacc gccggcgcag gaagtcggag 60

gccaagcggt cttaggaaga caacagagtg taccaactcc ttggctcaca gaacgacatg 120

gctacgatcc gacttcaaac taccggcgcg gcgctcctcc ggccgtccgc cgccgaccgg 180

cggcggcgtt ccggtgtggc actccaggtg gccggttctc tgccaagcgt cagcctgtga 240

tatc 244

<210> 11

<211> 244

<212> DNA

<213>Artificial sequence

<400> 11

gatatctgca ggcatgaaga acaaccccgc acgacgccta ccaaccaagt cggaggccaa 60

gcggtcttag gaagacaact gtatcgtaca actccttggc tcacagaacg acatggctac 120

gatccgactt gctgttcgcg gccgatgttc gtataagata taagtttggg tatattccag 180

tttatcgatc gtatcgaaat gtatgagttt atacaggtcc tacttcaact cagcctgtga 240

tatc 244

<210> 12

<211> 244

<212> DNA

<213>Artificial sequence

<400> 12

gatatctgca ggcatactag accagttcat tattatagtg ctagccaaag tcggaggcca 60

agcggtctta ggaagacaaa catcaacgtc aactccttgg ctcacagaac gacatggcta 120

cgatccgact tgacggattc cctcgctttc tattggctga cagtacaagt aacataggtt 180

gggtcggtta accctgccgt cacaagtgga acgatgttaa tagttgcggt cagcctgtga 240

tatc 244

<210> 13

<211> 15

<212> DNA

<213>Artificial sequence

<400> 13

gatatctgca ggcat 15

<210> 14

<211> 15

<212> DNA

<213>Artificial sequence

<400> 14

gatatcacag gctga 15

<210> 15

<211> 30

<212> DNA

<213>Artificial sequence

<400> 15

atgcctgcag atatcgatat cacaggctga 30

Claims

1. A library for assessing second-generation sequencing quality, characterized in that: the library is a single-stranded DNA library of known sequences with different base features, and an adapter sequence and an index sequence are attached to the library; the single-stranded DNA library of known sequences with different base features comprises at least one of high-AT-content single-stranded DNA, high-GC-content single-stranded DNA, poly-structure single-stranded DNA, and hairpin-structure single-stranded DNA;

Preferably, both ends of the library also carry universal primer binding sequences.

2. The library according to claim 1, characterized in that: the single-stranded DNA library consists of at least one of the sequence shown in SEQ ID NO.7, the sequence shown in SEQ ID NO.8, the sequence shown in SEQ ID NO.9, the sequence shown in SEQ ID NO.10, the sequence shown in SEQ ID NO.11, and the sequence shown in SEQ ID NO.12.

3. A cloning vector comprising a plasmid and an insert fragment, characterized in that: the insert fragment comprises the library according to claim 1 or 2;

Preferably, the plasmid is pMD18-T or pMD19-T.

4. An engineered bacterium comprising a recipient bacterium and the cloning vector according to claim 3, the cloning vector being introduced into and stored in the recipient bacterium; preferably, the recipient bacterium is Escherichia coli.

5. A reagent for assessing second-generation sequencing quality, characterized in that: the reagent comprises the library according to claim 1 or 2, the cloning vector according to claim 3, or the engineered bacterium according to claim 4.

6. The reagent according to claim 5, characterized in that: the reagent further comprises universal primers, wherein the upstream primer of the universal primers is the sequence shown in SEQ ID NO.13 and the downstream primer is the sequence shown in SEQ ID NO.14.

7. The reagent according to claim 5 or 6, characterized in that: the reagent further comprises a splint oligo, wherein the splint oligo is the sequence shown in SEQ ID NO.15.

8. Use of the library according to claim 1 or 2, the cloning vector according to claim 3, the engineered bacterium according to claim 4, or the reagent according to any one of claims 5-7 in assessing the relationship between bases and sequencing quality, assessing the base preference and accuracy of amplification enzymes, assessing the accuracy of sequencing enzymes, assessing and improving base signal extraction, detecting second-generation sequencing accuracy, detecting the error rate of each step from library construction to sequencing, or optimizing the protocol of each step from library construction to sequencing.

9. A method for improving nucleic acid sequencing accuracy, characterized in that: the method comprises sequencing a single-stranded DNA library of known sequences with different base features, comparing the sequencing results with the known sequences, statistically analyzing the sequencing deviations associated with the different base features, and correcting the sequencing software algorithm according to the sequencing deviations, thereby improving nucleic acid sequencing accuracy; the single-stranded DNA library of known sequences with different base features comprises at least one of high-AT-content single-stranded DNA, high-GC-content single-stranded DNA, poly-structure single-stranded DNA, and hairpin-structure single-stranded DNA;

Preferably, the poly-structure single-stranded DNA comprises at least one of poly A structure single-stranded DNA, poly T structure single-stranded DNA, poly G structure single-stranded DNA, and poly C structure single-stranded DNA.

10. The method according to claim 9, characterized in that: the single-stranded DNA library is the library according to claim 1 or 2.