CN108073791B

CN108073791B - Method for detecting target gene structural variation based on second-generation sequencing data

Info

Publication number: CN108073791B
Application number: CN201711320249.0A
Authority: CN
Inventors: 郎继东; 田埂
Original assignee: Code Gene Technology (suzhou) Co Ltd
Current assignee: Code gene technology (Suzhou) Co., Ltd.
Priority date: 2017-12-12
Filing date: 2017-12-12
Publication date: 2019-02-05
Anticipated expiration: 2037-12-12
Also published as: CN108073791A

Abstract

The present invention discloses a method for detecting target gene structural variation based on second-generation sequencing data. The method starts directly from raw sequencing data and detects gene structural variation through quick and simple alignments, removing the need to tune the parameters of complex alignment algorithms. It noticeably shortens detection time, and it can complete the analysis with single-end sequencing alone, which reduces the data volume and therefore the sequencing cost. It also achieves higher sensitivity and specificity in the detection results.

Description

Method for detecting target gene structural variation based on second-generation sequencing data

Technical field

The present invention relates generally to the field of genetic testing, and in particular to a method for detecting target gene structural variation based on second-generation sequencing data.

Background technique

As sequencing costs continue to fall, a growing number of methods use second-generation sequencing to predict structural variations (SVs) at the genomic DNA level. At present there are four main approaches to detecting structural variation from second-generation sequencing data:

The first is the read-depth method (Read depth). For a single sample, detection relies on the depth distribution of the sample's short sequences (reads) on a reference genome; for paired samples (case-control), deleted and duplicated regions are identified by comparing the two samples. Its disadvantage is that it is strongly affected by the amplification bias and sequencing bias introduced during the experiment, so the results are often not very accurate.

The second is the paired-end sequencing method (Paired-end sequencing). This method can detect structural variations such as large-fragment insertions, deletions, inversions and translocations, but it is limited by the standard deviation of the sequencing insert length (the insert is the DNA fragment produced by fragmentation while the DNA sequencing library is built, before sequencing). Most such methods also depend heavily on the alignment algorithm, and parameter settings strongly influence the results.

The third is the split-read method (Split reads). This method can locate the breakpoints of a structural variation precisely from the soft-clipped reads in the alignment, but in addition to being strongly affected by repetitive sequences in the genome, it likewise depends too heavily on the alignment algorithm and its parameter settings.

The fourth is the assembly method (De novo assembly). Although this method can detect structural variation more directly, assembling the short sequences produced by second-generation sequencing remains difficult because of repetitive regions in the genome, and the cost also rises considerably.

In summary, methods for predicting structural variation at the genomic DNA level based on second-generation sequencing still need higher speed, higher sensitivity and higher specificity.

Summary of the invention

To solve at least part of the technical problems in the prior art, the present invention provides a method for detecting target gene structural variation based on second-generation sequencing data. With the method of the invention, gene structural variation can be analyzed simply and quickly, and the analysis results are reliable, highly sensitive and highly specific. Specifically, the invention includes the following.

A first aspect of the invention provides a method for detecting target gene structural variation based on second-generation sequencing data, comprising the following steps:

(1) in a set consisting of multiple raw sequencing reads, trimming m bases from the 5' end and from the 3' end of each raw sequencing read, then taking n bases from each of the 5' end and the 3' end of the trimmed sequence to form query sequence A and query sequence B, wherein m is an integer from 0 to 20 and n is an integer from 27 to 50;

(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using a whole-genome sequence as a second reference sequence library, wherein the total sequence length of the first reference sequence library is less than the total sequence length of the second reference sequence library;

(3) performing a first alignment of query sequences A and B, separately, against the sequences in the first reference sequence library,

if query sequences A and B exactly match sequences of different target genes in the first reference sequence library, taking query sequences A and B as a first aligned sequence pair,

if query sequences A and B cannot exactly match sequences in the first reference sequence library, or query sequences A and B exactly match sequences within the same gene in the first reference sequence library, terminating the alignment and removing query sequences A and B;

(4) performing a second alignment of the first aligned sequence pair against the sequences in the second reference sequence library,

if each sequence of the first aligned sequence pair exactly matches a sequence in the second reference sequence library, and the exactly matching sequence pair is a uniquely aligned sequence pair, taking the first aligned sequence pair as a second aligned sequence pair,

if in the second alignment result the number of mismatches is 1 or more, or the sequence pair with 0 mismatches is not a uniquely aligned sequence pair, terminating the alignment and removing the first aligned sequence pair.

A second aspect of the invention provides another method for detecting target gene structural variation based on second-generation sequencing data, comprising the following steps:

(3) performing a first alignment of query sequences A and B, separately, against the sequences in the second reference sequence library,

if query sequences A and B each exactly match a sequence in the second reference sequence library, and the exactly matching sequence pair is a uniquely aligned sequence pair, taking query sequences A and B as a first aligned sequence pair,

if in the first alignment result the number of mismatches is 1 or more, or the sequence pair with 0 mismatches is not a uniquely aligned sequence pair, terminating the alignment and removing query sequences A and B;

(4) performing a second alignment of the first aligned sequence pair against the sequences in the first reference sequence library,

if the sequences of the first aligned sequence pair exactly match sequences of different target genes in the first reference sequence library, taking the first aligned sequence pair as a second aligned sequence pair,

if the first aligned sequence pair cannot exactly match the corresponding sequences in the first reference sequence library, or exactly matches sequences within the same gene in the first reference sequence library, terminating the alignment and removing the first aligned sequence pair.

The methods according to the first and second aspects of the invention include repeating steps (3) and (4). Preferably, they further comprise determining that the target gene structural variation is present when the number of second aligned sequence pairs is 2 or more, and otherwise determining that the target gene structural variation is absent.

In the methods according to the first and second aspects of the invention, the target gene structural variation includes at least one of gene fusion, gene inversion and gene translocation.

In the methods according to the first and second aspects of the invention, the length p of each raw sequencing read in the set consisting of multiple raw sequencing reads is 75-350 bp, and p > (n+m) × 2.
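The length constraint can be checked before any alignment is attempted. The following is a minimal sketch, not part of the claimed method; the function name `check_read_parameters` is hypothetical, and the ranges are those stated in the text (p in 75-350, m in 0-20, n in 27-50, p > (n+m) × 2).

```python
def check_read_parameters(p: int, m: int, n: int) -> bool:
    """Return True when read length p, trim length m and segment length n
    satisfy the ranges and the inequality p > (n + m) * 2 given in the text."""
    in_ranges = 75 <= p <= 350 and 0 <= m <= 20 and 27 <= n <= 50
    # The inequality guarantees that segments A and B do not overlap.
    return in_ranges and p > (n + m) * 2
```

For example, with the values used later in Embodiment 1 (p = 151, m = 10, n = 35) the inequality reads 151 > 90 and the check passes.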

In the methods according to the first and second aspects of the invention, the number of candidate target genes is 2-100.

In the method according to the first aspect of the invention, the first alignment uses the BLAST algorithm and the second alignment uses the SOAP algorithm.

In the method according to the second aspect of the invention, the first alignment uses the SOAP algorithm and the second alignment uses the BLAST algorithm.

In the methods according to the first and second aspects of the invention, the multiple raw sequencing reads are merged sequences of paired-end sequencing reads, or single-end sequencing reads.

The method of the invention starts directly from raw sequencing data and detects gene structural variation through quick and simple alignments, removing the need to tune the parameters of complex alignment algorithms. It noticeably shortens detection time, and it can complete the analysis with single-end sequencing alone, which reduces the data volume and therefore the sequencing cost. It also achieves higher sensitivity and specificity in the detection results.

Specific embodiment

Various exemplary embodiments of the invention are now described in detail. This detailed description should not be regarded as limiting the invention, but as a more detailed description of certain aspects, features and embodiments of the invention.

It should be understood that the terms used herein serve only to describe particular embodiments and are not intended to limit the invention. In addition, a numerical range in the invention is understood to specifically disclose the upper and lower limits of the range and every intermediate value between them. Each smaller range between any stated value or intermediate value within a stated range and any other stated value or intermediate value within that range is also encompassed by the invention. The upper and lower limits of these smaller ranges may independently be included in or excluded from the range.

Unless otherwise stated, all technical and scientific terms used herein have the same meanings as commonly understood by a person of ordinary skill in the field of the invention. Although the invention describes only preferred methods and materials, any methods and materials similar or equivalent to those described herein may also be used in the practice or testing of the invention. All documents mentioned in this specification are incorporated by reference to disclose and describe the methods and/or materials related to those documents. In case of conflict with any incorporated document, the content of this specification prevails.

The terms "comprising", "including", "having", "containing" and the like, as used herein, are open-ended terms, meaning including but not limited to. The term "and/or", as used herein, includes any one or all combinations of the listed items.

The method of the first aspect of the invention is a method for detecting structural variation of a target gene based on second-generation sequencing data, comprising the following 4 steps:

(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using the human genome sequence as a second reference sequence library, wherein the total sequence length of the first reference sequence library is less than the total sequence length of the second reference sequence library;

if query sequences A and B exactly match sequences of different genes in the first reference sequence library, taking query sequences A and B as a first aligned sequence pair,

Each step is described in detail below.

Step (1):

In step (1) of the invention, the raw sequencing reads (sometimes called raw sequences) are sequences from second-generation sequencing. The length p of each read is generally 75-350 bp, preferably 100-200 bp. The raw sequencing reads include single-end sequencing reads, or merged sequences of paired-end sequencing reads.

In the method of the invention, each raw sequencing read is processed by trimming m bases from each of the 5' end and the 3' end, wherein m is an integer from 0 to 20, preferably 5-10, to improve the sensitivity of detection. In certain embodiments, m may be 0, that is, the 5' and 3' end sequences of the raw read as obtained are used directly.

In the invention, two segments of the trimmed raw sequencing read are taken as query sequences A and B. The lengths of query sequences A and B are not particularly limited; each is generally n bases, wherein n is an integer from 27 to 50, preferably an integer from 30 to 45. The lengths of query sequences A and B may be the same or different. In certain embodiments, query sequences A and B have the same length, which is 32-36 bp. If n is too small, the accuracy of the detection result is reduced. If n is too large, the speed of detection suffers. When n is an integer from 27 to 50, good accuracy is obtained while detection speed is maintained.
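The trimming and segment extraction of step (1) can be sketched as follows. This is an illustrative sketch only: the function name `extract_end_segments` is hypothetical, and for simplicity it assumes that query sequences A and B share the same length n, which the text allows but does not require.

```python
from typing import Tuple


def extract_end_segments(read: str, m: int, n: int) -> Tuple[str, str]:
    """Trim m bases from each end of a raw read, then return the first and
    last n bases of the trimmed read as query sequences A and B."""
    if not 0 <= m <= 20:
        raise ValueError("m must be an integer from 0 to 20")
    if not 27 <= n <= 50:
        raise ValueError("n must be an integer from 27 to 50")
    if len(read) <= (n + m) * 2:
        raise ValueError("read length p must satisfy p > (n + m) * 2")
    trimmed = read[m:len(read) - m]   # drop m bases at the 5' and 3' ends
    seq_a = trimmed[:n]               # n bases from the 5' end
    seq_b = trimmed[-n:]              # n bases from the 3' end
    return seq_a, seq_b
```

With a 151-base read, m = 10 and n = 35, query sequence A covers bases 11-45 of the raw read and query sequence B covers bases 107-141.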

It should be noted that a raw sequencing read in the invention is the sequence of the nucleic acid in the sample and does not include tags or labels introduced for sequencing, such as library sequences.

In the invention, the set consisting of multiple raw sequencing reads may be the set of all raw reads from second-generation sequencing, or a set of part of those reads. The composition of the set is not particularly limited as long as the set covers the sequence of the target gene. The set of all second-generation sequencing reads of the sample is preferred.

Step (2):

Step (2) is the step of selecting the reference sequence libraries, which comprise a first reference sequence library and a second reference sequence library.

The first reference sequence library consists of the sequences of multiple candidate target genes, wherein the candidate target genes are genes that include the target gene. The number of candidate target genes is 2 or more, preferably 2-100. If the number is too large, detection speed is reduced. If the number is too small, detection is fast, but the range of detectable target genes is narrow and the scope of application is limited. For example, when detecting a fusion of the two genes ALK and EML4, the first reference sequence library must include at least ALK and EML4. Preferably, in the first reference sequence library, the sequence of each candidate target gene is stored in a mutually distinguishable form. The corresponding gene can then easily be determined when query sequence A and/or B matches exactly, so that the structural variation can be attributed to a specific target gene.

The second reference sequence library consists of the complete sequence of the genome of the entire species, that is, the whole-genome sequence. A known whole-genome database may be used. For example, when the target gene is a human gene, the entire human genome sequence may be used as the second reference sequence library. It may be composed of any currently known data. For example, in an exemplary scheme, it may be taken from the sequence library at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/.

In the invention, the total sequence length of the first reference sequence library is less than the total sequence length of the second reference sequence library. Preferably, the total sequence length of the first reference sequence library is 1/10 to 1/10000 of the total sequence length of the second reference sequence library, more preferably 1/10 to 1/1000, and further preferably 1/100 to 1/1000.
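The size relationship between the two libraries can be checked with simple arithmetic. The sketch below is illustrative only; the function names and the toy lengths are assumptions, and the default bounds are the broadest preferred range given in the text (1/10000 to 1/10).

```python
from typing import Dict


def library_size_ratio(first_library: Dict[str, str], second_library_length: int) -> float:
    """Total length of the candidate-gene sequences divided by the total
    length of the whole-genome library."""
    first_total = sum(len(seq) for seq in first_library.values())
    return first_total / second_library_length


def ratio_in_preferred_range(ratio: float, low: float = 1 / 10000, high: float = 1 / 10) -> bool:
    """True when the ratio falls in the preferred range stated in the text."""
    return low <= ratio <= high


# Toy numbers only: two hypothetical 1,500-base gene sequences against a
# hypothetical 3,000,000-base genome give a ratio of 1/1000.
toy_library = {"GENE1": "A" * 1500, "GENE2": "C" * 1500}
toy_ratio = library_size_ratio(toy_library, 3_000_000)
```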

Step (3):

Step (3) uses the first alignment to quickly find first aligned sequence pairs that correspond to potential structural variations. It comprises aligning query sequences A and B, separately, against the sequences in the first reference sequence library. If query sequences A and B exactly match sequences of different genes in the first reference sequence library ("exact match" in the invention means that the number of mismatches is 0), query sequences A and B are taken as a first aligned sequence pair. If query sequences A and B cannot exactly match the corresponding sequences in the first reference sequence library ("cannot exactly match" or "incomplete match" in the invention means that the number of mismatches is 1 or more, or that the sequence cannot be aligned at all), or if query sequences A and B exactly match sequences within the same gene in the first reference sequence library, the alignment is terminated and query sequences A and B are removed.
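The decision logic of step (3) can be sketched in a few lines. This sketch is an assumption-laden stand-in, not the patented implementation: exact substring search replaces a real aligner such as BLAST, reverse-complement matches are ignored, each segment is required to hit exactly one gene, and the names `first_alignment`, `GENE1` and `GENE2` are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def genes_with_exact_match(segment: str, gene_library: Dict[str, str]) -> Set[str]:
    """Names of candidate genes whose sequence contains the segment with
    0 mismatches (substring search stands in for a real aligner)."""
    return {name for name, seq in gene_library.items() if segment in seq}


def first_alignment(seq_a: str, seq_b: str,
                    gene_library: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (gene of A, gene of B) when A and B exactly match different
    candidate genes; return None when the pair must be removed."""
    genes_a = genes_with_exact_match(seq_a, gene_library)
    genes_b = genes_with_exact_match(seq_b, gene_library)
    # Simplification: each segment must hit exactly one candidate gene.
    if len(genes_a) != 1 or len(genes_b) != 1:
        return None
    gene_a, gene_b = next(iter(genes_a)), next(iter(genes_b))
    if gene_a == gene_b:   # both ends fall in the same gene: no candidate SV
        return None
    return gene_a, gene_b
```

Because the gene library stores each sequence under its own key, the function reports which two genes a retained pair spans, which corresponds to the "mutually distinguishable form" described for the first reference sequence library.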

In the invention, the first alignment may be performed with any known algorithm. Preferably, the first alignment is performed with the BLAST algorithm.

Sequences containing a structural variation of the target gene make up only a very small part of whole-genome data, while second-generation sequencing data contain a very large number of raw reads. Aligning every raw read against the whole genome would greatly increase the alignment time and make detection very slow. As described above, the sequence data of the first reference sequence library are small, which substantially shortens the first alignment of each raw read and excludes most raw reads that contain no structural variation, so that potential variation-containing sequences are found quickly.

Step (4):

Step (4) uses the second alignment to verify the first alignment result and thereby improve detection accuracy. It comprises performing a second alignment of the first aligned sequence pair against the sequences in the second reference sequence library. If each sequence of the first aligned sequence pair exactly matches the corresponding sequence in the second reference sequence library, and the exactly matching sequence pair is a uniquely aligned sequence pair (in the invention, "uniquely aligned" means aligning to the reference sequence once and only once), the first aligned sequence pair is taken as a second aligned sequence pair. If in the second alignment result the number of mismatches is 1 or more, or the sequence pair with 0 mismatches is not a uniquely aligned sequence pair, the alignment is terminated and the first aligned sequence pair is removed.
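The uniqueness test of step (4) can be illustrated by counting exact occurrences in the genome. The sketch below is a simplified stand-in under stated assumptions: plain string search replaces an aligner such as SOAP, only the forward strand is searched, and the function names and toy chromosome names are hypothetical.

```python
from typing import Dict, Tuple


def count_exact_occurrences(segment: str, genome: Dict[str, str]) -> int:
    """Count every position, overlapping ones included, at which the
    segment occurs with 0 mismatches in any genome sequence."""
    count = 0
    for chrom_seq in genome.values():
        start = chrom_seq.find(segment)
        while start != -1:
            count += 1
            start = chrom_seq.find(segment, start + 1)
    return count


def second_alignment(pair: Tuple[str, str], genome: Dict[str, str]) -> bool:
    """True when both sequences of the pair align to the genome once and
    only once; False when either has no exact hit or several hits."""
    return all(count_exact_occurrences(seq, genome) == 1 for seq in pair)
```

A segment that falls in a repeated region yields a count above 1 and causes the pair to be removed, which is how this step filters out ambiguous placements.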

In the invention, the second alignment may be performed with any known algorithm. Preferably, the second alignment is performed with the SOAP algorithm.

Owing to the complexity of gene sequences, an alignment against the first reference sequence library alone may contain errors; verification against the second reference sequence library greatly improves the accuracy of the detection result.

It should be noted that, in certain embodiments, each raw read in the set may be analyzed, but this does not mean that every raw read passes through all of steps (1)-(4). Most raw reads are removed at step (3), and a small fraction of raw reads are removed at step (4). In certain embodiments, only part of the raw reads in the set are analyzed. For example, once second aligned sequence pairs are obtained (for example, 2 or more second aligned sequence pairs), the alignment is terminated and the remaining raw reads are not analyzed.

In certain embodiments, the method of the invention includes repeating steps (3) and (4) to obtain multiple second aligned sequence pairs. In certain embodiments, if at least one second aligned sequence pair is present, a structural variation of the gene can be determined to be present. Preferably, when the number x of second aligned sequence pairs is 2 or more, the target gene structural variation is determined to be present; otherwise it is determined to be absent. When x is 2 or more, the structural variation is present in 2 or more raw reads, which further increases the accuracy of detection.
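The calling rule, together with the optional early termination mentioned earlier, can be sketched as a counter over per-read results. This is an illustrative sketch; the function name `structural_variation_present` is hypothetical, and each input value is assumed to indicate whether one raw read produced a second aligned sequence pair.

```python
from typing import Iterable


def structural_variation_present(read_results: Iterable[bool], threshold: int = 2) -> bool:
    """Return True as soon as `threshold` reads have yielded a second
    aligned sequence pair; remaining reads are then left unexamined."""
    supporting = 0
    for passed in read_results:
        if passed:
            supporting += 1
            if supporting >= threshold:
                return True   # early termination once x reaches the threshold
    return False
```

Passing a lazy iterator lets the expensive alignments stop as soon as the threshold is met, since no further results are requested after the function returns.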

Although the steps of the invention are presented with numeric labels (1)-(4), the numbering serves only to distinguish the steps and does not indicate an order among them.

The method according to the second aspect of the invention is another method for detecting structural variation of a target gene based on second-generation sequencing data, comprising the following 4 steps:

(1) the same as step (1) of the first aspect of the invention;

(2) the same as step (2) of the first aspect of the invention;

If the first aligned sequence pair cannot exactly match the corresponding sequences in the first reference sequence library, or exactly matches sequences within the same gene in the first reference sequence library, the subsequent second alignment is terminated and the first aligned sequence pair is removed.

In the method according to the second aspect of the invention, apart from the differences in the first aligned sequence pair and the second aligned sequence pair relative to the method of the first aspect, the remaining names and terms are the same as in the method of the first aspect and are not repeated here.
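The difference between the two aspects is the order of the two filters. The sketch below illustrates the second-aspect order (whole genome first, candidate-gene library second) under the same kind of assumptions as any toy model: exact string search stands in for SOAP and BLAST, only the forward strand is searched, each segment is required to hit exactly one gene, and all names are hypothetical.

```python
from typing import Dict


def _count_hits(segment: str, sequences: Dict[str, str]) -> int:
    """Count exact occurrences (overlaps included) across all sequences."""
    total = 0
    for seq in sequences.values():
        pos = seq.find(segment)
        while pos != -1:
            total += 1
            pos = seq.find(segment, pos + 1)
    return total


def second_aspect_filter(seq_a: str, seq_b: str,
                         gene_library: Dict[str, str],
                         genome: Dict[str, str]) -> str:
    """Classify one read under the second-aspect order of alignments."""
    # First alignment: both segments must align uniquely to the genome.
    if _count_hits(seq_a, genome) != 1 or _count_hits(seq_b, genome) != 1:
        return "rejected_genome"
    # Second alignment: the segments must hit different candidate genes.
    genes_a = {g for g, s in gene_library.items() if seq_a in s}
    genes_b = {g for g, s in gene_library.items() if seq_b in s}
    if len(genes_a) != 1 or len(genes_b) != 1 or genes_a == genes_b:
        return "rejected_gene_library"
    return "accepted"
```

Both orders apply the same two conditions, so they keep the same reads; what changes is which library each read is searched against first.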

Embodiment 1

Data set Data-0001, obtained by actual probe capture followed by paired-end sequencing on a NextSeq500 sequencer with a read length of 151 bp, was selected and analyzed with the method of the first aspect of the invention. The sample of this embodiment is a positive sample verified as ALK-EML4 by the ARMS method.

1. Extract the sequences of the ALK gene and the EML4 gene from the human genome as the target gene reference sequences.

2. Merge the paired-end sequencing data of sample Data-0001 into single-end sequencing data, trim 10 bp from each end of every read, and retain 35 bp at each end of the remaining read. The two 35-bp sequences, 70 bp in total, serve as the read data for the subsequent analysis.

3. Align the processed data from step 2, using the BLAST algorithm, against the first reference sequence library containing the ALK gene and the EML4 gene. If neither of the two sequences shows a mismatch in the alignment result (that is, the number of mismatches is 0) and the result shows that the two query sequences exactly match corresponding positions in the ALK gene and the EML4 gene respectively, take the query sequence pair as a first aligned sequence pair and label it ID1. If the number of mismatches in the first alignment result is 1 or more, or the sequences cannot be aligned, or the two query sequences match exactly but both matched positions lie in the ALK gene or both lie in the EML4 gene, terminate the alignment of that raw read.

4. Align the sequence pairs labeled ID1 from step 3, using the SOAP algorithm, against the human genome sequence library (from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/). If each sequence of an ID1 sequence pair exactly matches the corresponding sequence in the human genome sequence library, and the exactly matching sequence pair is a uniquely aligned sequence pair, label that ID1 sequence pair as having passed the second alignment.

If in this alignment result the number of mismatches is 1 or more, or the sequence pair with 0 mismatches is not a uniquely aligned sequence pair, terminate the alignment and remove the data corresponding to that ID1.

Repeat steps 3 and 4 above. If 2 or more labeled sequence pairs are obtained, the result is a positive structural variation, recorded as Se-SV; if the number is less than or equal to 1, the result is negative.

Embodiment 2

Gene structural variation was detected in the same way as in Embodiment 1, except that sample Data-0002 was used.

Reference example 1

The paired-end sequencing data of samples Data-0001 and Data-0002 were each analyzed with the BWA algorithm (Aln) and two split-read methods (Split reads); the results are denoted Aln-S1 and Aln-S2, respectively.

Reference example 2

The paired-end sequencing data of samples Data-0001 and Data-0002 were each analyzed with the BWA algorithm (Mem) and the same two split-read methods (Split reads); the results are denoted Mem-S1 and Mem-S2.

The detection results and run times of the method of the invention and of the methods of Reference examples 1 and 2 were compared; the statistics are shown in the following table:

Table 1. Comparison of the detection results of the three methods

It will be apparent to those skilled in the art that various improvements and changes can be made to the specific embodiments described in this specification without departing from the scope or spirit of the invention. Other embodiments derived from this specification will be apparent to the skilled person. The specification and embodiments are merely exemplary.

Claims

1. A method for detecting target gene structural variation based on second-generation sequencing data, comprising the following steps:

(1) in a set consisting of multiple raw sequencing reads, trimming m bases from the 5' end and from the 3' end of each raw sequencing read, then taking n bases from each of the 5' end and the 3' end of the trimmed sequence to form query sequence A and query sequence B, wherein m is an integer from 0 to 20 and n is an integer from 27 to 50;

(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using a whole-genome sequence as a second reference sequence library, wherein the total sequence length of the first reference sequence library is less than the total sequence length of the second reference sequence library;

if query sequences A and B exactly match sequences of different target genes in the first reference sequence library, taking query sequences A and B as a first aligned sequence pair,

if query sequences A and B cannot exactly match sequences in the first reference sequence library, or query sequences A and B exactly match sequences within the same gene in the first reference sequence library, terminating the alignment and removing query sequences A and B;

if each sequence of the first aligned sequence pair exactly matches the corresponding sequence in the second reference sequence library, and the exactly matching sequence pair is a uniquely aligned sequence pair, taking the first aligned sequence pair as a second aligned sequence pair,

if in the second alignment result the number of mismatches is 1 or more, or the sequence pair with 0 mismatches is not a uniquely aligned sequence pair, terminating the alignment and removing the first aligned sequence pair.

2. The method according to claim 1, including repeating steps (3) and (4).

3. The method according to claim 2, further comprising determining that the target gene structural variation is present when the number x of second aligned sequence pairs is 2 or more, and otherwise determining that the target gene structural variation is absent.

4. The method according to claim 1, wherein the target gene structural variation includes at least one of gene fusion, gene inversion and gene translocation.

5. The method according to claim 1, wherein the length p of each raw sequencing read in the set consisting of multiple raw sequencing reads is 75-350 bp, and p > (n+m) × 2.

6. The method according to claim 1, wherein the number of candidate target genes is 2-100.

7. The method according to claim 1, wherein the first alignment uses the BLAST algorithm and the second alignment uses the SOAP algorithm.

8. The method according to claim 1, wherein the multiple raw sequencing reads are merged sequences of paired-end sequencing reads, or single-end sequencing reads.

9. A method for detecting target gene structural variation based on second-generation sequencing data, comprising the following steps:

if query sequences A and B each exactly match a sequence in the second reference sequence library, and the exactly matching sequence pair is a uniquely aligned sequence pair, taking query sequences A and B as a first aligned sequence pair,

if in the first alignment result the number of mismatches is 1 or more, or the sequence pair with 0 mismatches is not a uniquely aligned sequence pair, terminating the alignment and removing query sequences A and B;

if the sequences of the first aligned sequence pair exactly match sequences of different target genes in the first reference sequence library, taking the first aligned sequence pair as a second aligned sequence pair,

if the first aligned sequence pair cannot exactly match sequences in the first reference sequence library, or exactly matches sequences within the same gene in the first reference sequence library, terminating the alignment and removing the first aligned sequence pair.

10. The method according to claim 9, including repeating steps (3) and (4).

11. The method according to claim 10, further comprising determining that the target gene structural variation is present when the number x of second aligned sequence pairs is 2 or more, and otherwise determining that the target gene structural variation is absent.

12. The method according to claim 9, wherein the target gene structural variation includes at least one of gene fusion, gene inversion and gene translocation.

13. The method according to claim 9, wherein the length p of each raw sequencing read in the set consisting of multiple raw sequencing reads is 75-350 bp, and p > (n+m) × 2.

14. The method according to claim 9, wherein the number of candidate target genes is 2-100.

15. The method according to claim 9, wherein the first alignment uses the SOAP algorithm and the second alignment uses the BLAST algorithm.

16. The method according to claim 9, wherein the multiple raw sequencing reads are merged sequences of paired-end sequencing reads, or single-end sequencing reads.