CN108073791B - Method for detecting structural variation of target genes based on next-generation sequencing data - Google Patents
- Publication number
- CN108073791B (CN201711320249.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Abstract
The present invention discloses a method for detecting structural variation of target genes based on next-generation sequencing data. The method starts directly from the raw sequencing data and detects gene structural variation through fast, simple alignments, without the parameter tuning that complex alignment algorithms require. It markedly shortens detection time, and because single-end sequencing alone is sufficient for the analysis, it reduces the data volume and therefore the sequencing cost, while also achieving higher sensitivity and specificity.
Description
Technical field
The present invention relates generally to the field of genetic testing, and in particular to a method for detecting structural variation of target genes based on next-generation sequencing data.
Background art
As sequencing costs continue to fall, more and more methods and techniques use next-generation sequencing to predict structural variations (SVs) at the genomic DNA level. At present there are four main approaches to detecting structural variation from next-generation sequencing data:
The first is the read-depth method. For a single sample, variation is detected from the depth distribution of the sample's short sequences (reads) across a reference genome; for paired samples (case-control), regions of loss and of duplication or amplification are identified by comparing the two samples. Its drawback is that it is strongly affected by the amplification bias and sequencing bias introduced in the experimental workflow, so the results are often not very accurate.
The second is the paired-end sequencing method. It can detect structural variations such as insertions, deletions, inversions and translocations of large fragments, but it is limited by the standard deviation of the sequencing insert length (the insert is the DNA fragment produced by fragmentation at the library construction stage before sequencing), and most such methods depend too heavily on the alignment algorithm, whose parameter settings strongly affect the results.
The third is the split-read method. It can locate the breakpoints of a structural variation precisely from soft-clipped reads in the alignment, but besides being strongly affected by repetitive sequences in the genome, it likewise depends too heavily on the alignment algorithm and its parameter settings.
The fourth is the de novo assembly method. Although it can detect structural variation more directly, assembling the short reads of next-generation sequencing remains difficult because of repetitive regions in the genome, and the cost also rises considerably.
In summary, methods that predict structural variation at the genomic DNA level with next-generation sequencing still need greater speed and higher sensitivity and specificity.
Summary of the invention
To solve at least some of the technical problems in the prior art, the present invention provides a method for detecting structural variation of target genes based on next-generation sequencing data. The method allows gene structural variation to be analyzed simply and quickly, and the analysis results are reliable, with high sensitivity and high specificity. Specifically, the present invention includes the following.
A first aspect of the present invention provides a method for detecting structural variation of target genes based on next-generation sequencing data, comprising the following steps:
(1) in a set consisting of multiple raw sequencing reads, trimming m bases from each of the 5' end and the 3' end of every raw read, then taking n bases from each of the 5' end and the 3' end of the trimmed read to form query sequence A and query sequence B (the sequences to be aligned), where m is an integer from 0 to 20 and n is an integer from 27 to 50;
(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using the whole-genome sequence as a second reference sequence library, where the total sequence length of the first reference sequence library is less than that of the second reference sequence library;
(3) performing a first alignment of query sequences A and B, separately, against the sequences in the first reference sequence library;
if query sequences A and B perfectly match sequences of different target genes in the first reference sequence library, taking query sequences A and B as a first aligned sequence pair;
if query sequences A and B cannot perfectly match sequences in the first reference sequence library, or both perfectly match sequences within the same gene in the first reference sequence library, terminating the alignment and discarding query sequences A and B;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the second reference sequence library;
if each sequence of the first aligned sequence pair perfectly matches a sequence in the second reference sequence library, and the perfectly matching sequence pair is a uniquely aligned sequence pair, taking the first aligned sequence pair as a second aligned sequence pair;
if the second alignment result has 1 or more mismatches, or has 0 mismatches but the sequence pair is not a uniquely aligned sequence pair, terminating the alignment and discarding the first aligned sequence pair.
A second aspect of the present invention provides another method for detecting structural variation of target genes based on next-generation sequencing data, comprising the following steps:
(1) in a set consisting of multiple raw sequencing reads, trimming m bases from each of the 5' end and the 3' end of every raw read, then taking n bases from each of the 5' end and the 3' end of the trimmed read to form query sequence A and query sequence B (the sequences to be aligned), where m is an integer from 0 to 20 and n is an integer from 27 to 50;
(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using the whole-genome sequence as a second reference sequence library, where the total sequence length of the first reference sequence library is less than that of the second reference sequence library;
(3) performing a first alignment of query sequences A and B, separately, against the sequences in the second reference sequence library;
if query sequences A and B each perfectly match a sequence in the second reference sequence library, and the perfectly matching sequence pair is a uniquely aligned sequence pair, taking query sequences A and B as a first aligned sequence pair;
if the first alignment result has 1 or more mismatches, or has 0 mismatches but the sequence pair is not a uniquely aligned sequence pair, terminating the alignment and discarding query sequences A and B;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the first reference sequence library;
if the sequences of the first aligned sequence pair perfectly match sequences of different target genes in the first reference sequence library, taking the first aligned sequence pair as a second aligned sequence pair;
if the first aligned sequence pair cannot perfectly match the corresponding sequences in the first reference sequence library, or perfectly matches sequences within the same gene in the first reference sequence library, terminating the alignment and discarding the first aligned sequence pair.
The methods of the first and second aspects of the invention include repeating steps (3) and (4). Preferably, they further include determining that the target gene structural variation is present when the number of second aligned sequence pairs is 2 or more, and otherwise determining that the target gene structural variation is absent.
In the methods of the first and second aspects, the target gene structural variation includes at least one of gene fusion, gene inversion and gene translocation.
In the methods of the first and second aspects, the length p of each raw sequencing read in the set consisting of multiple raw sequencing reads is 75-350 bp, and p > (n + m) × 2.
In the methods of the first and second aspects, the number of candidate target genes is 2-100.
In the method of the first aspect, the first alignment uses the BLAST algorithm and the second alignment uses the SOAP algorithm.
In the method of the second aspect, the first alignment uses the SOAP algorithm and the second alignment uses the BLAST algorithm.
In the methods of the first and second aspects, the multiple raw sequencing reads are sequences obtained by consolidating paired-end sequencing data, or single-end sequencing reads.
The method of the invention starts directly from the raw sequencing data and detects gene structural variation through fast, simple alignments, without the parameter tuning that complex alignment algorithms require. It markedly shortens detection time, and because single-end sequencing alone is sufficient for the analysis, it reduces the data volume and therefore the sequencing cost, while also achieving higher sensitivity and specificity.
Detailed description of embodiments
Various exemplary embodiments of the invention are now described in detail. This detailed description should not be regarded as limiting the invention, but as a more detailed description of certain aspects, features and embodiments of the invention.
It should be understood that the terms used herein serve only to describe particular embodiments and are not intended to limit the invention. In addition, every numerical range in the invention is to be understood as specifically disclosing its upper and lower limits and every intermediate value between them. Every smaller range between any stated value or intermediate value within a stated range and any other stated value or intermediate value within that range is also included in the invention. The upper and lower limits of these smaller ranges may independently be included in or excluded from the range.
Unless otherwise stated, all technical and scientific terms used herein have the meanings commonly understood by a person of ordinary skill in the field of the invention. Although the invention describes only preferred methods and materials, any methods and materials similar or equivalent to those described herein may also be used in practicing or testing the invention. All documents mentioned in this specification are incorporated by reference to disclose and describe the methods and/or materials connected with those documents. In case of conflict with any incorporated document, the content of this specification prevails.
As used herein, "comprising", "including", "having", "containing" and the like are open-ended terms meaning including but not limited to. As used herein, "and/or" includes any one of the listed items or any combination of them.
The method of the first aspect of the invention detects structural variation of target genes based on next-generation sequencing data and comprises the following 4 steps:
(1) in a set consisting of multiple raw sequencing reads, trimming m bases from each of the 5' end and the 3' end of every raw read, then taking n bases from each of the 5' end and the 3' end of the trimmed read to form query sequence A and query sequence B (the sequences to be aligned), where m is an integer from 0 to 20 and n is an integer from 27 to 50;
(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using the human genome sequence as a second reference sequence library, where the total sequence length of the first reference sequence library is less than that of the second reference sequence library;
(3) performing a first alignment of query sequences A and B, separately, against the sequences in the first reference sequence library;
if query sequences A and B perfectly match sequences of different genes in the first reference sequence library, taking query sequences A and B as a first aligned sequence pair;
if query sequences A and B cannot perfectly match sequences in the first reference sequence library, or both perfectly match sequences within the same gene in the first reference sequence library, terminating the alignment and discarding query sequences A and B;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the second reference sequence library;
if each sequence of the first aligned sequence pair perfectly matches a sequence in the second reference sequence library, and the perfectly matching sequence pair is a uniquely aligned sequence pair, taking the first aligned sequence pair as a second aligned sequence pair;
if the second alignment result has 1 or more mismatches, or has 0 mismatches but the sequence pair is not a uniquely aligned sequence pair, terminating the alignment and discarding the first aligned sequence pair.
Each step is described in detail below.
Step (1):
In step (1), the raw sequencing reads (sometimes called original sequences) are next-generation sequencing reads. The length p of each read is generally 75-350 bp, preferably 100-200 bp. The raw reads include single-end sequencing reads or sequences obtained by consolidating paired-end sequencing data.
In the method of the invention, each raw read must be processed, which includes trimming m bases from each of its 5' and 3' ends, where m is an integer from 0 to 20, preferably 5-10, to improve detection sensitivity. In certain embodiments m may be 0, that is, the 5'-end and 3'-end sequences of the raw read are used directly.
Two segments of the trimmed raw read are then taken as query sequences A and B (the sequences to be aligned). The lengths of A and B are not particularly limited; generally each is n bases, where n is an integer from 27 to 50, preferably from 30 to 45. A and B may have the same or different lengths. In certain embodiments A and B have the same length, 32-36 bp. If n is too short, the accuracy of the detection result decreases; if n is too long, detection slows down. With n between 27 and 50, good accuracy is obtained while detection speed is maintained.
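The trimming and windowing of step (1) can be sketched as follows. This is only an illustrative sketch: the function name `make_query_pair` and its default values are assumptions made for the example, and the length check applies the condition p > (n + m) × 2 stated in the summary.

```python
def make_query_pair(read: str, m: int = 10, n: int = 35) -> tuple[str, str]:
    """Trim m bases from both ends of a read, then take n bases from each end.

    Returns (A, B): the 5'-end and 3'-end windows of the trimmed read.
    The name and defaults are illustrative, not taken from the patent.
    """
    if not 0 <= m <= 20:
        raise ValueError("m must be an integer from 0 to 20")
    if not 27 <= n <= 50:
        raise ValueError("n must be an integer from 27 to 50")
    if len(read) <= (n + m) * 2:
        raise ValueError("read too short: the method requires p > (n + m) * 2")
    trimmed = read[m:len(read) - m]   # drop m bases at the 5' and 3' ends
    return trimmed[:n], trimmed[-n:]  # n bases from each end of what remains


# Toy 151-base read built from homopolymer blocks so the windows are easy to see.
toy_read = "A" * 10 + "C" * 35 + "G" * 61 + "T" * 35 + "A" * 10
seq_a, seq_b = make_query_pair(toy_read, m=10, n=35)
```

With this toy read, `seq_a` is the block of 35 C bases and `seq_b` the block of 35 T bases; the 61 bases in the middle are not used.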
Note that a raw sequencing read in the invention is the sequence of the nucleic acid in the sample and does not include tags or labels introduced for sequencing, such as library sequences.
In the invention, the set consisting of multiple raw sequencing reads may be the set of all raw reads from the next-generation sequencing run or a set of some of them. The composition of the set is not particularly limited as long as the set covers the sequences of the target genes. The set of all next-generation sequencing reads of the sample is preferred.
Step (2):
Step (2) selects the reference sequence libraries, namely the first reference sequence library and the second reference sequence library.
The first reference sequence library consists of the sequences of multiple candidate target genes, where the candidate target genes are genes that include the target genes. The number of candidate target genes is preferably 2 or more, more preferably 2-100. Too many genes slow detection down; too few give a high detection speed but a narrow range of detectable target genes, which limits the scope of application. For example, to detect a fusion of the two genes ALK and EML4, the first reference sequence library must include at least ALK and EML4. Preferably, the sequence of each candidate target gene is stored in the first reference sequence library in a form that allows the genes to be distinguished from one another. When query sequence A and/or B matches perfectly, the corresponding gene can then be identified easily, which in turn identifies which target genes are involved in the structural variation.
The second reference sequence library consists of the complete sequence of the whole genome of the species, i.e. the whole-genome sequence. A known whole-genome database may be used. For example, when the target genes are human genes, the entire human genome sequence may be used as the second reference sequence library. It may be built from any currently known data source; in an exemplary scheme it comes from the sequence library at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/.
In the invention, the total sequence length of the first reference sequence library is less than that of the second reference sequence library. The total sequence length of the first library is preferably 1/10 to 1/10000 of that of the second library, more preferably 1/10 to 1/1000, and still more preferably 1/100 to 1/1000.
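As a rough illustration of the two libraries and the size relationship between them, the sketch below stores each library as a dictionary keyed by name, so that genes in the first library stay distinguishable, and computes the ratio of total lengths. The gene names, sequences and the function `library_size_ratio` are toy assumptions for the example, not data from the patent.

```python
def library_size_ratio(first_library: dict[str, str],
                       second_library: dict[str, str]) -> float:
    """Return total length of the first library divided by that of the second.

    Raises ValueError if the first library is not strictly smaller, since the
    method requires the candidate-gene library to be the smaller one.
    """
    first_total = sum(len(seq) for seq in first_library.values())
    second_total = sum(len(seq) for seq in second_library.values())
    if first_total >= second_total:
        raise ValueError("first library must be smaller than the second")
    return first_total / second_total


# Toy stand-ins: two named "genes" (40 bases in total) and a 4000-base "genome".
toy_first = {"GENE_A": "ACGT" * 5, "GENE_B": "TTGCA" * 4}
toy_second = {"chr_toy": "ACGT" * 1000}
ratio = library_size_ratio(toy_first, toy_second)
in_preferred_range = 1 / 10000 <= ratio <= 1 / 10
```

Here the ratio is 40 / 4000, which lies inside the preferred 1/10 to 1/10000 range.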
Step (3):
Step (3) uses the first alignment to find quickly the first aligned sequence pairs that correspond to potential structural variations. Query sequences A and B are each aligned against the sequences in the first reference sequence library. If A and B perfectly match sequences of different genes in the first reference sequence library ("perfect match" in the invention means 0 mismatches), A and B are taken as a first aligned sequence pair. If A and B cannot perfectly match the corresponding sequences in the first reference sequence library ("cannot perfectly match" or "imperfect match" in the invention means 1 or more mismatches, or no alignment at all), or if A and B perfectly match sequences within the same gene in the first reference sequence library, the alignment is terminated and A and B are discarded.
In the invention, the first alignment may use any known algorithm. Preferably, it uses the BLAST algorithm.
Sequences that contain a structural variation of a target gene make up a very small part of whole-genome data, and next-generation sequencing data contain a very large number of raw reads. Aligning every raw read against the whole genome would greatly increase alignment time and make detection very slow. Because the first reference sequence library is small, as described above, the first alignment of each raw read takes much less time and excludes most raw reads that contain no structural variation, so that potential variation-containing sequences are found quickly.
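The decision rule of step (3) can be sketched with plain exact substring search standing in for the aligner. This is not BLAST: it is a simplified stand-in that searches the forward strand only, and it additionally assumes that each query must hit exactly one gene. The function names and the toy gene sequences are hypothetical.

```python
from typing import Optional


def genes_with_exact_match(query: str, gene_library: dict[str, str]) -> set[str]:
    """Names of genes whose sequence contains the query with 0 mismatches."""
    return {gene for gene, seq in gene_library.items() if query in seq}


def first_alignment(seq_a: str, seq_b: str,
                    gene_library: dict[str, str]) -> Optional[tuple[str, str]]:
    """Return (gene of A, gene of B) if A and B hit different genes, else None.

    Assumption for this sketch: a query hitting zero genes or several genes
    is discarded, as is a pair whose two hits fall in the same gene.
    """
    hits_a = genes_with_exact_match(seq_a, gene_library)
    hits_b = genes_with_exact_match(seq_b, gene_library)
    if len(hits_a) != 1 or len(hits_b) != 1:
        return None
    gene_a, gene_b = next(iter(hits_a)), next(iter(hits_b))
    return (gene_a, gene_b) if gene_a != gene_b else None


# Toy library; real queries would be 27-50 bases long.
toy_genes = {
    "GENE_A": "ACGTACGGTCAAGCTTAGGCTA",
    "GENE_B": "TTGACCGGATATCCGGTTAACC",
}
kept = first_alignment("GGTCAAGC", "GGATATCC", toy_genes)     # different genes
dropped = first_alignment("GGTCAAGC", "GCTTAGGC", toy_genes)  # same gene
```

In the toy data, `kept` names the two different genes, while `dropped` is `None` because both windows fall inside GENE_A.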
Step (4):
Step (4) uses the second alignment to verify the first alignment result and thereby improve detection accuracy. The first aligned sequence pair is aligned against the sequences in the second reference sequence library. If each sequence of the first aligned sequence pair perfectly matches a corresponding sequence in the second reference sequence library, and the perfectly matching pair is a uniquely aligned sequence pair (in the invention, "uniquely aligned" means aligning to the reference sequence once and only once), the first aligned sequence pair is taken as a second aligned sequence pair. If the second alignment result has 1 or more mismatches, or has 0 mismatches but the pair is not a uniquely aligned sequence pair, the alignment is terminated and the first aligned sequence pair is discarded.
In the invention, the second alignment may use any known algorithm. Preferably, it uses the SOAP algorithm.
Because gene sequences are complex, an alignment result obtained from the first reference sequence library alone may contain errors; checking it against the second reference sequence library greatly improves the accuracy of the detection result.
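The uniqueness check of step (4) can be illustrated by counting exact occurrences of each sequence across the whole second library. Again this is a simplified stand-in rather than SOAP: it scans the forward strand only and would be far too slow for a real genome. The function names and the toy "chromosomes" are assumptions for the example.

```python
def count_exact_hits(query: str, genome: dict[str, str]) -> int:
    """Count exact (0-mismatch) occurrences of query across all sequences."""
    total = 0
    for seq in genome.values():
        start = seq.find(query)
        while start != -1:
            total += 1
            start = seq.find(query, start + 1)  # allow overlapping hits
    return total


def second_alignment(pair: tuple[str, str], genome: dict[str, str]) -> bool:
    """True if both sequences of the pair occur exactly once in the genome."""
    return all(count_exact_hits(query, genome) == 1 for query in pair)


# Toy genome in which "GGATATCC" occurs twice on chr2.
toy_genome = {
    "chr1": "ACGTACGGTCAAGCTTAGGCTA",
    "chr2": "TTGACCGGATATCCGGTTAACCGGATATCC",
}
unique_pair = second_alignment(("GGTCAAGC", "CCGGTTAA"), toy_genome)
repeated_pair = second_alignment(("GGTCAAGC", "GGATATCC"), toy_genome)
```

The first pair passes because each window occurs once; the second pair is rejected because one window occurs twice, which is the non-unique case the step is meant to remove.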
It should be noted that in certain embodiments, can analyze each original series in set, but this is simultaneously
Do not mean that each original series will pass through step (1)-(4).Most of original series when step (3) by being removed.
In addition, small part original series can be removed at step (4).It in certain embodiments, can be only former to the part in set
Beginning sequence is analyzed.For example, when obtaining second aligned sequences to (for example, 2 or more the second aligned sequences to),
It terminates and compares, and other original series are no longer analyzed.
In certain embodiments, the method for the present invention includes step (3) and (4) are repeated, to obtain multiple second ratios
To sequence pair.In certain embodiments, if there is upper second aligned sequences pair, then it can determine that the structure there are gene
Variation.Preferably, when the quantity x of second aligned sequences pair is 2 or more, determine that there are target gene structure changes
It is different, otherwise determine that the target gene structure variation is not present.When x is 2 or more, show in the original series for there are 2 or more
There are structure variations, to further increase the accuracy of detection.
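The threshold rule and the optional early termination described above might be sketched as below. The function `variation_detected` and the callback `passes_both_alignments` are hypothetical names; the callback stands for whatever implementation of steps (3) and (4) is used.

```python
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str]


def variation_detected(pairs: Iterable[Pair],
                       passes_both_alignments: Callable[[Pair], bool],
                       min_pairs: int = 2) -> bool:
    """Report a structural variation once min_pairs pairs pass both alignments.

    Stops reading further pairs as soon as the threshold is reached, which
    mirrors the embodiment where the remaining reads are not analyzed.
    """
    confirmed = 0
    for pair in pairs:
        if passes_both_alignments(pair):
            confirmed += 1
            if confirmed >= min_pairs:
                return True
    return False


# Toy check: pairs whose first element starts with "sv" count as confirmed.
examined = []

def toy_check(pair: Pair) -> bool:
    examined.append(pair)
    return pair[0].startswith("sv")

toy_pairs = [("sv1", "x"), ("n1", "y"), ("sv2", "z"), ("sv3", "w")]
positive = variation_detected(toy_pairs, toy_check)
```

In this toy run the function returns after the third pair, so the fourth is never examined.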
Although the steps are shown herein with numerical labels such as steps (1)-(4), the numbering serves only to distinguish the steps and does not indicate an order among them.
The method according to the second aspect of the invention is another method for detecting structural variation of target genes based on next-generation sequencing data, comprising the following 4 steps:
(1) the same as step (1) of the first aspect of the invention;
(2) the same as step (2) of the first aspect of the invention;
(3) performing a first alignment of query sequences A and B, separately, against the sequences in the second reference sequence library;
if query sequences A and B each perfectly match a sequence in the second reference sequence library, and the perfectly matching sequence pair is a uniquely aligned sequence pair, taking query sequences A and B as a first aligned sequence pair;
if the first alignment result has 1 or more mismatches, or has 0 mismatches but the sequence pair is not a uniquely aligned sequence pair, terminating the alignment and discarding query sequences A and B;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the first reference sequence library;
if the sequences of the first aligned sequence pair perfectly match sequences of different target genes in the first reference sequence library, taking the first aligned sequence pair as a second aligned sequence pair;
if the first aligned sequence pair cannot perfectly match the corresponding sequences in the first reference sequence library, or perfectly matches sequences within the same gene in the first reference sequence library, terminating the subsequent second alignment and discarding the first aligned sequence pair.
In the method of the second aspect, apart from the differences in the first aligned sequence pair and the second aligned sequence pair described above, the remaining names and terms are the same as in the method of the first aspect and are not repeated here.
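The difference between the two aspects is only the order in which the two checks are applied. The toy sketch below, with hypothetical names and made-up membership sets in place of real alignments, shows that under deterministic pass/fail checks the surviving pairs are the same in either order, while the number of pairs reaching the second check differs.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def run_in_order(pairs: Sequence[Pair],
                 checks: Sequence[Callable[[Pair], bool]]
                 ) -> Tuple[List[Pair], List[int]]:
    """Apply checks one after another; return survivors and per-stage counts."""
    survivors = list(pairs)
    counts = []
    for keep in checks:
        survivors = [pair for pair in survivors if keep(pair)]
        counts.append(len(survivors))
    return survivors, counts


# Toy membership sets standing in for the two alignment outcomes.
different_genes = {("a1", "b1"), ("a2", "b2")}
unique_in_genome = {("a1", "b1"), ("a3", "b3"), ("a4", "b4")}
gene_check = lambda pair: pair in different_genes
genome_check = lambda pair: pair in unique_in_genome

toy_pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
first_aspect = run_in_order(toy_pairs, [gene_check, genome_check])
second_aspect = run_in_order(toy_pairs, [genome_check, gene_check])
```

Both orders leave only `("a1", "b1")`; in this toy data the gene-library check first passes 2 pairs to the second stage, whereas the genome check first passes 3.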
Example 1
Paired-end sequencing data Data-0001, obtained by actual probe capture and sequencing on a NextSeq500 sequencer with a read length of 151 bp, were selected and analyzed with the method of the first aspect of the invention. The sample in this example was a positive sample verified by the ARMS method as ALK-EML4.
1. The sequences of the ALK gene and the EML4 gene were extracted from the human genome as the target gene reference sequences.
2. The paired-end sequencing data of sample Data-0001 were merged into single-end sequencing data; 10 bp were trimmed from each end of every read, and 35 bp were kept from each end of the remaining read, giving two 35 bp segments (70 bp in total) as the sequence data for subsequent analysis.
3. The processed data from step 2 were aligned with the BLAST algorithm against the first reference sequence library containing the ALK gene and the EML4 gene. If the alignment results of both segments contained no mismatches (0 mismatches) and showed that the two query sequences perfectly matched corresponding positions in the ALK gene and the EML4 gene respectively, the query sequence pair was taken as a first aligned sequence pair and labeled ID1. If the first alignment result had 1 or more mismatches or could not be aligned, or if the two query sequences matched perfectly but both matched positions lay in the ALK gene or both in the EML4 gene, alignment of that raw read was terminated.
4. The sequence pairs labeled ID1 in step 3 were aligned with the SOAP algorithm against the human genome sequence library (from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/). If each sequence of an ID1 pair perfectly matched a corresponding sequence in the human genome sequence library, and the perfectly matching pair was a uniquely aligned sequence pair, the ID1 pair was marked as a second aligned sequence pair.
If this alignment result had 1 or more mismatches, or had 0 mismatches but the pair was not a uniquely aligned sequence pair, the alignment was terminated and the data of that ID1 pair were removed.
Steps 3 and 4 were repeated. If 2 or more marked sequence pairs were obtained, the result was positive for structural variation and was labeled Se-SV; if 1 or fewer were obtained, the result was negative.
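As a quick arithmetic check of the parameters used in this example, the sketch below applies the condition p > (n + m) × 2 to a 151 bp read with m = 10 and n = 35. It assumes that each read still has the stated 151 bp length after the paired-end data are merged, which the text does not spell out; the function name `window_layout` is hypothetical.

```python
def window_layout(p: int, m: int, n: int) -> dict[str, int]:
    """Work out how a read of length p is split for given m and n."""
    if p <= (n + m) * 2:
        raise ValueError("the method requires p > (n + m) * 2")
    trimmed_length = p - 2 * m                 # after removing m bases per end
    return {
        "trimmed_length": trimmed_length,
        "query_total": 2 * n,                  # the two n-base windows A and B
        "unused_middle": trimmed_length - 2 * n,
    }


# Parameters of this example, assuming a 151 bp read.
layout = window_layout(p=151, m=10, n=35)
```

Under that assumption the two windows cover 70 bp in total, matching the 70 bp stated in step 2, and the 61 bp between them are not aligned.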
Example 2
Gene structural variation was detected in the same way as in Example 1, except that sample Data-0002 was used.
Reference Example 1
The paired-end sequencing data obtained from samples Data-0001 and Data-0002 were analyzed with the BWA algorithm (Aln) and two split-read methods; the results are denoted Aln-S1 and Aln-S2, respectively.
Reference Example 2
The data obtained from samples Data-0001 and Data-0002 were analyzed with the BWA algorithm (Mem) and the same two split-read methods; the results are denoted Mem-S1 and Mem-S2.
The detection results and run times of the method of the invention and of Reference Examples 1 and 2 were compared; the statistics are shown in the table below:
Table 1. Comparison of the detection results of the three methods
Various improvements and changes may be made to the specific embodiments described herein without departing from the scope or spirit of the invention, as will be apparent to those skilled in the art. Other embodiments derived from the specification of the invention will also be apparent to a skilled person. The specification and examples of this application are merely exemplary.
Claims (16)
1. A method for detecting structural variation of a target gene based on second-generation sequencing data, comprising the following steps:
(1) in a set consisting of multiple raw sequencing reads, trimming m bases from each of the 5′ end and the 3′ end of every raw sequencing read, then taking n bases from each of the 5′ end and the 3′ end of the trimmed read to form sequence A to be aligned and sequence B to be aligned, wherein m is an integer from 0 to 20 and n is an integer from 27 to 50;
(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using the whole genome sequence as a second reference sequence library, wherein the total length of the sequences in the first reference sequence library is less than the total length of the sequences in the second reference sequence library;
(3) performing a first alignment of sequences A and B, respectively, against the sequences in the first reference sequence library;
if sequences A and B exactly match sequences of different target genes in the first reference sequence library, respectively, taking sequences A and B as a first aligned sequence pair;
if sequences A and B cannot exactly match sequences in the first reference sequence library, or sequences A and B exactly match sequences within the same gene in the first reference sequence library, terminating the alignment and discarding sequences A and B;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the second reference sequence library;
if each sequence of the first aligned sequence pair exactly matches a corresponding sequence in the second reference sequence library, and the exactly matching sequence pair is a uniquely aligned sequence pair, taking the first aligned sequence pair as a second aligned sequence pair;
if, in the second alignment result, the number of mismatches is 1 or more, or the number of mismatches is 0 but the sequence pair is not a uniquely aligned sequence pair, terminating the alignment and discarding the first aligned sequence pair.
2. The method according to claim 1, comprising repeating steps (3) and (4).
3. The method according to claim 2, further comprising: when the number x of second aligned sequence pairs is 2 or more, determining that the target gene structural variation is present; otherwise, determining that the target gene structural variation is absent.
4. The method according to claim 1, wherein the target gene structural variation comprises at least one of gene fusion, gene inversion, and gene translocation.
5. The method according to claim 1, wherein the length p of each raw sequencing read in the set consisting of multiple raw sequencing reads is 75-350 bp, and p > (n + m) × 2.
6. The method according to claim 1, wherein the number of candidate target genes is 2-100.
7. The method according to claim 1, wherein the first alignment uses the BLAST algorithm and the second alignment uses the SOAP algorithm.
8. The method according to claim 1, wherein the multiple raw sequencing reads are merged sequences of paired-end sequencing data or single-end sequencing reads.
9. A method for detecting structural variation of a target gene based on second-generation sequencing data, comprising the following steps:
(1) in a set consisting of multiple raw sequencing reads, trimming m bases from each of the 5′ end and the 3′ end of every raw sequencing read, then taking n bases from each of the 5′ end and the 3′ end of the trimmed read to form sequence A to be aligned and sequence B to be aligned, wherein m is an integer from 0 to 20 and n is an integer from 27 to 50;
(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using the whole genome sequence as a second reference sequence library, wherein the total length of the sequences in the first reference sequence library is less than the total length of the sequences in the second reference sequence library;
(3) performing a first alignment of sequences A and B, respectively, against the sequences in the second reference sequence library;
if sequences A and B each exactly match a sequence in the second reference sequence library, and the exactly matching sequence pair is a uniquely aligned sequence pair, taking sequences A and B as a first aligned sequence pair;
if, in the first alignment result, the number of mismatches is 1 or more, or the number of mismatches is 0 but the sequence pair is not a uniquely aligned sequence pair, terminating the alignment and discarding sequences A and B;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the first reference sequence library;
if the sequences of the first aligned sequence pair exactly match sequences of different target genes in the first reference sequence library, respectively, taking the first aligned sequence pair as a second aligned sequence pair;
if the first aligned sequence pair cannot exactly match sequences in the first reference sequence library, or exactly matches sequences within the same gene in the first reference sequence library, terminating the alignment and discarding the first aligned sequence pair.
10. The method according to claim 9, comprising repeating steps (3) and (4).
11. The method according to claim 10, further comprising: when the number x of second aligned sequence pairs is 2 or more, determining that the target gene structural variation is present; otherwise, determining that the target gene structural variation is absent.
12. The method according to claim 9, wherein the target gene structural variation comprises at least one of gene fusion, gene inversion, and gene translocation.
13. The method according to claim 9, wherein the length p of each raw sequencing read in the set consisting of multiple raw sequencing reads is 75-350 bp, and p > (n + m) × 2.
14. The method according to claim 9, wherein the number of candidate target genes is 2-100.
15. The method according to claim 9, wherein the first alignment uses the SOAP algorithm and the second alignment uses the BLAST algorithm.
16. The method according to claim 9, wherein the multiple raw sequencing reads are merged sequences of paired-end sequencing data or single-end sequencing reads.
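The two-stage filtering recited in the independent claims can be illustrated with a minimal Python sketch. It is not the patented implementation: plain substring search stands in for the BLAST and SOAP aligners named in the dependent claims, only the forward strand is searched, and every function name (`split_read`, `count_hits`, `count_supporting_pairs`, `has_structural_variation`) and all toy sequences are hypothetical. The sketch also assumes that "different target genes" can be tested by checking that the sets of genes hit by sequences A and B are non-empty and disjoint.

```python
import random
from typing import Dict, Iterable, Optional, Tuple


def split_read(read: str, m: int, n: int) -> Optional[Tuple[str, str]]:
    """Step (1): trim m bases from both ends, then take n bases from each end."""
    if not (0 <= m <= 20 and 27 <= n <= 50):
        raise ValueError("expected 0 <= m <= 20 and 27 <= n <= 50")
    if len(read) <= (n + m) * 2:  # read length p must satisfy p > (n + m) x 2
        return None
    core = read[m:len(read) - m]
    return core[:n], core[-n:]  # sequence A, sequence B


def count_hits(seq: str, genome: Dict[str, str]) -> int:
    """Count exact (possibly overlapping) forward-strand occurrences of seq."""
    total = 0
    for chrom in genome.values():
        pos = chrom.find(seq)
        while pos != -1:
            total += 1
            pos = chrom.find(seq, pos + 1)
    return total


def count_supporting_pairs(reads: Iterable[str], genes: Dict[str, str],
                           genome: Dict[str, str], m: int = 0, n: int = 27,
                           gene_library_first: bool = True) -> int:
    """Count pairs that pass both alignment stages (the quantity x)."""

    def gene_stage(a: str, b: str) -> bool:
        # Assumption: A and B must each hit a gene, and never the same gene.
        hits_a = {g for g, s in genes.items() if a in s}
        hits_b = {g for g, s in genes.items() if b in s}
        return bool(hits_a) and bool(hits_b) and hits_a.isdisjoint(hits_b)

    def genome_stage(a: str, b: str) -> bool:
        # Exact match (0 mismatches) at exactly one genomic position each.
        return count_hits(a, genome) == 1 and count_hits(b, genome) == 1

    stages = [gene_stage, genome_stage]
    if not gene_library_first:  # reversed order: genome first, gene library second
        stages.reverse()

    x = 0
    for read in reads:
        pair = split_read(read, m, n)
        # all() short-circuits, so a failed first stage ends the comparison.
        if pair is not None and all(stage(*pair) for stage in stages):
            x += 1
    return x


def has_structural_variation(x: int) -> bool:
    """Decision rule: variation is called when x is 2 or more."""
    return x >= 2


# Toy data (hypothetical): two random 200 bp "genes" that also form the genome.
rng = random.Random(0)
gene_x = "".join(rng.choice("ACGT") for _ in range(200))
gene_y = "".join(rng.choice("ACGT") for _ in range(200))
genes = {"GENE_X": gene_x, "GENE_Y": gene_y}
genome = {"chr1": gene_x, "chr2": gene_y}
fusion_read = gene_x[100:130] + gene_y[50:80]  # 60 bp read spanning a junction
normal_read = gene_x[20:80]                    # 60 bp read inside one gene
reads = [fusion_read, fusion_read, normal_read]

x = count_supporting_pairs(reads, genes, genome)
print(x, has_structural_variation(x))
```

In this toy run the two junction-spanning reads are built to pass both stages, while the read lying within a single gene is dropped at the gene-library stage. Passing `gene_library_first=False` runs the stages in the order of the second independent claim; in this sketch both orders keep the same pairs and differ only in which filter is applied first.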
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711320249.0A CN108073791B (en) | 2017-12-12 | 2017-12-12 | Method based on two generation sequencing datas detection target gene structure variation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108073791A CN108073791A (en) | 2018-05-25 |
CN108073791B true CN108073791B (en) | 2019-02-05 |
Family
ID=62158249
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711320249.0A Active CN108073791B (en) | 2017-12-12 | 2017-12-12 | Method based on two generation sequencing datas detection target gene structure variation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108073791B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114807398A (en) * | 2018-10-30 | 2022-07-29 | 厦门极元科技有限公司 | Identification method and device, and typing method and device for salmonella in metagenome |
CN109727644B (en) * | 2018-11-12 | 2021-09-07 | 山东省医学科学院基础医学研究所 | Venn diagram making method and system based on microbial genome second-generation sequencing data |
CN109652513B (en) * | 2019-02-25 | 2022-08-23 | 元码基因科技(北京)股份有限公司 | Method and kit for accurately detecting individual mutation of liquid biopsy based on second-generation sequencing technology |
CN110085284B (en) * | 2019-04-29 | 2021-02-26 | 深圳大学 | SSD (solid State disk) -oriented gene comparison method and system |
CN110246543B (en) * | 2019-06-21 | 2021-02-26 | 元码基因科技(北京)股份有限公司 | Method and computer system for detecting copy number variation by using single sample based on second-generation sequencing technology |
CN110556165B (en) * | 2019-09-12 | 2022-03-18 | 浙江大学 | Method for rapidly identifying transgene or gene editing material and insertion site thereof |
CN110797087B (en) * | 2019-10-17 | 2020-11-03 | 南京医基云医疗数据研究院有限公司 | Sequencing sequence processing method and device, storage medium and electronic equipment |
CN117238376B (en) * | 2023-09-27 | 2024-04-30 | 上海序祯达生物科技有限公司 | Virus vector sequence analysis system and method based on second-generation sequencing technology |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001040301A2 (en) * | 1999-12-01 | 2001-06-07 | Akzo Nobel N.V. | A gene, disrupted in schizophrenia |
WO2007000663A3 (en) * | 2005-03-09 | 2007-03-29 | Methexis Genomics N V | Genetic diagnosis using multiple sequence variant analysis |
CN105512514A (en) * | 2014-09-23 | 2016-04-20 | 深圳华大基因股份有限公司 | MHC completion database, and establishment method and application thereof |
CN105543380A (en) * | 2016-01-27 | 2016-05-04 | 北京诺禾致源生物信息科技有限公司 | Method and device for detecting gene fusion |
CN105631242A (en) * | 2015-12-25 | 2016-06-01 | 中国农业大学 | Method for identifying transgenic events through whole genome sequencing data |
CN105989246A (en) * | 2015-01-28 | 2016-10-05 | 深圳华大基因研究院 | Variation detection method and device assembled based on genomes |
WO2017116139A1 (en) * | 2015-12-28 | 2017-07-06 | (주)신테카바이오 | System for analyzing bioactive variation using genetic variation information on individual's genome |
CN107180166A (en) * | 2017-04-21 | 2017-09-19 | 北京希望组生物科技有限公司 | A kind of full-length genome structure variation analysis method and system being sequenced based on three generations |
Also Published As
Publication number | Publication date |
---|---|
CN108073791A (en) | 2018-05-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108073791B (en) | Method based on two generation sequencing datas detection target gene structure variation | |
CN110010193B (en) | Complex structure variation detection method based on hybrid strategy | |
CN113160882B (en) | Pathogenic microorganism metagenome detection method based on third generation sequencing | |
US20200294628A1 (en) | Creation or use of anchor-based data structures for sample-derived characteristic determination | |
CN105631242A (en) | Method for identifying transgenic events through whole genome sequencing data | |
CN109545278B (en) | Method for identifying interaction between plant lncRNA and gene | |
CN108304694B (en) | Method for analyzing gene mutation based on second-generation sequencing data | |
CN106048009A (en) | Label joint for detection of ultra-low-frequency gene mutation and application of label joint | |
CN114121160B (en) | Method and system for detecting macrovirus group in sample | |
CN107267613A (en) | Sequencing data processing system and SMN gene detection systems | |
CN106355045B (en) | A kind of method and device based on amplification second filial sequencing small fragment insertion and deletion detection | |
CN110838340A (en) | Method for identifying protein biomarkers independent of database search | |
CN104611443A (en) | Molecular identification method of kiwi interspecific hybridization cultivar Jinyan | |
CN110875082B (en) | Microorganism detection method and device based on targeted amplification sequencing | |
CN109337997B (en) | Camellia polymorphism chloroplast genome microsatellite molecular marker primer and method for screening and discriminating kindred species | |
Kruppa et al. | Virus detection in high-throughput sequencing data without a reference genome of the host | |
CN113724783B (en) | Method for detecting and typing repetition number of short tandem repeat sequence | |
CN106021987B (en) | Ultralow frequency mutating molecule label clustering clustering algorithm | |
CN110111839A (en) | The method and its application of reads number are supported in mutation in a kind of accurate quantification tumour standard items | |
CN105603081B (en) | Non-diagnosis-purpose qualitative and quantitative detection method for intestinal microorganisms | |
CN116312779A (en) | Method and apparatus for detecting sample contamination and identifying sample mismatch | |
JP5403563B2 (en) | Gene identification method and expression analysis method in comprehensive fragment analysis | |
CN107665290A (en) | A kind of method and apparatus of data processing | |
CN108304693A (en) | Utilize the method for high-flux sequence data analysis Gene Fusion | |
CN108676906B (en) | SSR locus of corn chloroplast genome and application of SSR locus in variety identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20180807 Address after: 215100 402 unit of Biotechnology Park, No. 218 Sang Tian street, Wuzhong District, Suzhou, Jiangsu, China. 2 Applicant after: Code gene technology (Suzhou) Co., Ltd. Address before: 100102 Chaoyang District, Beijing Guang Shun North Street 5, A 4 area of fusion power. Applicant before: Meta code gene technology (Beijing) Limited by Share Ltd |
GR01 | Patent grant | |