CN108073791B - Method for detecting structural variation of a target gene based on second-generation sequencing data - Google Patents

Method for detecting structural variation of a target gene based on second-generation sequencing data

Info

Publication number
CN108073791B
Authority
CN
China
Prior art keywords
sequence
library
sequences
compared
reference sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711320249.0A
Other languages
Chinese (zh)
Other versions
CN108073791A (en)
Inventor
郎继东
田埂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Code gene technology (Suzhou) Co., Ltd.
Original Assignee
Code Gene Technology (suzhou) Co Ltd
Application filed by Code Gene Technology (suzhou) Co Ltd
Priority to CN201711320249.0A
Publication of CN108073791A
Application granted
Publication of CN108073791B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention discloses a method for detecting structural variation of a target gene based on second-generation sequencing data. The method starts directly from raw sequencing data and detects gene structural variation through fast, simple alignment, which removes the need to tune the parameters of complex alignment algorithms. It markedly shortens detection time, and it can complete the analysis with single-end sequencing alone, which reduces data volume and therefore sequencing cost. It also achieves higher sensitivity and specificity in the detection results.

Description

Method for detecting structural variation of a target gene based on second-generation sequencing data
Technical field
The present invention relates generally to the field of genetic testing, and in particular to a method for detecting structural variation of a target gene based on second-generation sequencing data.
Background
As sequencing costs continue to fall, more and more methods use second-generation sequencing to detect structural variations (SVs) at the genomic DNA level. There are currently four main approaches to detecting structural variation from second-generation sequencing data:
The first is the read-depth method. For a single sample, variation is detected from the depth distribution of the sample's short sequences (reads) on a reference genome; for paired samples (case-control), deleted regions and duplicated or amplified regions are identified by comparing the two samples. Its drawback is that it is strongly affected by amplification bias and sequencing bias in the experimental steps, so the results are often not very accurate.
The second is the paired-end sequencing method. It can detect structural variations such as large-fragment insertions, deletions, inversions, and translocations, but it is limited by the standard deviation of the insert size (the insert is the DNA fragment produced by fragmentation during construction of the sequencing library, before sequencing). In addition, most implementations depend heavily on the alignment algorithm, so parameter settings strongly influence the results.
The third is the split-read method. It can locate the breakpoints of a structural variation precisely from soft-clipped reads in the alignment, but the results are strongly affected by repetitive sequences in the genome, and the method likewise depends heavily on the alignment algorithm and its parameter settings.
The fourth is de novo assembly. Although this method detects structural variation more directly, assembling short second-generation reads is still difficult because of repetitive regions in the genome, and the cost increases considerably.
In summary, methods that detect structural variation at the genomic DNA level from second-generation sequencing still need higher speed and higher sensitivity and specificity.
Summary of the invention
To solve at least some of the technical problems in the prior art, the present invention provides a method for detecting structural variation of a target gene based on second-generation sequencing data. The method allows gene structural variation to be analyzed simply and quickly, and the results are reliable, with high sensitivity and high specificity. Specifically, the invention includes the following.
A first aspect of the present invention provides a method for detecting structural variation of a target gene based on second-generation sequencing data, comprising the following steps:
(1) in a set consisting of multiple raw sequencing reads, for each raw read, trimming m bases from the 5' end and m bases from the 3' end, and then taking n bases from each of the 5' end and the 3' end of the trimmed read to form query sequence A and query sequence B, where m is an integer from 0 to 20 and n is an integer from 27 to 50;
(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using the whole-genome sequence as a second reference sequence library, where the total sequence length of the first reference sequence library is less than the total sequence length of the second reference sequence library;
(3) performing a first alignment of query sequences A and B, each separately, against the sequences in the first reference sequence library,
if query sequences A and B exactly match sequences of different target genes in the first reference sequence library, taking query sequences A and B as a first aligned sequence pair,
if query sequences A and B cannot be exactly matched to sequences in the first reference sequence library, or query sequences A and B both exactly match sequences within the same gene in the first reference sequence library, terminating the alignment and discarding query sequences A and B;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the second reference sequence library,
if each sequence of the first aligned sequence pair exactly matches a sequence in the second reference sequence library, and the exactly matching pair is a uniquely aligned sequence pair, taking the first aligned sequence pair as a second aligned sequence pair,
if, in the second alignment result, the number of mismatches is 1 or more, or the number of mismatches is 0 but the pair is not a uniquely aligned sequence pair, terminating the alignment and discarding the first aligned sequence pair.
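The four steps above can be sketched in Python as a two-stage filter. This is a minimal illustration under stated assumptions, not the patented implementation: plain exact substring search stands in for the alignment tools, only the forward strand is searched, and the function names, the toy gene names GENE1 and GENE2, and all sequences are hypothetical.

```python
def genes_matching(query, gene_library):
    """Names of candidate genes that contain `query` with 0 mismatches."""
    return {name for name, seq in gene_library.items() if query in seq}


def count_exact_hits(query, genome):
    """Number of exact occurrences of `query` across all genome sequences."""
    hits = 0
    for seq in genome.values():
        pos = seq.find(query)
        while pos != -1:
            hits += 1
            pos = seq.find(query, pos + 1)  # step by 1 so overlapping hits count
    return hits


def is_second_aligned_pair(seq_a, seq_b, gene_library, genome):
    """Apply step (3) and then step (4) to one pair of query sequences."""
    genes_a = genes_matching(seq_a, gene_library)
    genes_b = genes_matching(seq_b, gene_library)
    # Step (3): both must match exactly, and in different candidate genes.
    # Assumption: a pair qualifies if some gene of A differs from some gene of B.
    if not any(ga != gb for ga in genes_a for gb in genes_b):
        return False
    # Step (4): each sequence must hit the whole genome exactly once.
    return (count_exact_hits(seq_a, genome) == 1
            and count_exact_hits(seq_b, genome) == 1)


# Toy data (hypothetical, far shorter than real genes or reads).
gene_library = {"GENE1": "AAACCCGGGTTT", "GENE2": "TTGACATGCAAG"}
genome = {
    "chr_toy": "GT" + gene_library["GENE1"] + "CTCT" + gene_library["GENE2"] + "AC"
}

fusion_like = is_second_aligned_pair("ACCCGG", "GACATG", gene_library, genome)
same_gene = is_second_aligned_pair("AAACCC", "GGGTTT", gene_library, genome)
```

In this toy setup the first pair spans two different genes and each half is unique in the genome, so it survives both stages; the second pair falls within one gene and is dropped at step (3).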
A second aspect of the present invention provides another method for detecting structural variation of a target gene based on second-generation sequencing data, comprising the following steps:
(1) in a set consisting of multiple raw sequencing reads, for each raw read, trimming m bases from the 5' end and m bases from the 3' end, and then taking n bases from each of the 5' end and the 3' end of the trimmed read to form query sequence A and query sequence B, where m is an integer from 0 to 20 and n is an integer from 27 to 50;
(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using the whole-genome sequence as a second reference sequence library, where the total sequence length of the first reference sequence library is less than the total sequence length of the second reference sequence library;
(3) performing a first alignment of query sequences A and B, each separately, against the sequences in the second reference sequence library,
if query sequences A and B each exactly match a sequence in the second reference sequence library, and the exactly matching pair is a uniquely aligned sequence pair, taking query sequences A and B as a first aligned sequence pair,
if, in the first alignment result, the number of mismatches is 1 or more, or the number of mismatches is 0 but the pair is not a uniquely aligned sequence pair, terminating the alignment and discarding query sequences A and B;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the first reference sequence library,
if the sequences of the first aligned sequence pair exactly match sequences of different target genes in the first reference sequence library, taking the first aligned sequence pair as a second aligned sequence pair,
if the first aligned sequence pair cannot be exactly matched to the corresponding sequences in the first reference sequence library, or both sequences exactly match sequences within the same gene in the first reference sequence library, terminating the alignment and discarding the first aligned sequence pair.
The methods of the first and second aspects of the invention include repeating steps (3) and (4). Preferably, they further comprise determining that the target gene structural variation is present when the number of second aligned sequence pairs is 2 or more, and otherwise determining that the target gene structural variation is absent.
In the methods of the first and second aspects of the invention, the target gene structural variation includes at least one of gene fusion, gene inversion, and gene translocation.
In the methods of the first and second aspects of the invention, the length p of each raw read in the set consisting of multiple raw sequencing reads is 75-350 bp, and p > (n + m) × 2.
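The constraint p > (n + m) × 2 keeps the two n-base query sequences from overlapping once m bases have been trimmed from each end. A small sketch of that arithmetic follows; the function name is hypothetical, and the values (a 151-base read with m = 10 and n = 35, and a 75-base read with m = 20 and n = 50) are chosen only for illustration.

```python
def unused_middle_bases(p, m, n):
    """Bases left between query sequences A and B for a read of length p.

    m bases are trimmed from each end, then n bases are taken from each end
    of what remains. A positive result means A and B do not overlap, which is
    the condition p > (n + m) * 2.
    """
    return p - 2 * (n + m)


middle = unused_middle_bases(p=151, m=10, n=35)     # 151 - 90
too_short = unused_middle_bases(p=75, m=20, n=50)   # 75 - 140
```

A negative value, as in the second call, indicates that the chosen m and n are too large for that read length.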
In the methods of the first and second aspects of the invention, the number of candidate target genes is 2-100.
In the method of the first aspect of the invention, the first alignment uses the BLAST algorithm and the second alignment uses the SOAP algorithm.
In the method of the second aspect of the invention, the first alignment uses the SOAP algorithm and the second alignment uses the BLAST algorithm.
In the methods of the first and second aspects of the invention, the multiple raw sequencing reads are single-end sequencing reads or merged sequences of paired-end sequencing reads.
The method of the invention starts directly from raw sequencing data and detects gene structural variation through fast, simple alignment, which removes the need to tune the parameters of complex alignment algorithms. It markedly shortens detection time, and it can complete the analysis with single-end sequencing alone, which reduces data volume and therefore sequencing cost. It also achieves higher sensitivity and specificity in the detection results.
Detailed description
Various exemplary embodiments of the present invention are now described in detail. This detailed description should not be taken as limiting the invention; it should be understood as a more detailed description of certain aspects, features, and embodiments of the invention.
It should be understood that the terminology used herein is for describing particular embodiments only and is not intended to limit the invention. In addition, for numerical ranges in the present invention, it is understood that the upper and lower limits of the range and each intervening value between them are specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated value or intervening value in that range is also included in the invention. The upper and lower limits of these smaller ranges may independently be included in or excluded from the range.
Unless otherwise stated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention belongs. Although only preferred methods and materials are described, any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the invention. All documents mentioned in this specification are incorporated by reference to disclose and describe the methods and/or materials in connection with which they are cited. In case of conflict with any incorporated document, the content of this specification controls.
The terms "comprising", "including", "having", "containing", and the like used herein are open-ended terms, meaning including but not limited to. The term "and/or" used herein includes any one of the listed items or all combinations of them.
The method of the first aspect of the invention is a method for detecting structural variation of a target gene based on second-generation sequencing data, comprising the following 4 steps:
(1) in a set consisting of multiple raw sequencing reads, for each raw read, trimming m bases from the 5' end and m bases from the 3' end, and then taking n bases from each of the 5' end and the 3' end of the trimmed read to form query sequence A and query sequence B, where m is an integer from 0 to 20 and n is an integer from 27 to 50;
(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using the human genome sequence as a second reference sequence library, where the total sequence length of the first reference sequence library is less than the total sequence length of the second reference sequence library;
(3) performing a first alignment of query sequences A and B, each separately, against the sequences in the first reference sequence library,
if query sequences A and B exactly match sequences of different genes in the first reference sequence library, taking query sequences A and B as a first aligned sequence pair,
if query sequences A and B cannot be exactly matched to sequences in the first reference sequence library, or query sequences A and B both exactly match sequences within the same gene in the first reference sequence library, terminating the alignment and discarding query sequences A and B;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the second reference sequence library,
if each sequence of the first aligned sequence pair exactly matches a sequence in the second reference sequence library, and the exactly matching pair is a uniquely aligned sequence pair, taking the first aligned sequence pair as a second aligned sequence pair,
if, in the second alignment result, the number of mismatches is 1 or more, or the number of mismatches is 0 but the pair is not a uniquely aligned sequence pair, terminating the alignment and discarding the first aligned sequence pair.
Each step is described in detail below.
Step (1):
In step (1) of the invention, the raw sequencing reads (sometimes called raw sequences) are sequences produced by second-generation sequencing. The length p of each read is generally 75-350 bp, preferably 100-200 bp. The raw reads include single-end sequencing reads, or merged sequences of paired-end sequencing reads.
In the method of the invention, each raw read is processed, which includes trimming m bases from each of the 5' end and the 3' end, where m is an integer from 0 to 20, preferably 5-10, to improve detection sensitivity. In certain embodiments m may be 0, that is, the 5'-end and 3'-end sequences of the raw read as obtained can be used directly.
In the present invention, two segments are taken from the trimmed raw read as query sequences A and B. The lengths of query sequences A and B are not particularly limited; in general each is n bases, where n is an integer from 27 to 50, preferably from 30 to 45. The lengths of A and B may be the same or different. In certain embodiments, A and B have the same length, which is 32-36 bp. If n is too short, the accuracy of the detection result decreases. If n is too long, detection speed suffers. When n is an integer from 27 to 50, good accuracy is obtained while detection speed is maintained.
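The trimming and extraction in step (1) can be sketched as follows. This is an illustrative sketch, assuming A and B have the same length n; the function name and the synthetic read are hypothetical, and the defaults m = 10 and n = 35 are simply values inside the preferred ranges stated above.

```python
def extract_query_pair(read, m=10, n=35):
    """Trim m bases from each end of `read`, then take n bases from each end."""
    if not (0 <= m <= 20 and 27 <= n <= 50):
        raise ValueError("m must be in 0-20 and n in 27-50")
    if len(read) <= 2 * (n + m):
        raise ValueError("read too short: need len(read) > (n + m) * 2")
    # Slice with an explicit end index so that m == 0 keeps the whole read
    # (read[m:-m] would return an empty string when m is 0).
    trimmed = read[m:len(read) - m]
    seq_a = trimmed[:n]    # 5'-end query sequence A
    seq_b = trimmed[-n:]   # 3'-end query sequence B
    return seq_a, seq_b


# Synthetic 151-base read (hypothetical): 10 A, 35 C, 61 G, 35 T, 10 A.
read = "A" * 10 + "C" * 35 + "G" * 61 + "T" * 35 + "A" * 10
seq_a, seq_b = extract_query_pair(read, m=10, n=35)
```

With this synthetic read, the trimmed ends are the runs of A, so query sequence A is the run of C and query sequence B is the run of T; the 61 central bases are not used.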
Note that the raw sequencing reads of the invention are sequences of the nucleic acids in the sample and do not include tags or labels introduced for sequencing, such as library sequences.
In the present invention, the set consisting of multiple raw sequencing reads may be the set of all raw reads from the second-generation sequencing run, or a subset of them. The composition of the set is not particularly limited as long as the set covers the sequences of the target genes. The set of all second-generation sequencing reads of the sample is preferred.
Step (2):
Step (2) is the step of selecting the reference sequence libraries, which comprise the first reference sequence library and the second reference sequence library.
The first reference sequence library consists of the sequences of multiple candidate target genes, where the candidate target genes are a group of genes that includes the target genes. The number of candidate target genes is 2 or more, preferably 2-100. If the number is too large, detection speed decreases. If the number is too small, detection is fast, but the range of detectable target genes is narrow and the scope of application is limited. For example, to detect a fusion of the two genes ALK and EML4, the first reference sequence library must include at least ALK and EML4. Preferably, in the first reference sequence library the sequence of each candidate target gene is stored in a form that allows the genes to be distinguished from one another. Then, when query sequence A and/or B matches exactly, the corresponding gene can be readily determined, which makes it possible to identify which target genes are involved in the structural variation.
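One way to keep the candidate genes distinguishable is to key each sequence by gene name, for instance in FASTA-style text where each header line carries the name. The sketch below is an assumption about storage, not something the patent specifies; the function name is hypothetical and the sequences are short placeholders, not real ALK or EML4 sequence.

```python
def parse_gene_library(fasta_text):
    """Parse FASTA-style text into {gene_name: sequence}."""
    library = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word of the header is the gene name
            library[name] = []
        elif name is not None:
            library[name].append(line.upper())
    return {gene: "".join(parts) for gene, parts in library.items()}


# Placeholder sequences (hypothetical, not real gene sequence).
fasta_text = """>ALK placeholder
ACGTACGTAC
GGTTAACC
>EML4 placeholder
TTGACATGCA
"""
library = parse_gene_library(fasta_text)
total_length = sum(len(seq) for seq in library.values())
```

Because every hit is looked up through a gene name, a match for query sequence A or B can be reported together with the gene it came from.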
The second reference sequence library consists of the complete sequence of the entire genome of the species, i.e. the whole-genome sequence. Known whole-genome databases can be used. For example, when the target gene is a human gene, the entire human genome sequence can be used as the second reference sequence library. It can be built from any currently known data source. For example, in an exemplary scheme it can come from the sequence data at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/.
In the present invention, the total sequence length of the first reference sequence library is less than the total sequence length of the second reference sequence library. Preferably, the total sequence length of the first reference sequence library is 1/10 to 1/10000 of that of the second reference sequence library, more preferably 1/10 to 1/1000, and still more preferably 1/100 to 1/1000.
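The size relationship between the two libraries is a simple ratio of total lengths. A minimal sketch, with hypothetical function names and toy lengths rather than real gene or genome sizes:

```python
from fractions import Fraction


def library_size_ratio(first_lengths, second_lengths):
    """Total length of the first library divided by that of the second."""
    return Fraction(sum(first_lengths), sum(second_lengths))


def in_preferred_range(ratio, low=Fraction(1, 10000), high=Fraction(1, 10)):
    """True if the ratio lies within the broadest preferred range, 1/10000 to 1/10."""
    return low <= ratio <= high


# Toy lengths (hypothetical): two candidate genes against one toy genome.
ratio = library_size_ratio([600, 400], [1_000_000])
ok = in_preferred_range(ratio)
```

Exact fractions are used here so that the boundary values 1/10 and 1/10000 compare without floating-point rounding.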
Step (3):
Step (3) uses the first alignment to quickly find first aligned sequence pairs that correspond to potential structural variation. It comprises aligning query sequences A and B, each separately, against the sequences in the first reference sequence library. If query sequences A and B exactly match sequences of different genes in the first reference sequence library ("exact match" in the present invention means that the number of mismatches is 0), query sequences A and B are taken as a first aligned sequence pair. If query sequences A and B cannot be exactly matched to the corresponding sequences in the first reference sequence library ("cannot be exactly matched" or "incomplete match" in the present invention means that the number of mismatches is 1 or more, or that the sequence cannot be aligned at all), or if query sequences A and B both exactly match sequences within the same gene in the first reference sequence library, the alignment is terminated and query sequences A and B are discarded.
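The "exact match" and "incomplete match" definitions can be illustrated by counting mismatches directly. The sketch below is an assumption-laden simplification: it considers only ungapped placements on the forward strand, whereas real aligners also handle gaps and report failure to align by their own criteria. The function names and sequences are hypothetical.

```python
def min_mismatches(query, reference):
    """Fewest mismatches over all ungapped placements of query on reference.

    Returns None when the query is empty or longer than the reference,
    i.e. when no placement exists.
    """
    q = len(query)
    if q == 0 or q > len(reference):
        return None
    best = q
    for start in range(len(reference) - q + 1):
        window = reference[start:start + q]
        mismatches = sum(1 for a, b in zip(query, window) if a != b)
        best = min(best, mismatches)
    return best


def is_exact_match(query, reference):
    """True only when some placement has 0 mismatches."""
    return min_mismatches(query, reference) == 0


reference = "AAACCCGGGTTT"                       # toy reference (hypothetical)
exact = min_mismatches("CCCGGG", reference)      # identical to a substring
one_off = min_mismatches("CCCGAG", reference)    # differs at one position
```

Under the patent's definition, only the first query counts as an exact match; the second would be discarded.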
In the present invention, the first alignment can be performed with any known algorithm. Preferably, the first alignment is performed with the BLAST algorithm.
Sequences containing a structural variation of the target gene make up only a very small part of whole-genome data, while second-generation sequencing data contain a very large number of raw reads. If every raw read were aligned against the whole genome, alignment time would increase greatly and detection would be very slow. As described above, the first reference sequence library is small, which greatly shortens the first alignment of each raw read and excludes most raw reads that contain no structural variation, so that sequences potentially containing structural variation are found quickly.
Step (4):
Step (4) uses the second alignment to verify the result of the first alignment and thereby improve detection accuracy. It comprises performing a second alignment of the first aligned sequence pair against the sequences in the second reference sequence library. If each sequence of the first aligned sequence pair exactly matches the corresponding sequence in the second reference sequence library, and the exactly matching pair is a uniquely aligned sequence pair ("uniquely aligned" in the present invention means that the sequence aligns to the reference once and only once), the first aligned sequence pair is taken as a second aligned sequence pair. If, in the second alignment result, the number of mismatches is 1 or more, or the number of mismatches is 0 but the pair is not a uniquely aligned sequence pair, the alignment is terminated and the first aligned sequence pair is discarded.
In the present invention, the second alignment can be performed with any known algorithm. Preferably, the second alignment is performed with the SOAP algorithm.
Because of the complexity of gene sequences, alignment against the first reference sequence library alone may produce errors; verification against the second reference sequence library greatly improves the accuracy of the detection result.
Note that in certain embodiments every raw read in the set can be analyzed, but this does not mean that every raw read passes through all of steps (1)-(4). Most raw reads are discarded at step (3), and a small number more may be discarded at step (4). In certain embodiments, only some of the raw reads in the set are analyzed. For example, once second aligned sequence pairs have been obtained (for example, 2 or more second aligned sequence pairs), the alignment can be terminated and the remaining raw reads are not analyzed.
In certain embodiments, the method of the invention includes repeating steps (3) and (4) to obtain multiple second aligned sequence pairs. In certain embodiments, if a second aligned sequence pair is obtained, it can be determined that a gene structural variation is present. Preferably, when the number x of second aligned sequence pairs is 2 or more, the target gene structural variation is determined to be present; otherwise it is determined to be absent. A value of x of 2 or more shows that the structural variation is present in 2 or more raw reads, which further improves the accuracy of detection.
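The decision rule and the optional early termination described above can be sketched as a counter over per-read outcomes. This is a minimal illustration; the function name is hypothetical, and each boolean stands for whether one read's query pair became a second aligned sequence pair.

```python
def call_structural_variation(pair_results, threshold=2):
    """Return (is_positive, reads_examined).

    `pair_results` yields True for each read whose query pair survived both
    alignments (a second aligned sequence pair) and False otherwise.
    """
    confirmed = 0
    examined = 0
    for passed in pair_results:
        examined += 1
        if passed:
            confirmed += 1
            if confirmed >= threshold:
                return True, examined   # stop early; remaining reads are skipped
    return False, examined


positive = call_structural_variation([False, True, False, True, False, False])
negative = call_structural_variation([False, True, False])
```

Passing a generator instead of a list would let the per-read alignments themselves be skipped once the threshold is reached.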
Although the steps are labeled with numbers such as (1)-(4) in the present invention, the numbering serves only to distinguish the steps and does not indicate an order among them.
The method of the second aspect of the invention is another method for detecting structural variation of a target gene based on second-generation sequencing data, comprising the following 4 steps:
(1) the same as step (1) of the first aspect of the invention;
(2) the same as step (2) of the first aspect of the invention;
(3) performing a first alignment of query sequences A and B, each separately, against the sequences in the second reference sequence library,
if query sequences A and B each exactly match a sequence in the second reference sequence library, and the exactly matching pair is a uniquely aligned sequence pair, taking query sequences A and B as a first aligned sequence pair,
if, in the first alignment result, the number of mismatches is 1 or more, or the number of mismatches is 0 but the pair is not a uniquely aligned sequence pair, terminating the alignment and discarding query sequences A and B;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the first reference sequence library,
if the sequences of the first aligned sequence pair exactly match sequences of different target genes in the first reference sequence library, taking the first aligned sequence pair as a second aligned sequence pair,
if the first aligned sequence pair cannot be exactly matched to the corresponding sequences in the first reference sequence library, or both sequences exactly match sequences within the same gene in the first reference sequence library, terminating the second alignment and discarding the first aligned sequence pair.
In the method of the second aspect of the invention, the first aligned sequence pair and the second aligned sequence pair differ from those in the method of the first aspect; all other names and terms have the same meaning as in the method of the first aspect and are not described again here.
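Since a pair is kept only if it passes both alignments, the two aspects appear to accept the same pairs and to differ in which filter runs first, and therefore in how many pairs reach the second filter. The sketch below illustrates that reading with two hypothetical predicate functions over toy records; it is an interpretation, not a statement made in the patent.

```python
def run_two_stage(pairs, first_filter, second_filter):
    """Return (accepted pairs, number of calls made to the second filter)."""
    accepted = []
    second_calls = 0
    for pair in pairs:
        if not first_filter(pair):
            continue                    # discarded at the first alignment
        second_calls += 1
        if second_filter(pair):
            accepted.append(pair)
    return accepted, second_calls


# Toy records: (name, matches different candidate genes?, unique exact genome hit?)
pairs = [
    ("r1", True, True),
    ("r2", False, True),
    ("r3", False, True),
    ("r4", True, False),
]


def gene_filter(pair):      # stands in for alignment to the first library
    return pair[1]


def genome_filter(pair):    # stands in for alignment to the second library
    return pair[2]


first_aspect = run_two_stage(pairs, gene_filter, genome_filter)
second_aspect = run_two_stage(pairs, genome_filter, gene_filter)
```

Both orders accept the same record in this toy data, but they pass different numbers of pairs on to the second filter.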
Example 1
Paired-end sequencing data Data-0001, obtained by probe capture and sequencing on a NextSeq500 sequencer with a read length of 151 bp, were analyzed with the method of the first aspect of the invention. The sample in this example is a positive sample verified as ALK-EML4 by the ARMS method.
1. The sequences of the ALK gene and the EML4 gene were extracted from the human genome as the target gene reference sequences.
2. The paired-end sequencing data of sample Data-0001 were merged into single-end data. 10 bp were trimmed from each end of each read, and the first 35 bp and last 35 bp of the remaining read were retained, giving two 35 bp sequences (70 bp in total) as the read data for subsequent analysis.
3. The processed data from step 2 were aligned with the BLAST algorithm against the first reference sequence library containing the ALK gene and the EML4 gene. If there is no mismatch in the alignment results of the two segments, i.e. the number of mismatches is 0, and the results show that the two query sequences exactly match corresponding positions in the ALK gene and the EML4 gene respectively, the query sequence pair is taken as a first aligned sequence pair and marked ID1. If the number of mismatches in the first alignment result is 1 or more, or the sequences cannot be aligned, or the two query sequences match exactly but both fall within the ALK gene or both within the EML4 gene, the alignment of that raw read is terminated.
4. The sequence pairs marked ID1 in step 3 were aligned with the SOAP algorithm against the human genome sequence library (from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/). If each sequence of an ID1 pair exactly matches the corresponding sequence in the human genome sequence library, and the exactly matching pair is a uniquely aligned sequence pair, the ID1 pair is marked accordingly.
If, in this alignment result, the number of mismatches is 1 or more, or the number of mismatches is 0 but the pair is not a uniquely aligned sequence pair, the alignment is terminated and the data for that ID1 pair are discarded.
Steps 3 and 4 above are repeated. If 2 or more sequence pairs so marked are obtained, the result is positive for structural variation and is labeled Se-SV; if 1 or fewer are obtained, the result is negative.
Example 2
Gene structural variation was detected in the same way as in Example 1, except that sample Data-0002 was used.
Reference example 1
The paired-end sequencing data of samples Data-0001 and Data-0002 were analyzed with the BWA algorithm (Aln) and two split-read methods; the results are denoted Aln-S1 and Aln-S2 respectively.
Reference example 2
The paired-end sequencing data of samples Data-0001 and Data-0002 were analyzed with the BWA algorithm (Mem) and the same two split-read methods; the results are denoted Mem-S1 and Mem-S2.
The detection results and run times of the method of the invention and of the methods of Reference examples 1 and 2 were compared; the statistics are shown in the following table:
Table 1. Comparison of the detection results of the three methods
Various improvements and changes can be made to the specific embodiments described in this specification without departing from the scope or spirit of the invention, as will be apparent to those skilled in the art. Other embodiments derived from this specification will be apparent to the skilled person. The specification and the examples are exemplary only.

Claims (16)

1. A method for detecting target gene structural variation based on second-generation sequencing data, comprising the following steps:
(1) in a set consisting of multiple original sequencing sequences, for each original sequencing sequence, trimming m bases from each of the 5′ end and the 3′ end, then taking n bases from each of the 5′ end and the 3′ end of the trimmed sequence to form a sequence A to be aligned and a sequence B to be aligned, wherein m is an integer from 0 to 20 and n is an integer from 27 to 50;
(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using the whole genome sequence as a second reference sequence library, wherein the sum of the sequence lengths in the first reference sequence library is less than the sum of the sequence lengths in the second reference sequence library;
(3) performing a first alignment of each of the sequences A and B to be aligned against the sequences in the first reference sequence library,
if the sequences A and B to be aligned exactly match sequences of different target genes in the first reference sequence library, respectively, taking the sequences A and B to be aligned as a first aligned sequence pair,
if the sequences A and B to be aligned cannot exactly match sequences in the first reference sequence library, or the sequences A and B to be aligned exactly match sequences within the same gene in the first reference sequence library, terminating the alignment and removing the sequences A and B to be aligned;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the second reference sequence library,
if each sequence of the first aligned sequence pair exactly matches the corresponding sequence in the second reference sequence library, and the exactly matching sequence pair is a uniquely aligned sequence pair, taking the first aligned sequence pair as a second aligned sequence pair,
if, in the second alignment result, the mismatch count is 1 or more, or the mismatch count is 0 but the sequence pair is not a uniquely aligned sequence pair, terminating the alignment and removing the first aligned sequence pair.
2. The method according to claim 1, comprising repeating steps (3) and (4).
3. The method according to claim 2, further comprising: when the number x of second aligned sequence pairs is 2 or more, determining that the target gene structural variation is present; otherwise, determining that the target gene structural variation is absent.
4. The method according to claim 1, wherein the target gene structural variation includes at least one of gene fusion, gene inversion, and gene translocation.
5. The method according to claim 1, wherein the length p of each original sequencing sequence in the set consisting of multiple original sequencing sequences is 75-350 bp, and p > (n+m) × 2.
6. The method according to claim 1, wherein the number of candidate target genes is 2-100.
7. The method according to claim 1, wherein the first alignment uses the BLAST algorithm and the second alignment uses the SOAP algorithm.
8. The method according to claim 1, wherein the multiple original sequencing sequences are merged sequences of paired-end sequencing data, or single-end sequencing sequences.
9. A method for detecting target gene structural variation based on second-generation sequencing data, comprising the following steps:
(1) in a set consisting of multiple original sequencing sequences, for each original sequencing sequence, trimming m bases from each of the 5′ end and the 3′ end, then taking n bases from each of the 5′ end and the 3′ end of the trimmed sequence to form a sequence A to be aligned and a sequence B to be aligned, wherein m is an integer from 0 to 20 and n is an integer from 27 to 50;
(2) forming a first reference sequence library from the sequences of multiple candidate target genes, and using the whole genome sequence as a second reference sequence library, wherein the sum of the sequence lengths in the first reference sequence library is less than the sum of the sequence lengths in the second reference sequence library;
(3) performing a first alignment of each of the sequences A and B to be aligned against the sequences in the second reference sequence library,
if the sequences A and B to be aligned each exactly match sequences in the second reference sequence library, and the exactly matching sequence pair is a uniquely aligned sequence pair, taking the sequences A and B to be aligned as a first aligned sequence pair,
if, in the first alignment result, the mismatch count is 1 or more, or the mismatch count is 0 but the sequence pair is not a uniquely aligned sequence pair, terminating the alignment and removing the sequences A and B to be aligned;
(4) performing a second alignment of the first aligned sequence pair against the sequences in the first reference sequence library,
if the sequences of the first aligned sequence pair exactly match sequences of different target genes in the first reference sequence library, respectively, taking the first aligned sequence pair as a second aligned sequence pair,
if the first aligned sequence pair cannot exactly match sequences in the first reference sequence library, or exactly matches sequences within the same gene in the first reference sequence library, terminating the alignment and removing the first aligned sequence pair.
10. The method according to claim 9, comprising repeating steps (3) and (4).
11. The method according to claim 10, further comprising: when the number x of second aligned sequence pairs is 2 or more, determining that the target gene structural variation is present; otherwise, determining that the target gene structural variation is absent.
12. The method according to claim 9, wherein the target gene structural variation includes at least one of gene fusion, gene inversion, and gene translocation.
13. The method according to claim 9, wherein the length p of each original sequencing sequence in the set consisting of multiple original sequencing sequences is 75-350 bp, and p > (n+m) × 2.
14. The method according to claim 9, wherein the number of candidate target genes is 2-100.
15. The method according to claim 9, wherein the first alignment uses the SOAP algorithm and the second alignment uses the BLAST algorithm.
16. The method according to claim 9, wherein the multiple original sequencing sequences are merged sequences of paired-end sequencing data, or single-end sequencing sequences.
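Step (1) of claims 1 and 9, together with the length condition of claims 5 and 13, can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the patented implementation: the function name `extract_end_pair` is hypothetical, the read is treated as a plain string, and sequence B is returned as read (the claims do not say whether it is reverse-complemented before alignment).

```python
from typing import Tuple


def extract_end_pair(read: str, m: int, n: int) -> Tuple[str, str]:
    """Trim m bases from both ends, then take n bases from each end of the trimmed read.

    Returns (sequence_a, sequence_b): the 5' and 3' end segments to be aligned.
    """
    if not 0 <= m <= 20:
        raise ValueError("m must be an integer from 0 to 20")
    if not 27 <= n <= 50:
        raise ValueError("n must be an integer from 27 to 50")
    p = len(read)
    # Condition from claims 5 and 13: p > (n + m) * 2, so A and B cannot overlap.
    if p <= (n + m) * 2:
        raise ValueError("read length p must exceed (n + m) * 2")
    trimmed = read[m:p - m]        # drop m bases at the 5' end and m at the 3' end
    sequence_a = trimmed[:n]       # n bases at the 5' end of the trimmed read
    sequence_b = trimmed[-n:]      # n bases at the 3' end of the trimmed read
    return sequence_a, sequence_b
```

Because p > (n + m) × 2, the trimmed read is longer than 2n, so the two segments never share a base; a read spanning a fusion junction can therefore place A and B in different genes.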
CN201711320249.0A 2017-12-12 2017-12-12 Method based on two generation sequencing datas detection target gene structure variation Active CN108073791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711320249.0A CN108073791B (en) 2017-12-12 2017-12-12 Method based on two generation sequencing datas detection target gene structure variation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711320249.0A CN108073791B (en) 2017-12-12 2017-12-12 Method based on two generation sequencing datas detection target gene structure variation

Publications (2)

Publication Number Publication Date
CN108073791A (en) 2018-05-25
CN108073791B (en) 2019-02-05

Family

ID=62158249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711320249.0A Active CN108073791B (en) 2017-12-12 2017-12-12 Method based on two generation sequencing datas detection target gene structure variation

Country Status (1)

Country Link
CN (1) CN108073791B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114807398A (en) * 2018-10-30 2022-07-29 厦门极元科技有限公司 Identification method and device, and typing method and device for salmonella in metagenome
CN109727644B (en) * 2018-11-12 2021-09-07 山东省医学科学院基础医学研究所 Venn diagram making method and system based on microbial genome second-generation sequencing data
CN109652513B (en) * 2019-02-25 2022-08-23 元码基因科技(北京)股份有限公司 Method and kit for accurately detecting individual mutation of liquid biopsy based on second-generation sequencing technology
CN110085284B (en) * 2019-04-29 2021-02-26 深圳大学 SSD (solid State disk) -oriented gene comparison method and system
CN110246543B (en) * 2019-06-21 2021-02-26 元码基因科技(北京)股份有限公司 Method and computer system for detecting copy number variation by using single sample based on second-generation sequencing technology
CN110556165B (en) * 2019-09-12 2022-03-18 浙江大学 Method for rapidly identifying transgene or gene editing material and insertion site thereof
CN110797087B (en) * 2019-10-17 2020-11-03 南京医基云医疗数据研究院有限公司 Sequencing sequence processing method and device, storage medium and electronic equipment
CN117238376B (en) * 2023-09-27 2024-04-30 上海序祯达生物科技有限公司 Virus vector sequence analysis system and method based on second-generation sequencing technology

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001040301A2 (en) * 1999-12-01 2001-06-07 Akzo Nobel N.V. A gene, disrupted in schizophrenia
WO2007000663A3 (en) * 2005-03-09 2007-03-29 Methexis Genomics N V Genetic diagnosis using multiple sequence variant analysis
CN105512514A (en) * 2014-09-23 2016-04-20 深圳华大基因股份有限公司 MHC completion database, and establishment method and application thereof
CN105543380A (en) * 2016-01-27 2016-05-04 北京诺禾致源生物信息科技有限公司 Method and device for detecting gene fusion
CN105631242A (en) * 2015-12-25 2016-06-01 中国农业大学 Method for identifying transgenic events through whole genome sequencing data
CN105989246A (en) * 2015-01-28 2016-10-05 深圳华大基因研究院 Variation detection method and device assembled based on genomes
WO2017116139A1 (en) * 2015-12-28 2017-07-06 (주)신테카바이오 System for analyzing bioactive variation using genetic variation information on individual's genome
CN107180166A (en) * 2017-04-21 2017-09-19 北京希望组生物科技有限公司 A kind of full-length genome structure variation analysis method and system being sequenced based on three generations


Also Published As

Publication number Publication date
CN108073791A (en) 2018-05-25

Similar Documents

Publication Publication Date Title
CN108073791B (en) Method based on two generation sequencing datas detection target gene structure variation
CN110010193B (en) Complex structure variation detection method based on hybrid strategy
CN113160882B (en) Pathogenic microorganism metagenome detection method based on third generation sequencing
US20200294628A1 (en) Creation or use of anchor-based data structures for sample-derived characteristic determination
CN105631242A (en) Method for identifying transgenic events through whole genome sequencing data
CN109545278B (en) Method for identifying interaction between plant lncRNA and gene
CN108304694B (en) Method for analyzing gene mutation based on second-generation sequencing data
CN106048009A (en) Label joint for detection of ultra-low-frequency gene mutation and application of label joint
CN114121160B (en) Method and system for detecting macrovirus group in sample
CN107267613A (en) Sequencing data processing system and SMN gene detection systems
CN106355045B (en) A kind of method and device based on amplification second filial sequencing small fragment insertion and deletion detection
CN110838340A (en) Method for identifying protein biomarkers independent of database search
CN104611443A (en) Molecular identification method of kiwi interspecific hybridization cultivar Jinyan
CN110875082B (en) Microorganism detection method and device based on targeted amplification sequencing
CN109337997B (en) Camellia polymorphism chloroplast genome microsatellite molecular marker primer and method for screening and discriminating kindred species
Kruppa et al. Virus detection in high-throughput sequencing data without a reference genome of the host
CN113724783B (en) Method for detecting and typing repetition number of short tandem repeat sequence
CN106021987B (en) Ultralow frequency mutating molecule label clustering clustering algorithm
CN110111839A (en) The method and its application of reads number are supported in mutation in a kind of accurate quantification tumour standard items
CN105603081B (en) Non-diagnosis-purpose qualitative and quantitative detection method for intestinal microorganisms
CN116312779A (en) Method and apparatus for detecting sample contamination and identifying sample mismatch
JP5403563B2 (en) Gene identification method and expression analysis method in comprehensive fragment analysis
CN107665290A (en) A kind of method and apparatus of data processing
CN108304693A (en) Utilize the method for high-flux sequence data analysis Gene Fusion
CN108676906B (en) SSR locus of corn chloroplast genome and application of SSR locus in variety identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20180807

Address after: 215100 402 unit of Biotechnology Park, No. 218 Sang Tian street, Wuzhong District, Suzhou, Jiangsu, China. 2

Applicant after: Code gene technology (Suzhou) Co., Ltd.

Address before: 100102 Chaoyang District, Beijing Guang Shun North Street 5, A 4 area of fusion power.

Applicant before: Meta code gene technology (Beijing) Limited by Share Ltd

GR01 Patent grant