CN103270175B - Method and system for detecting the insertion sites of transgenic foreign fragments - Google Patents



Publication number
CN103270175B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201180062256.XA
Other languages
Chinese (zh)
Other versions
CN103270175A (en)
Inventor
郭小森
陈帅
张雪梅
郎继东
李俊
胡雪松
陈聪群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Application filed by BGI Shenzhen Co Ltd
Publication of CN103270175A
Application granted
Publication of CN103270175B


Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034: Isolating an individual clone by screening libraries
    • C12N15/1082: Preparation or screening gene libraries by chromosomal integration of polynucleotide sequences, HR-, site-specific-recombination, transposons, viral vectors


Abstract

The present invention discloses a method and a system for detecting the insertion sites of transgenic foreign fragments. The method comprises: determining the foreign single reads and genomic single reads through alignment, and determining the insertion sites of the foreign fragments in genomic sequence according to the intersection of the foreign single reads and genomic single reads.

Description

Method and system for detecting the insertion sites of transgenic foreign fragments
Technical field
The present invention relates to the field of bioinformatics, and in particular to a method and system for detecting the insertion sites of transgenic foreign fragments.
Background art
With the marketing and spread of transgenic products, their safety has gradually attracted public attention. Generally speaking, the safety concerns fall into two areas. The first is food safety: whether a transgenic product harms the organism after being eaten by humans or livestock. The second is environmental safety: one of the main concerns is whether, after a genetically modified organism is released into the environment, the foreign gene can drift into the environment, persist and spread there, and cause genetic variation and ecological risk, that is, adverse ecological consequences for individuals, populations, communities, ecosystems and the environment as a whole, damaging the ecological environment and upsetting the original ecological balance. Safety evaluation and control of genetically modified organisms in circulation are therefore particularly important, and the detection of genetic variation is a key part of them.
Before high-throughput, high-resolution genomics technologies emerged, genetic variation was detected mainly by traditional cytogenetic techniques, especially high-resolution chromosome banding. With the development of DNA sequencing technology, detection methods have expanded to high-throughput analysis based on chip technology and DNA sequencing, and to targeted analysis based on PCR (polymerase chain reaction). These methods are time-consuming and laborious, experimentally expensive, and have low resolution for some genetic variations. A new detection technique with higher sensitivity, specificity and resolution and lower cost is therefore urgently needed.
Whole-genome resequencing sequences the genome of an individual from a species whose genome sequence is known, and analyzes differences at the individual or population level. Resequencing is an important route for detecting genetic variation. Existing variation detection comprises four main techniques: single nucleotide polymorphism (SNP), insertion and deletion (Indel), structural variation (SV) and copy number variation (CNV) detection, which provide powerful tools for detecting variation at the genome level. However, these four techniques detect large-fragment variations (>=1 kb) with poor accuracy.
Summary of the invention
One technical problem to be solved by the present disclosure is to provide a method for detecting the insertion sites of transgenic foreign fragments with good accuracy.
According to one aspect of the present invention, a method for detecting the insertion sites of transgenic foreign fragments is provided, comprising:
aligning paired short reads against the foreign insert fragment sequence to determine foreign single reads, the paired short reads being obtained by paired-end resequencing of the sequencing fragments of a sample to be tested;
aligning the paired short reads against a reference genome sequence to determine genomic single reads;
determining the insertion site of the foreign fragment in the genome sequence according to the intersection of the foreign single reads and the genomic single reads.
According to one embodiment of the detection method of the present invention, the foreign single reads comprise paired short reads of which only one read aligns to the foreign insert fragment sequence; the genomic single reads comprise paired short reads of which only one read aligns to the reference genome sequence.
According to one embodiment of the detection method of the present invention, the foreign single reads comprise normal paired short reads of which only one read aligns to the foreign insert fragment sequence, and aligns there only once; the genomic single reads comprise normal paired short reads of which only one read aligns to the reference genome sequence, and aligns there only once.
According to one embodiment of the detection method of the present invention, the method further comprises: filtering the paired short reads to remove unqualified short reads.
According to one embodiment of the detection method of the present invention, filtering the paired short reads to remove unqualified paired short reads comprises: removing paired short reads in which the number of bases with sequencing quality below a predetermined threshold exceeds 50% of the number of bases in the read; and/or removing paired short reads in which the number of bases with an uncertain sequencing result exceeds 10% of the number of bases in the read; and/or removing adapter sequences from the paired short reads.
According to one embodiment of the detection method of the present invention, the length of the sequencing fragments of the sample to be tested is 170-500 bp, 500-1000 bp, 1000-2000 bp or 2000-10000 bp; and/or the length of the short reads is 40-75 bp or 75-200 bp; and/or the total number of bases obtained by sequencing the sequencing fragments of the sample is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence.
In the detection method provided by the embodiments of the present invention, the short reads obtained by paired-end resequencing are aligned against the foreign insert fragment to obtain single reads enriched at the two ends of the foreign insert fragment, and the same short reads are aligned against the reference genome sequence to obtain single reads at the two ends of the foreign insert fragment. The intersection of these two sets is then taken, and the positions of the intersected reads on the reference sequence are determined. The position of the insertion site is determined by counting the support of the intersected reads at each corresponding site on the reference sequence. This makes full use of the positional characteristics of the foreign insert fragment and gives good accuracy.
Another technical problem to be solved by the present disclosure is to provide a system for detecting the insertion sites of transgenic foreign fragments with good accuracy.
According to another aspect of the present invention, a system for detecting the insertion sites of transgenic foreign fragments is provided, comprising:
a sequencing unit, for performing paired-end resequencing on the sequencing fragments of a sample to be tested to obtain paired short reads;
a foreign single-read determining unit, for aligning the paired short reads against the foreign insert fragment sequence to determine foreign single reads;
a genomic single-read determining unit, for aligning the paired short reads against a reference genome sequence to determine genomic single reads;
an insertion site determining unit, for determining the insertion site of the foreign fragment in the genome sequence according to the intersection of the foreign single reads and the genomic single reads.
According to one embodiment of the detection system of the present invention, the foreign single reads comprise paired short reads of which only one read aligns to the foreign insert fragment sequence;
the genomic single reads comprise paired short reads of which only one read aligns to the reference genome sequence.
According to one embodiment of the detection system of the present invention, the foreign single reads comprise normal paired short reads of which only one read aligns to the foreign insert fragment sequence, and aligns there only once; the genomic single reads comprise normal paired short reads of which only one read aligns to the reference genome sequence, and aligns there only once.
According to one embodiment of the detection system of the present invention, the system further comprises a filtering unit, for filtering the paired short reads to remove unqualified short reads.
According to one embodiment of the detection system of the present invention, the filtering unit removes paired short reads in which the number of bases with sequencing quality below a predetermined threshold exceeds 50% of the number of bases in the read; and/or removes paired short reads in which the number of bases with an uncertain sequencing result exceeds 10% of the number of bases in the read; and/or removes adapter sequences from the paired short reads.
According to one embodiment of the detection system of the present invention, the length of the sequencing fragments of the sample to be tested is 170-500 bp, 500-1000 bp, 1000-2000 bp or 2000-10000 bp; and/or the length of the short reads is 40-75 bp or 75-200 bp; and/or the total number of bases obtained by sequencing the sequencing fragments of the sample is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence.
In the detection system provided by the embodiments of the present invention, the foreign single-read determining unit aligns the short reads obtained by paired-end resequencing against the foreign insert fragment to obtain single reads enriched at the two ends of the foreign insert fragment, and the genomic single-read determining unit aligns the same short reads against the reference genome sequence to obtain single reads at the two ends of the foreign insert fragment. The insertion site determining unit takes the intersection of these two sets and determines the positions of the intersected reads on the reference sequence. The position of the insertion site is determined by counting the support of the intersected reads at each corresponding site on the reference sequence. This makes full use of the positional characteristics of the foreign insert fragment and gives good accuracy.
The description of the present invention is given for the purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be obvious to those of ordinary skill in the art. The embodiments were selected and described in order to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and thereby design various embodiments, with various modifications, suited to particular uses.
Brief description of the drawings
Fig. 1 is a schematic diagram of the method of the present invention for detecting the insertion sites of transgenic foreign fragments;
Fig. 2 is a flowchart of one embodiment of the detection method of the present invention;
Fig. 3 is a flowchart of one application example of the detection method of the present invention;
Fig. 4 is a schematic diagram of an example of determining an insertion site according to the present invention;
Fig. 5 is a structural diagram of one embodiment of the detection system of the present invention;
Fig. 6 is a structural diagram of another embodiment of the detection system of the present invention.
Detailed description of the embodiments
The present invention is described more fully below with reference to the accompanying drawings, in which exemplary embodiments of the invention are shown.
Fig. 1 is a schematic diagram of the method of the present invention for detecting the insertion sites of transgenic foreign fragments.
As shown in Fig. 1, in step S102, the paired short reads obtained by paired-end resequencing of the sequencing fragments of the sample to be tested are aligned against the foreign insert fragment sequence to determine the foreign single reads. The foreign single reads comprise paired short reads of which only one of the two reads aligns to the foreign insert fragment sequence. When the sequencing data (transgenic) are aligned against the foreign insert fragment sequence (insert), unpaired short reads that align to the foreign fragment are enriched at the two ends of the insert fragment; these are labeled single reads 1. The foreign insert fragment usually refers to the foreign fragment or transposon sequence introduced by the transgenic technique, and can be obtained, for example, by cloning.
In step S104, the paired short reads are aligned against the reference genome sequence to determine the genomic single reads. The genomic single reads comprise paired short reads of which only one of the two reads aligns to the reference genome sequence. When the sequencing data are aligned against the reference genome sequence (reference), unpaired short reads are likewise enriched at the two ends of the insert fragment; these are labeled single reads 2. The reference genome sequence usually refers to the genome sequence of a given species, which can be obtained by sequencing and assembly.
In step S106, the insertion site of the foreign fragment in the genome sequence is determined according to the intersection of the foreign single reads and the genomic single reads. The intersection of the single reads in the sets single reads 1 and single reads 2 is taken, and the positions of these intersected reads on the reference sequence are determined.
The above embodiment requires two alignment passes. First, the short reads obtained by paired-end resequencing are aligned against the foreign insert fragment to obtain the single reads enriched at the two ends of the foreign insert fragment, and the same short reads are aligned against the reference genome sequence to obtain the single reads at the two ends of the foreign insert fragment. The intersection of these two sets is then taken, and the positions of the intersected reads on the reference sequence are determined. The position of the insertion site is determined by counting the support of the intersected reads at each corresponding site on the reference sequence. This makes full use of the positional characteristics of the foreign insert fragment and gives good accuracy.
Fig. 2 is a flowchart of one embodiment of the detection method of the present invention.
As shown in Fig. 2, in step 202, paired-end resequencing is performed on the sequencing fragments of the sample to be tested to obtain paired short reads.
The DNA sample to be tested is randomly broken into fragments of a certain length, which are called sequencing fragments. The length of the sequencing fragments is chosen, for example, from 170-500 bp, 500-1000 bp, 1000-2000 bp or 2000-10000 bp. Sequencing proceeds from each of the two ends of a sequencing fragment, yielding a pair of short reads from the two ends of that fragment. Each pair of short reads is assigned an ID number, and the two reads of the same pair share the same ID number. The total number of bases obtained by sequencing the sequencing fragments of the sample is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence, to ensure the required genome coverage. Preferably, the total number of bases obtained by sequencing is more than 20 times the genome size.
In step 204, the paired short reads obtained by paired-end resequencing are filtered.
After the short reads obtained by paired-end resequencing are received, unqualified short reads are removed by filtering. For example, reads in which the number of bases with sequencing quality below a predetermined low-quality threshold exceeds, for example, 50% of the number of bases in the whole read are removed, where the low-quality threshold is determined by the specific sequencing technology and sequencing environment, for example a base quality value of B (ASCII value); reads in which the number of bases with an uncertain sequencing result (such as N in Illumina GA sequencing results) exceeds 10% of the number of bases in the whole read are removed; and adapter sequences in the short reads are removed. After adapter removal, the short reads are also aligned against other foreign sequences introduced by the experiment, such as various adapter sequences; a read that contains such a foreign sequence is considered unqualified and is removed. Removing unqualified short reads by filtering improves the accuracy of detection.
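The filtering criteria above can be sketched in Python as follows. This is a minimal illustration and not the patent's implementation: the function names, the tuple layout of a read pair, and the choice to treat quality characters at or below `B` as low quality are all assumptions made for the example.

```python
def is_qualified(seq, qual, low_q="B", max_low_frac=0.5,
                 max_n_frac=0.1, adapters=()):
    """Return True if one read passes the three filters described above.

    Assumption: a base counts as low quality when its quality character
    is at or below `low_q` in ASCII order.
    """
    n = len(seq)
    if n == 0 or len(qual) != n:
        return False
    # Filter 1: too many low-quality bases (more than 50% of the read).
    low = sum(1 for q in qual if q <= low_q)
    if low > max_low_frac * n:
        return False
    # Filter 2: too many uncertain bases (more than 10% of the read).
    if seq.upper().count("N") > max_n_frac * n:
        return False
    # Filter 3: read still contains a known adapter sequence.
    if any(a and a in seq for a in adapters):
        return False
    return True


def filter_pairs(pairs, **kwargs):
    """Keep a pair only if both mates are qualified.

    `pairs` is a list of (read_id, (seq1, qual1), (seq2, qual2)) tuples,
    a hypothetical in-memory layout chosen for this sketch.
    """
    return [
        (rid, r1, r2)
        for rid, r1, r2 in pairs
        if is_qualified(*r1, **kwargs) and is_qualified(*r2, **kwargs)
    ]
```

Dropping the whole pair when either mate fails keeps the pairing by ID number intact for the later intersection step.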
In step 206, the paired short reads are aligned against the foreign insert fragment sequence to obtain the foreign single reads.
The alignment can be performed with any common short-read alignment software, such as soap or bwa. The sequencing fragments should be of essentially the same length, within a permitted tolerance range. Short reads obtained by sequencing fragments whose length lies within the normal range are called normal short reads, and short reads obtained by sequencing fragments outside the normal range are called abnormal short reads. The tolerance range can be set as required. During alignment, the minimum alignment length of a resequenced short read is 40 bp, and the maximum number of mismatches allowed per short read should be as small as possible to ensure precise alignment. For example, for a short read length of 90 bp, the maximum number of mismatches is set to 1 or 2. The value actually set can vary with the length of the short reads.
After alignment, the resequenced short reads fall into three types. (1) soap reads: normal short reads that exist as a pair and both align to the foreign insert fragment sequence. (2) single reads: of two paired normal short reads, only one aligns to the foreign insert fragment sequence; such short reads are marked as single reads. In addition, the short-read alignment software may mark paired abnormal short reads as single reads; in that case a filtering step can be added to remove these abnormal short reads. (3) unmap reads: neither of the two paired short reads aligns to the foreign insert fragment sequence; such short reads are marked as unmap reads. From the alignment results, the single reads of which only one read aligns to the foreign insert fragment sequence, and aligns there only once, are extracted, which ensures the specificity of the alignment results. The single reads obtained are stored, for example in a file named single file 1.
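The three-way classification and the extraction of uniquely aligned single reads can be sketched as below. This is only an illustration under stated assumptions: the input is a hypothetical mapping from pair ID to the number of alignment hits of each mate, and it does not reproduce the output format of soap or any other aligner.

```python
def classify_pair(hits_mate1, hits_mate2):
    """Classify a read pair by how many of its mates aligned.

    Returns "paired" if both mates aligned, "single" if exactly one did,
    and "unmapped" if neither did.
    """
    if hits_mate1 > 0 and hits_mate2 > 0:
        return "paired"
    if hits_mate1 == 0 and hits_mate2 == 0:
        return "unmapped"
    return "single"


def unique_single_ids(hit_counts):
    """Return IDs of pairs with exactly one aligned mate that hit only once.

    `hit_counts` maps a pair ID to (hits of mate 1, hits of mate 2); this
    structure is an assumption made for the sketch.
    """
    ids = set()
    for read_id, (h1, h2) in hit_counts.items():
        # max(h1, h2) is the hit count of the one mate that aligned.
        if classify_pair(h1, h2) == "single" and max(h1, h2) == 1:
            ids.add(read_id)
    return ids
```

The same two functions apply unchanged to the alignment against the reference genome, since only the target sequence differs.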
In step 208, the paired short reads are aligned against the reference genome sequence to obtain the genomic single reads.
The alignment can be performed with any common short-read alignment software, such as soap or bwa. After alignment, the resequenced short reads fall into three types. (1) soap reads: normal short reads that exist as a pair and both align to the reference genome. (2) single reads: of two paired normal short reads, only one aligns to the reference genome; such short reads are marked as single reads. In addition, the short-read alignment software may mark paired abnormal short reads as single reads; in that case a filtering step can be added to remove these abnormal short reads. (3) unmap reads: neither of the two paired short reads aligns to the reference genome; such short reads are marked as unmap reads. From the alignment results, the single reads of which only one read of the pair aligns to the reference genome, and aligns there only once, are extracted, which ensures the specificity of the alignment results, and are placed in a file named single file 2. The unpaired single reads in single file 2 are sorted by sample number, and the alignment results are split by chromosome.
In step 210, the insertion site of the foreign fragment is determined from the foreign single reads and the genomic single reads.
The single reads with identical ID numbers in single file 1 and single file 2 are extracted. The intersection is taken on the ID numbers of the single reads in single file 1 and single file 2: single reads that have the same ID number in both files and are mates of each other are the qualified single reads after intersection. (Of two paired short reads, one aligns to the reference genome, and only then can the other align to the foreign insert fragment; the two reads of such a pair share the same ID number.) The intersected reads are sorted by their position on the reference genome (the start position at which the short read aligns to the reference genome), the read length is added as the step, and the support of the intersected single reads at each corresponding site on the reference sequence is counted, thereby determining the insertion site of the transgenic foreign fragment. The short-read support at the sites along the reference sequence forms a trough-shaped curve, with one lowest point and a raised peak on each side. There is a peak of short-read enrichment on each side of the insertion site; approaching the insertion site, the number of supporting short reads gradually decreases, and a break in the short reads (that is, the lowest point of short-read support) appears near the insertion site. The range of this break (lowest point) can be identified as the insertion position of the foreign fragment.
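The intersection by ID number and the search for the trough in read support can be sketched as follows. This is a simplified illustration rather than the patent's procedure: it handles one chromosome, assumes all reads have the same length, and uses a hypothetical rule (searching only between the outermost positions that reach half of the maximum support) to keep the ramp-up at the outer edges from being mistaken for the trough.

```python
from collections import Counter


def intersect_single_reads(foreign_ids, genome_starts):
    """Keep genomic single reads whose pair ID is also a foreign single read.

    `foreign_ids` is a set of pair IDs; `genome_starts` maps a pair ID to
    the start position of the aligned mate on the reference (assumed layout).
    """
    return {rid: pos for rid, pos in genome_starts.items() if rid in foreign_ids}


def support_profile(starts, read_len):
    """Count how many reads cover each reference position."""
    support = Counter()
    for start in starts:
        for pos in range(start, start + read_len):
            support[pos] += 1
    return support


def find_trough(support):
    """Return (first, last) position of the lowest support between the peaks."""
    lo, hi = min(support), max(support)
    depth = [support.get(p, 0) for p in range(lo, hi + 1)]
    half = max(depth) / 2
    strong = [i for i, d in enumerate(depth) if d >= half]
    left, right = strong[0], strong[-1]
    window = depth[left:right + 1]
    lowest = min(window)
    first = window.index(lowest)
    last = first
    # Extend over the run of equally low positions.
    while last + 1 < len(window) and window[last + 1] == lowest:
        last += 1
    return lo + left + first, lo + left + last
```

For example, with 5 bp reads starting at 100, 101 and 102 on one side and 110, 111 and 112 on the other, the profile has two peaks of support 3 and an uncovered gap, and `find_trough` reports that gap as the candidate insertion range.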
In the above embodiments, whether a sequencing fragment is normal or abnormal is determined from the set length and its tolerance range. According to one embodiment of the present invention, before alignment, the deviation (SD value) of the insert fragments, that is, of the sequencing fragments, is first calculated to determine a suitable insert fragment range, thereby improving the accuracy of the alignment and giving reasonable alignment results. The lengths of the insert fragments and the number of occurrences of each length are counted, and the length with the highest frequency is denoted M; an insert fragment length not equal to M is denoted x and its number of occurrences n, and the SD value is calculated with the following formula:
$$ sd = \sqrt{\frac{\sum_{i=1}^{r} n_i (x_i - M)^2}{\sum_{i=1}^{r} n_i}} $$
The distribution of insert fragment lengths should approximately follow a normal distribution whose peak is the mean insert fragment length, and the insert fragment length has an upper and a lower tolerance limit. From the mean insert fragment length and the occurrence counts of the insert fragments shorter than the mean length, a left SD value (L-SD) is calculated, and the lower limit is obtained as lower limit = (mean length - L-SD value). From the mean insert fragment length and the occurrence counts of the insert fragments longer than the mean length, a right SD value (R-SD) is calculated, and the upper limit is obtained as upper limit = (R-SD value + mean length). Insert fragment lengths between the lower and upper limits form the reasonable insert fragment range.
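The left and right SD values and the resulting insert fragment range can be sketched as below. Two assumptions are made for this illustration: the most frequent length M is used as the center (for a near-normal distribution it coincides with the mean length), and the SD is taken as the root of the count-weighted mean squared deviation so that it has units of bp and can be added to or subtracted from a length. The function names are hypothetical.

```python
import math
from collections import Counter


def one_sided_sd(counts, center, side):
    """Root of the count-weighted mean squared deviation on one side of center.

    `counts` maps an insert length x to its number of occurrences n.
    `side` is "left" (x < center) or "right" (x > center).
    """
    if side == "left":
        items = [(x, n) for x, n in counts.items() if x < center]
    else:
        items = [(x, n) for x, n in counts.items() if x > center]
    total = sum(n for _, n in items)
    if total == 0:
        return 0.0
    return math.sqrt(sum(n * (x - center) ** 2 for x, n in items) / total)


def insert_size_range(lengths):
    """Return (lower limit, upper limit) of the reasonable insert range."""
    counts = Counter(lengths)
    center = counts.most_common(1)[0][0]  # most frequent length, M
    l_sd = one_sided_sd(counts, center, "left")
    r_sd = one_sided_sd(counts, center, "right")
    return center - l_sd, center + r_sd
```

Computing the two sides separately lets the range follow a skewed length distribution, where a single symmetric SD would be too wide on one side and too narrow on the other.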
Fig. 3 is a flowchart of one application example of the detection method of the present invention. This application example simulates the insertion and sequencing of a foreign fragment; the biological sample to be tested is a transgenic Arabidopsis with an inserted foreign gene fragment. A 10 kb foreign fragment is randomly inserted into the Arabidopsis reference genome to serve as the Arabidopsis transgenic sample, the Maq simulate software is used to perform simulated sequencing of this sample, and the results are used as the sequencing data.
Transgenic foreign fragment: a 10 kb fragment of the mouse genome. Source database: UCSC Genome Browser, URL: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/
Reference genome sequence: Arabidopsis genome. Source database: Ensembl Genome Browser, URL: http://plants.ensembl.org/Arabidopsis_thaliana/Info/Index
A foreign fragment 10 kb in length is inserted into the genome in simulation, and simulated sequencing is then performed. The simulation program is maq simulate, which requires the following parameters: -d, -N, -1, -2, fq1, fq2 and simupars.dat. The parameters are described in detail below. The -d parameter is the sequencing fragment length, set to 500. The -N parameter is the total number of paired short reads to be obtained by sequencing; it is determined from the sequencing depth, one of the indicators for evaluating sequencing quality, which is the ratio of the total number of bases (bp) obtained by sequencing to the genome size. It is calculated with the formula N = sequencing depth × total reference genome length / (2 × read length). In this example the simulated sequencing depth is 20×, the total reference genome length is 121M and the short read length is set to 75 bp, so -N is set to 16M. The -1 and -2 parameters are the lengths of short read 1 and short read 2 in paired-end resequencing, both set to 75 in this example. fq1 and fq2 are the output files: the simulated sequencing data, short reads 1 and short reads 2, are stored in fastq format in the fq1 and fq2 files respectively. simupars.dat is the system file of the maq simulate software, which determines the length and quality values of the short reads.
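The formula for N can be checked with a short calculation. The helper name below is hypothetical; the inputs are the values given in this example (depth 20, genome length 121M, read length 75 bp).

```python
def read_pairs_needed(depth, genome_length, read_length):
    """N = sequencing depth x genome length / (2 x read length), rounded."""
    return round(depth * genome_length / (2 * read_length))


# Values from the application example: 20x depth, 121 Mb genome, 75 bp reads.
n_pairs = read_pairs_needed(20, 121_000_000, 75)
print(n_pairs)
```

The result is about 16.1 million, in line with the value of roughly 16 million used for -N in this example.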
As shown in Fig. 3, in step S301 the sequencing data are received and preprocessed (because simulated data are used here, no preprocessing is performed) and stored in fastq format.
Step S302 comprises two parts:
(1) SOAP alignment of the sequencing data against the exogenous insert sequence;
(2) SOAP alignment of the sequencing data against the reference genome sequence.
Both SOAP alignments require the following parameters: -p, -a, -b, -D, -o, -2, -u, -m, -x, -s, -l, -v. The parameters are as follows. The -p parameter specifies the memory required when the script runs. For paired-end sequencing, the -a parameter specifies the input fq1 file obtained by resequencing (the file containing read 1), and the -b parameter specifies the input fq2 file obtained by resequencing (the file containing read 2). The -D parameter specifies the reference sequence, supplied in fasta format (the first line of a fasta file is a free-text description beginning with a greater-than sign ">" or a semicolon ";" and serves to label the sequence; the sequence itself starts on the second line and may use only the established nucleotide or amino acid codes). There are three output parameters. The -o parameter outputs the read pairs aligned to the reference genome or the exogenous fragment, in a file with the suffix .soap. The -2 parameter outputs the pairs in which only one read aligned to the reference genome or the exogenous fragment, in a file with the suffix .single. The -u parameter outputs the read pairs that did not align to the reference genome or the exogenous fragment, in a file with the suffix .unmap. The -t parameter is left unset so that the original ID numbers of the reads are retained. The -m and -x parameters define the permitted range of the insert size: -m is the lower limit of the fluctuation of the sequenced fragment length, i.e. (negative percentage × sequenced fragment length), and -x is the upper limit, i.e. (positive percentage × sequenced fragment length). In the present invention, in order to find as many qualifying reads as possible, this range is relaxed, and -m and -x are set to the sequenced fragment length ± 0.88 × sequenced fragment length, respectively. The -s parameter is the minimum alignment length and is set to 40. The -l parameter is the length of the seed used in the initial alignment (long reads have a high error rate at the 3' end, so a sequence of a given length starting from the 5' end is used as the seed) and is set to 32. The -v parameter is the maximum number of mismatches allowed in one read during alignment; in the present invention it is set as small as possible to ensure accurate alignment. Note that the SOAP parameter settings must be kept consistent.
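As a worked example of the -m and -x setting, the sketch below computes the lower and upper bounds for a fragment length of 500 (the -d value used in the simulation) and a tolerance of 0.88. The helper name is hypothetical, and rounding to whole bases is an assumption of this sketch.

```python
from typing import Tuple


def insert_size_bounds(fragment_length: int, tolerance: float) -> Tuple[int, int]:
    """Return (lower, upper) = fragment_length -/+ tolerance * fragment_length."""
    lower = round(fragment_length * (1 - tolerance))
    upper = round(fragment_length * (1 + tolerance))
    return lower, upper


# Fragment length 500 with the relaxed +/- 0.88 range described in the text.
min_insert, max_insert = insert_size_bounds(500, 0.88)
print(min_insert, max_insert)
```

With these inputs the bounds are 60 and 940, so only pairs whose apparent insert size falls outside this wide window are treated as abnormal.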
In step S303, the short reads from the alignment of the sequencing data against the exogenous insert sequence are extracted and stored in single file 1.
In step S304, the short reads from the alignment of the sequencing data against the reference genome are extracted and stored in single file 2.
In step S305, the short reads in each single file are processed: paired reads are removed from the single file and only single reads with an alignment count of 1 are retained, to ensure that the read alignments are specific. Paired reads appear in the single file because a permitted range for the sequenced fragment length is set during alignment, for example ± X × insert length; abnormal pairs in the alignment result that fall outside this range are also written to the single file. Therefore, to ensure that the read alignments are specific, the paired reads in the single file must be removed and only single reads with an alignment count of 1 retained.
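The filtering in step S305 can be sketched as follows. This is a minimal illustration, not the script used in the patent: the record layout (`SingleHit` and its fields) is a hypothetical simplification of a single file, and the example records are invented.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class SingleHit(NamedTuple):
    """One record from a single file (hypothetical, simplified layout)."""
    pair_id: str  # ID shared by both reads of a pair
    end: int      # 1 for read 1, 2 for read 2
    n_hits: int   # number of places this read aligned
    chrom: str
    pos: int      # start position of the alignment


def keep_unique_single_reads(hits: List[SingleHit]) -> List[SingleHit]:
    """Drop pairs with both ends present; keep reads that aligned exactly once."""
    ends_seen: Dict[str, Set[int]] = defaultdict(set)
    for hit in hits:
        ends_seen[hit.pair_id].add(hit.end)
    return [
        hit for hit in hits
        if len(ends_seen[hit.pair_id]) == 1 and hit.n_hits == 1
    ]


records = [
    SingleHit("pair1", 1, 1, "chr1", 1000),  # kept: one end only, one hit
    SingleHit("pair2", 1, 1, "chr1", 2000),  # dropped: its mate is also present
    SingleHit("pair2", 2, 1, "chr1", 3500),
    SingleHit("pair3", 2, 3, "chr2", 500),   # dropped: aligns in three places
]
kept = keep_unique_single_reads(records)
print([hit.pair_id for hit in kept])
```

Only `pair1` survives in this example, because it is the only record that is both unpaired within the file and uniquely aligned.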
In step S306, the single reads in single file 2 obtained in step S305 are sorted by sample number and separated by chromosome.
In step S307, to determine where on the reference genome the single reads aligned to the exogenous fragment belong, single file 1 is intersected with the sorted single reads of single file 2 obtained in step S306, which gives the positions of the intersecting reads on the reference genome.
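One way to picture the intersection in step S307 is as a join on the read-pair ID: a pair appears in the intersection when one of its reads is a single read against the exogenous fragment and the other is a single read against the reference genome. The sketch below assumes this interpretation; the function name, the data layout and the example IDs are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def intersect_by_pair_id(
    exogenous_ids: Set[str],
    genome_hits: Dict[str, Tuple[str, int]],
) -> List[Tuple[str, int, str]]:
    """Return (chrom, pos, pair_id) for pairs present in both single files."""
    shared = exogenous_ids.intersection(genome_hits)
    anchors = [(genome_hits[pid][0], genome_hits[pid][1], pid) for pid in shared]
    return sorted(anchors)  # ordered by chromosome, then position


# Pair IDs found in single file 1 (aligned to the exogenous fragment).
exogenous_ids = {"pair1", "pair4", "pair9"}
# Pair ID -> (chromosome, position) from single file 2 (aligned to the genome).
genome_hits = {
    "pair1": ("chr1", 1000),
    "pair4": ("chr1", 1200),
    "pair7": ("chr2", 50),
}
anchors = intersect_by_pair_id(exogenous_ids, genome_hits)
print(anchors)
```

Here `pair1` and `pair4` are reported with their genome positions, while `pair7` and `pair9` are discarded because they occur in only one of the two files.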
In step S308, the intersecting reads are sorted by their positions on the reference genome, the read length is added to each position, and the read support at each site of the reference sequence is counted.
A peak of read enrichment appears on each side of the insertion site. Toward the insertion site the read support gradually decreases, and a gap in the reads appears near the insertion site. The extent of this gap is defined as the insertion position of the exogenous fragment (the position inside the dashed circle indicated by the arrow in Fig. 4).
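The support counting and gap search can be sketched as below. This is an illustrative reading of the text, assuming that "support" means the number of intersecting reads covering each base and that the gap is the longest run of zero support lying between covered positions; the function names, the region limits and the read start positions are invented for the example.

```python
from typing import List, Optional, Tuple


def support_profile(starts: List[int], read_length: int,
                    region_start: int, region_end: int) -> List[int]:
    """Per-base read support over [region_start, region_end)."""
    size = region_end - region_start
    diff = [0] * (size + 1)  # difference array: +1 at read start, -1 at read end
    for start in starts:
        lo = max(start, region_start) - region_start
        hi = min(start + read_length, region_end) - region_start
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    profile, running = [], 0
    for delta in diff[:size]:
        running += delta
        profile.append(running)
    return profile


def find_gap(profile: List[int], region_start: int) -> Optional[Tuple[int, int]]:
    """Longest zero-support run flanked by support on both sides, as [start, end)."""
    covered = [i for i, depth in enumerate(profile) if depth > 0]
    if not covered:
        return None
    best, run_start = None, None
    for i in range(covered[0], covered[-1] + 1):
        if profile[i] == 0:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    if best is None:
        return None
    return region_start + best[0], region_start + best[1]


# Invented start positions of 75 bp reads on either side of an insertion site.
starts = [100, 110, 120, 260, 270, 280]
profile = support_profile(starts, read_length=75, region_start=0, region_end=400)
gap = find_gap(profile, region_start=0)
print(gap)
```

In this toy input the left cluster of reads ends at position 195 and the right cluster begins at 260, so the reported gap spans those two coordinates.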
The present invention can determine fairly accurately a range containing the insertion site of the exogenous fragment (from a few bases to more than 100 bases). Although it cannot locate the insertion site exactly, the exact site can be found in combination with a conventional PCR (polymerase chain reaction) experiment. Verified by a large number of simulation experiments, the detection efficiency of the invention is high: the recall rate is 92.15% ± 0.01%, the false positive rate is 3.87% ± 0.013%, and the false negative rate is 4.4% ± 0.05%. The present invention detects insertion sites by bioinformatic means, with a short turnaround and low cost, and addresses the low detection efficiency and high cost of current purely experimental methods.
Fig. 5 is a structural diagram of one embodiment of the transgenic exogenous-fragment insertion site detection system of the present invention. As shown in Fig. 5, the system comprises a sequencing unit 51, an exogenous single-read determining unit 52, a genome single-read determining unit 53 and an insertion site determining unit 54. The sequencing unit 51 obtains paired short reads by paired-end resequencing of the sequenced fragments of the sample under test. The exogenous single-read determining unit 52 aligns the paired short reads against the exogenous insert sequence and determines the exogenous single reads. The genome single-read determining unit 53 aligns the paired short reads against the reference genome sequence and determines the genome single reads. The insertion site determining unit 54 determines the insertion site of the exogenous fragment in the genome sequence from the intersection of the exogenous single reads and the genome single reads. Here, the exogenous single reads comprise paired short reads in which only one read aligns to the exogenous insert sequence, and the genome single reads comprise paired short reads in which only one read aligns to the reference genome sequence. According to one embodiment of the present invention, the exogenous single reads comprise normal paired short reads in which only one read aligns, and aligns only once, to the exogenous insert sequence; the genome single reads comprise normal paired short reads in which only one read aligns, and aligns only once, to the reference genome sequence.
According to one embodiment of the present invention, the length of the sequenced fragments of the sample under test is 170-500 bp, 500-1000 bp, 1000-2000 bp or 2000-10000 bp; the length of the short reads is 40-75 bp or 75-200 bp; and the total number of bases obtained by sequencing the fragments of the sample is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence.
Fig. 6 is a structural diagram of another embodiment of the transgenic exogenous-fragment insertion site detection system of the present invention. Compared with Fig. 5, the system of this embodiment further comprises a filtering unit 65 for filtering the paired short reads to remove unqualified reads. For example, the filtering unit 65 removes paired short reads in which the number of bases with sequencing quality below a predetermined threshold exceeds 50% of the bases in the read; removes paired short reads in which the number of bases with an uncertain sequencing result exceeds 10% of the bases in the read; and removes adapter sequences from the paired short reads.
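The two percentage-based filters can be sketched as follows. This is an illustration under stated assumptions: the text does not give the quality threshold, so `min_quality=20` is a placeholder; uncertain bases are assumed to be written as "N"; a pair is dropped if either of its reads fails; and adapter removal is not shown.

```python
from typing import Sequence, Tuple

Read = Tuple[str, Sequence[int]]  # (bases, per-base quality scores)


def read_passes(bases: str, quals: Sequence[int], min_quality: int,
                max_low_fraction: float = 0.5,
                max_n_fraction: float = 0.1) -> bool:
    """Reject a read with too many low-quality bases or too many N bases."""
    if not bases or len(bases) != len(quals):
        return False
    low_quality = sum(1 for q in quals if q < min_quality)
    uncertain = bases.upper().count("N")
    return (low_quality <= max_low_fraction * len(bases)
            and uncertain <= max_n_fraction * len(bases))


def pair_passes(read1: Read, read2: Read, min_quality: int) -> bool:
    """Keep a pair only if both of its reads pass (an assumption of this sketch)."""
    return read_passes(*read1, min_quality) and read_passes(*read2, min_quality)


good = ("ACGTACGTAC", [30] * 10)
noisy = ("ACGTACGTAC", [30] * 4 + [5] * 6)  # 6 of 10 bases below the threshold
many_n = ("NNGTACGTAC", [30] * 10)          # 2 of 10 bases uncertain

print(pair_passes(good, good, min_quality=20))
print(pair_passes(good, noisy, min_quality=20))
print(pair_passes(many_n, good, min_quality=20))
```

Only the first pair is kept; the second fails the 50% low-quality limit and the third fails the 10% uncertain-base limit.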
In the above embodiment, removing unqualified short reads with the filtering unit improves the accuracy of detection.
For the functions of the devices or units in Fig. 5 and Fig. 6, refer to the description of the corresponding parts of the method embodiments above; for brevity, they are not described again here.
The detection method and system provided by the present invention are based on whole-genome resequencing technology. They solve the problem that the four existing variation detection techniques cannot accurately detect variant sites and that other experimental techniques cannot resolve them. They are accurate, simple, fast and low in cost, and provide a scientific, accurate and reliable means of detecting large-fragment variation in the detection and supervision of genetically modified organisms and their products.
Those skilled in the art will understand that each device in Fig. 5 and Fig. 6 may be implemented by a separate computing device, or the devices may be integrated into a single device. In Fig. 5 and Fig. 6 the devices are shown as blocks to illustrate their functions. These functional blocks may be implemented in hardware, software, firmware, middleware, microcode, hardware description languages or any combination thereof. For example, one or two functional blocks may be implemented with code running on a microprocessor, a digital signal processor (DSP) or any other suitable computing device. The code may represent any combination of procedures, functions, subprograms, programs, routines, subroutines, modules, or instructions, data structures or program statements. The code may reside in a computer-readable medium. The computer-readable medium may comprise one or more storage devices, including, for example, RAM, flash memory, ROM, EPROM, EEPROM, registers, hard disks, removable hard disks, CD-ROM or any other form of storage medium known in the art. The computer-readable medium may also comprise a carrier wave that encodes a data signal.
Those skilled in the art will recognize the interchangeability of hardware, firmware and software configurations in these cases, and how best to implement the described functions for each particular application.

Claims (8)

1. A method for detecting the insertion site of a transgenic exogenous fragment, characterized by comprising:
aligning paired short reads against an exogenous insert sequence and determining exogenous single reads, the paired short reads being obtained by paired-end resequencing of the sequenced fragments of a sample under test;
aligning the paired short reads against a reference genome sequence and determining genome single reads;
determining the insertion site of the exogenous fragment in the genome sequence from the intersection of the exogenous single reads and the genome single reads,
wherein the exogenous single reads comprise paired short reads in which only one read aligns to the exogenous insert sequence;
the genome single reads comprise paired short reads in which only one read aligns to the reference genome sequence.
2. The detection method according to claim 1, characterized in that the exogenous single reads comprise normal paired short reads in which only one read aligns, and aligns only once, to the exogenous insert sequence; and the genome single reads comprise normal paired short reads in which only one read aligns, and aligns only once, to the reference genome sequence.
3. The detection method according to claim 1, characterized by further comprising:
filtering the paired short reads to remove unqualified short reads,
wherein filtering the paired short reads to remove unqualified paired short reads comprises:
removing paired short reads in which the number of bases with sequencing quality below a predetermined threshold exceeds 50% of the bases in the read;
and/or
removing paired short reads in which the number of bases with an uncertain sequencing result exceeds 10% of the bases in the read;
and/or
removing adapter sequences from the paired short reads.
4. The detection method according to claim 1, characterized in that the length of the sequenced fragments of the sample under test is 170-500 bp, 500-1000 bp, 1000-2000 bp or 2000-10000 bp;
and/or
the length of the short reads is at least one selected from 40-75 bp and 75-200 bp;
and/or
the total number of bases obtained by sequencing the fragments of the sample under test is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence.
5. A system for detecting the insertion site of a transgenic exogenous fragment, characterized by comprising:
a sequencing unit for obtaining paired short reads by paired-end resequencing of the sequenced fragments of a sample under test;
an exogenous single-read determining unit for aligning the paired short reads against an exogenous insert sequence and determining exogenous single reads;
a genome single-read determining unit for aligning the paired short reads against a reference genome sequence and determining genome single reads; and
an insertion site determining unit for determining the insertion site of the exogenous fragment in the genome sequence from the intersection of the exogenous single reads and the genome single reads.
6. The detection system according to claim 5, characterized in that the exogenous single reads comprise paired short reads in which only one read aligns to the exogenous insert sequence;
the genome single reads comprise paired short reads in which only one read aligns to the reference genome sequence,
wherein the exogenous single reads comprise normal paired short reads in which only one read aligns, and aligns only once, to the exogenous insert sequence; and the genome single reads comprise normal paired short reads in which only one read aligns, and aligns only once, to the reference genome sequence.
7. The detection system according to claim 6, characterized by further comprising:
a filtering unit for filtering the paired short reads to remove unqualified short reads,
wherein the filtering unit removes paired short reads in which the number of bases with sequencing quality below a predetermined threshold exceeds 50% of the bases in the read;
and/or
removes paired short reads in which the number of bases with an uncertain sequencing result exceeds 10% of the bases in the read;
and/or
removes adapter sequences from the paired short reads.
8. The detection system according to claim 6, characterized in that the length of the sequenced fragments of the sample under test is 170-500 bp, 500-1000 bp, 1000-2000 bp or 2000-10000 bp;
and/or
the length of the short reads is at least one selected from 40-75 bp and 75-200 bp;
and/or
the total number of bases obtained by sequencing the fragments of the sample under test is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence.
CN201180062256.XA 2011-01-20 2011-01-20 Method and system for detecting the insertion sites of transgenic foreign fragments Active CN103270175B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/000095 WO2012097474A1 (en) 2011-01-20 2011-01-20 Method and system for detecting the insertion sites of transgenic foreign fragments

Publications (2)

Publication Number Publication Date
CN103270175A CN103270175A (en) 2013-08-28
CN103270175B true CN103270175B (en) 2015-06-24

Family

ID=46515067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201180062256.XA Active CN103270175B (en) 2011-01-20 2011-01-20 Method and system for detecting the insertion sites of transgenic foreign fragments

Country Status (2)

Country Link
CN (1) CN103270175B (en)
WO (1) WO2012097474A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2654575C2 (en) * 2013-05-15 2018-05-21 БиДжиАй Дженомикс Ко., Лтд. Method for detecting chromosomal structural abnormalities and device therefor
CN103993069B (en) * 2014-03-21 2020-04-28 深圳华大基因科技服务有限公司 Virus integration site capture sequencing analysis method
CN107630079B (en) * 2016-07-19 2020-07-28 中国农业科学院作物科学研究所 Method for determining the sequence, insertion position and border sequence of foreign DNA fragments in transgenic organisms
CN108959853B (en) * 2018-05-18 2020-01-17 广州金域医学检验中心有限公司 Analysis method, analysis device, equipment and storage medium for copy number variation
CN110556165B (en) 2019-09-12 2022-03-18 浙江大学 Method for rapidly identifying transgene or gene editing material and insertion site thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999053095A2 (en) * 1998-04-09 1999-10-21 Whitehead Institute For Biomedical Research Biallelic markers
CN101646782A (en) * 2007-01-29 2010-02-10 科学公共卫生研究所(Iph) The transgenic plant event detection
CN101914628A (en) * 2010-09-02 2010-12-15 深圳华大基因科技有限公司 Method and system for detecting polymorphism locus of genome target region

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
转基因食品的检测方法;邵素琴;《农药科学与管理》;20021231;第23卷(第3期);全文 *

Also Published As

Publication number Publication date
CN103270175A (en) 2013-08-28
WO2012097474A1 (en) 2012-07-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and system for detecting the insertion sites of transgenic foreign fragments

Effective date of registration: 20170213

Granted publication date: 20150624

Pledgee: Bank of China Limited by Share Ltd. Shenzhen East Branch

Pledgor: BGI SHENZHEN Co.,Ltd.

Registration number: 2017990000100

PLDC Enforcement, change and cancellation of contracts on pledge of patent right or utility model
PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20171201

Granted publication date: 20150624

Pledgee: Bank of China Limited by Share Ltd. Shenzhen East Branch

Pledgor: BGI SHENZHEN Co.,Ltd.

Registration number: 2017990000100

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and system for detecting the insertion sites of transgenic foreign fragments

Effective date of registration: 20171201

Granted publication date: 20150624

Pledgee: Bank of China Limited by Share Ltd. Shenzhen East Branch

Pledgor: BGI SHENZHEN Co.,Ltd.

Registration number: 2017440020061

PC01 Cancellation of the registration of the contract for pledge of patent right
PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20181026

Granted publication date: 20150624

Pledgee: Bank of China Limited by Share Ltd. Shenzhen East Branch

Pledgor: BGI SHENZHEN Co.,Ltd.

Registration number: 2017440020061

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and system for detecting the insertion sites of transgenic foreign fragments

Effective date of registration: 20181029

Granted publication date: 20150624

Pledgee: Bank of China Limited by Share Ltd. Shenzhen East Branch

Pledgor: BGI SHENZHEN Co.,Ltd.

Registration number: 2018440020067

PE01 Entry into force of the registration of the contract for pledge of patent right
PC01 Cancellation of the registration of the contract for pledge of patent right
PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20191225

Granted publication date: 20150624

Pledgee: Bank of China Limited by Share Ltd. Shenzhen East Branch

Pledgor: BGI SHENZHEN Co.,Ltd.

Registration number: 2018440020067

PE01 Entry into force of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and system for detecting the insertion sites of transgenic foreign fragments

Effective date of registration: 20191227

Granted publication date: 20150624

Pledgee: Bank of China Limited by Share Ltd. Shenzhen East Branch

Pledgor: BGI SHENZHEN Co.,Ltd.

Registration number: Y2019980001361

PC01 Cancellation of the registration of the contract for pledge of patent right

Granted publication date: 20150624

Pledgee: Bank of China Limited by Share Ltd. Shenzhen East Branch

Pledgor: BGI SHENZHEN Co.,Ltd.

Registration number: Y2019980001361