WO2012034251A2 - A genomic structural variation detection method and *** - Google Patents

Info

Publication number
WO2012034251A2
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
variation
sequencing
alignment
information
Prior art date
Application number
PCT/CN2010/001409
Other languages
English (en)
French (fr)
Inventor
罗锐邦
邵浩靖
林浩翔
Original Assignee
深圳华大基因科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因科技有限公司
Priority to CN201080068345.0A priority Critical patent/CN103080333B/zh
Priority to PCT/CN2010/001409 priority patent/WO2012034251A2/zh
Publication of WO2012034251A2 publication Critical patent/WO2012034251A2/zh

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C12Q2535/00 Reactions characterised by the assay type for determining the identity of a nucleotide base or a sequence of oligonucleotides
    • C12Q2535/122 Massive parallel sequencing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B30/20 Sequence assembly

Definitions

  • The invention relates to the technical field of bioinformatics, in particular to a method and system for detecting structural variation (SV) of a genome.
  • Background
  • Structural variation (SV) plays an important role in the genome and may lead to changes in an individual's gene coding and gene function.
  • Biologists have identified a large number of candidate regions of the genome associated with human disease through genetic linkage or association analysis. However, identifying the disease-causing genes or mutations in these regions requires resequencing them.
  • Existing genome-wide resequencing analysis techniques are costly, and for some studies and for individual medical guidance much of the information they produce is redundant. To obtain useful information more efficiently, concentrating existing genetic analysis techniques on high-value areas of genetic research is of great significance for scientific research and medical guidance.
  • One technical problem to be solved by one aspect of the present disclosure is to provide a genomic structural variation detection method with higher accuracy.
  • One aspect of the present disclosure provides a method for detecting genomic structural variation, comprising:
  • an assembling step, in which the sequencing sequences are assembled into a skeleton sequence (scaffold);
  • a comparison step, in which the skeleton sequence is globally pairwise aligned with the reference genome to obtain a comparison result containing the variation information;
  • an extraction step, in which the variation information is extracted from the comparison result containing the variation information.
  • Before the assembling step, the method further comprises an optimizing step of optimizing the sequencing sequences by aligning them to the reference genome to obtain optimized sequencing sequences;
  • the assembling step then includes assembling the optimized sequencing sequences into a skeleton sequence.
  • The method further includes a verifying step of verifying the extracted variation information and removing the variation information that fails verification.
  • The verifying step comprises:
  • for variations shorter than 50 bp in the variation information, constructing a variant sequence and performing a gapped alignment between the sequencing sequences and the variant sequence with a short-sequence alignment tool. If the alignment result conforms to the theoretically expected alignment result, the variation passes verification; otherwise it fails verification and is removed.
  • the extracting step further comprises:
  • the comparison result containing the variation information is processed as follows:
  • the optimizing step comprises:
  • the sequencing sequence is aligned to the reference genome by a short sequence alignment tool to obtain an alignment sequence
  • the optimization steps also include:
  • the repeated sequencing sequence is removed by a short sequence alignment tool
  • sequencing sequences whose average quality in the aligned sequences is below a predetermined value are removed.
  • the assembling step includes:
  • the paired-end relationships obtained by sequencing are used to construct a skeleton sequence from the contigs; gaps in the skeleton sequence are then filled to obtain the final skeleton sequence.
  • The skeleton sequence is obtained and compared with the reference genome, so that an individual-specific genome that does not depend on the reference genome is obtained with high accuracy.
  • One technical problem to be solved by another aspect of the present disclosure is to provide a genomic structural variation detection system with higher accuracy.
  • A genomic structural variation detection system comprising:
  • an assembly device for assembling sequencing sequences into a skeleton sequence (scaffold);
  • a comparison device configured to perform a global pairwise alignment of the skeleton sequence against the reference genome to obtain a comparison result containing the variation information;
  • an extracting device for extracting the variation information from the comparison result containing the variation information.
  • system further comprises:
  • the assembly device is used to assemble the optimized sequencing sequences into a skeleton sequence.
  • system further comprises:
  • the verification device is configured to verify the extracted variation information and remove the variation information that has not passed the verification.
  • For a variation of 50 bp or longer in the variation information, the verification device determines whether its repeatability is less than 10%. If so, it constructs a variant sequence and aligns the sequencing sequences to the variant sequence; if the alignment depth of the variant sequence conforms to the theoretically expected distribution, the variation passes verification, otherwise it fails verification and is removed. If the repeatability is greater than or equal to 10%, the device determines whether the extension sequence of the variation site is free of repeats; if so, it constructs the variant sequence and aligns the sequencing sequences to it, and the variation passes verification if the alignment depth of the extension sequence conforms to the theoretically expected distribution, otherwise it is removed.
  • For a variation shorter than 50 bp in the variation information, the device constructs a variant sequence and uses a short-sequence alignment tool to perform a gapped alignment between the sequencing sequences and the variant sequence. If the alignment result conforms to the theoretically expected alignment result, the variation passes verification; otherwise it fails verification and is removed.
  • the extraction device comprises:
  • a variation information filtering unit for filtering or re-running the abnormal results in the comparison result containing the variation information; and/or filtering the logical error results; and/or removing the uncommon result, and outputting the filtered comparison result;
  • a variation information extraction unit for extracting the variation information from the filtered comparison result output by the variation information filtering unit.
  • the optimization device comprises:
  • a comparison unit configured to align the sequencing sequences to the reference genome to obtain aligned sequences;
  • a filtering unit configured to filter the aligned sequences and remove the sequences whose average quality in the comparison result is lower than a predetermined value.
  • the assembly apparatus includes:
  • a graph construction unit for constructing a de Bruijn graph after cutting the optimized sequencing sequences into N-mers;
  • a cutting unit for outputting the ring structures in the de Bruijn graph and cutting the de Bruijn graph into a plurality of contigs and heterozygous sequences;
  • a skeleton construction unit configured to construct a skeleton sequence from the plurality of contigs by using the paired-end relationships obtained by sequencing, and to fill the gaps in the skeleton sequence to obtain the final skeleton sequence.
  • The whole genome sequencing result is assembled by the assembly device to obtain a skeleton sequence, and the skeleton sequence and the reference genome are globally compared by the comparison device, so that an individual-specific genome that does not depend on the reference genome is obtained with high accuracy.
  • Figure 1 is a flow chart showing one embodiment of the genomic structural variation detecting method of the present invention
  • Figure 2 is a flow chart showing another embodiment of the method for detecting genomic structural variation of the present invention.
  • Figure 3 is a flow chart showing still another embodiment of the method for detecting genomic structural variation of the present invention.
  • Figure 4 is a structural diagram showing one embodiment of the genomic structural variation detecting system of the present invention.
  • Figure 5 is a view showing the structure of another embodiment of the genomic structural variation detecting system of the present invention.
  • Fig. 6 is a view showing the configuration of still another embodiment of the genomic structural variation detecting system of the present invention.
  • Detailed description
  • the method and system for detecting structural variation based on assembly is a method for performing a series of biological information analysis on genomic DNA sequence information and a related analysis tool, aiming at solving the problem of imperfect genomic bioinformatics analysis methods and tools.
  • Figure 1 is a flow chart showing one embodiment of the genomic structural variation detecting method of the present invention.
  • Step 102, the assembly step.
  • The sequencing sequences are assembled into a scaffold sequence (scaffold).
  • For example, the sequencing sequences are cut into N-mers and a de Bruijn graph is constructed; the partial ring structures in the de Bruijn graph are output, and the de Bruijn graph is cut into a plurality of contigs and heterozygous sequences; the contigs are then processed using the paired-end relationships obtained by sequencing to construct the skeleton sequence.
  • Gaps in the skeleton sequence are filled with the base "N" to obtain the final skeleton sequence.
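  • The graph construction in the assembly step can be illustrated with a minimal sketch. It is not the SOAPdenovo implementation: the function names `build_de_bruijn` and `extend_contig` are hypothetical, error correction and ring (bubble) handling are omitted, and a contig is simply extended while the path through the graph is unambiguous.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: each (k-1)-mer maps to the set of (k-1)-mers that follow it."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(set)
    for kmer in kmers:
        # An edge joins the k-mer's prefix node to its suffix node.
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk forward from `start` while each node has one successor and that successor one predecessor."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    contig, node, visited = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in visited:
            break  # branch point or cycle: stop the contig here
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(reads, k=4)
print(extend_contig(graph, "ATG"))
```

  • In this toy input the three overlapping reads collapse into a single unambiguous path, so the walk recovers one contig; real data contain branches and rings, which is where the cutting described above comes in.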
  • Step 104 the comparison step.
  • the skeleton sequence is globally pairwise aligned with the reference genome to obtain a comparison result containing the variation information.
  • The assembly results obtained in step 102 are globally aligned to the reference genome using long-sequence alignment software.
  • The long-sequence alignment software is, for example, LASTZ; see the reference [Harris, R. S. Improved pairwise alignment of genomic DNA. PhD thesis, Pennsylvania State University (2007)].
  • Step 106, the extraction step: the variation information is extracted from the comparison result containing the variation information.
  • The variation information includes the position of the variation site, the type of variation, and the variant sequence.
  • The whole genome sequencing results are assembled to obtain a skeleton sequence, which is compared with the reference genome, so that an individual-specific genome that does not depend on the reference genome is obtained with high accuracy.
  • Fig. 2 is a flow chart showing another embodiment of the genomic structural variation detecting method of the present invention.
  • step 202 an optimization step.
  • The sequencing sequences are optimized by aligning them to the reference genome, yielding optimized sequencing sequences.
  • The sequencing sequences are aligned to the reference genome with a sequence alignment tool to obtain aligned sequences, and the aligned sequences are then optimized (deduplication, replacement of erroneous bases, and filtering) and converted into optimized sequencing sequences.
  • The alignment of the sequencing sequences to the reference genome is performed with the BWA software, and the specific BWA parameters are "aln -e 0 -o 0".
  • The meaning of these parameters is: "aln" is the BWA sub-command that performs the alignment; "-e" controls the gap extensions allowed in the alignment
  • Deduplication of the aligned sequences refers to removing sequence regions with high repetition. For example, a sequence region such as ATCATCATCATCATC, which contains multiple copies of ATC, interferes with alignment, so such regions should be excluded.
  • Erroneous-base replacement of the aligned sequences means that, in the sequences aligned to the reference genome, all mismatched bases are replaced with the bases of the reference genome.
  • Filtering of the aligned sequences removes every sequence whose average quality value is lower than a predetermined value X; the parameter X is chosen according to the sequencing sequences.
  • The recommended value range is, for example, 10 to 20, which corresponds to an average error rate of 10% to 1%.
  • An optional value is 15. Optimizing the sequencing sequences improves the accuracy of the subsequent processing.
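  • The relationship between the quality threshold X and the error rate quoted above follows the standard Phred definition, error = 10^(-Q/10), so Q = 10 gives 10% and Q = 20 gives 1%. The sketch below illustrates the filter under stated assumptions: the function names are hypothetical, qualities are assumed to be ASCII-encoded with an offset of 33 (the actual encoding depends on the sequencing platform), and the default threshold of 15 is the optional value mentioned in the text.

```python
def phred_to_error(q):
    """Convert a Phred quality score to the corresponding error probability."""
    return 10 ** (-q / 10)


def mean_quality(qual_string, offset=33):
    """Average Phred score of an ASCII-encoded quality string (offset is an assumption)."""
    scores = [ord(ch) - offset for ch in qual_string]
    return sum(scores) / len(scores)


def filter_reads(reads, threshold=15, offset=33):
    """Keep (sequence, quality) pairs whose average quality is at least `threshold`."""
    return [(seq, qual) for seq, qual in reads
            if mean_quality(qual, offset) >= threshold]


reads = [("ACGT", "IIII"),   # average quality 40, kept
         ("ACGT", "++++")]   # average quality 10, removed at X = 15
print(filter_reads(reads))
print(phred_to_error(10), phred_to_error(20))
```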
  • Step 204 an assembly step.
  • The optimized sequencing sequences are assembled into a skeleton sequence, for example using the SOAPdenovo software developed by BGI (Huada Gene Research Institute). The specific assembly parameter is "-K 31", where the parameter "-K" sets the K-mer size.
  • For the SOAPdenovo software, see the reference: [Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res (2009)].
  • Step 206 the comparison step.
  • The skeleton sequence and the reference genome are globally pairwise aligned, and the alignment result containing the variation information is obtained.
  • LASTZ is used to globally align the skeleton sequence to the reference genome.
  • The parameter definitions can be found in the LASTZ software documentation.
  • "--chain" means chaining, and "--ambiguousN" means treating N as any of several base types.
  • "--gapped" refers to gapped alignment.
  • "--noentropy" means that high-scoring results are filtered without introducing entropy.
  • "12of19" is the 12-of-19 seed mode.
  • A seed is a 19-base sequence selected in the reference sequence according to rules set by the software. Whether the target sequence matches the seed is judged only at the 12 base positions within the seed that the software designates. If a seed region matches, the alignment extends in both directions from the seed region until it is complete, and the alignment results are output.
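  • The 12-of-19 seed described above is a spaced seed: a 19-base window in which only 12 designated positions must match. The sketch below shows the matching rule only. The bit pattern is the one believed to correspond to LASTZ's 12of19 seed and should be treated as an assumption to check against the LASTZ documentation; the function names are hypothetical, and the brute-force scan stands in for the indexed lookup a real aligner would use.

```python
# Assumed 12-of-19 pattern: '1' = position must match, '0' = position is ignored.
SEED_12_OF_19 = "1110100110010101111"


def seed_matches(pattern, ref_window, target_window):
    """Return True if the two windows agree at every position marked '1' in the pattern."""
    if not (len(pattern) == len(ref_window) == len(target_window)):
        raise ValueError("pattern and windows must have the same length")
    return all(r == t
               for p, r, t in zip(pattern, ref_window, target_window)
               if p == "1")


def find_seed_hits(pattern, reference, target):
    """Brute-force list of (reference offset, target offset) pairs where the seed matches."""
    span = len(pattern)
    hits = []
    for i in range(len(reference) - span + 1):
        for j in range(len(target) - span + 1):
            if seed_matches(pattern, reference[i:i + span], target[j:j + span]):
                hits.append((i, j))
    return hits


ref = "ACGTACGTACGTACGTACG"
tgt = ref[:3] + "A" + ref[4:]   # mismatch at an ignored ('0') position
print(seed_matches(SEED_12_OF_19, ref, tgt))
```

  • Each hit would then be extended in both directions, as the text describes; that extension step is not shown here.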
  • Step 208 an extraction step.
  • The comparison result containing the variation information is filtered, and the variation information in the filtered comparison result is extracted. Filtering includes: (1) filtering or re-running abnormal results, (2) filtering logical error results, and (3) common results are incomplete.
  • Filtering or re-running abnormal results: abnormal results in the LASTZ output are filtered out, meaningless comment parts of the LASTZ output are filtered out, and LASTZ is re-run for any output that lacks the normal end identifier.
  • Filtering logical error results: this covers the cases where one assembled sequence aligns to two or more chromosomes, or where the same position on one chromosome is aligned by two or more assembled sequences; among such results, the one with better quality is retained.
  • N: any of A, C, G, T is possible
  • -: alignment gap
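  • The logical-error filter described in this step (one assembled sequence aligned to several chromosomes, or one chromosomal position covered by several assembled sequences) can be sketched as a greedy selection. This is an illustration under assumptions, not the patent's software: the record fields `scaffold`, `chrom`, `start`, `end`, and `score` are hypothetical, "better quality" is taken to mean a higher alignment score, and the sketch keeps at most one alignment per assembled sequence, which is stricter than the text requires.

```python
def filter_conflicts(alignments):
    """Keep the best-scoring alignment among conflicting ones.

    Each alignment is a dict with keys: scaffold, chrom, start, end (half-open), score.
    """
    accepted = []
    used_scaffolds = set()
    for aln in sorted(alignments, key=lambda a: a["score"], reverse=True):
        if aln["scaffold"] in used_scaffolds:
            continue  # this assembled sequence already has a better alignment
        overlaps = any(
            a["chrom"] == aln["chrom"]
            and a["start"] < aln["end"]
            and aln["start"] < a["end"]
            for a in accepted
        )
        if overlaps:
            continue  # this chromosomal position is already covered by a better alignment
        accepted.append(aln)
        used_scaffolds.add(aln["scaffold"])
    return accepted


alignments = [
    {"scaffold": "s1", "chrom": "chr1", "start": 0, "end": 100, "score": 90},
    {"scaffold": "s1", "chrom": "chr2", "start": 0, "end": 100, "score": 50},
    {"scaffold": "s2", "chrom": "chr1", "start": 50, "end": 150, "score": 80},
    {"scaffold": "s3", "chrom": "chr1", "start": 200, "end": 300, "score": 10},
]
print([a["scaffold"] for a in filter_conflicts(alignments)])
```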
  • Step 210, the verification step.
  • The extracted variation information is verified, and the variation information that fails verification is removed.
  • The candidate variation information can be verified by various calculation methods to remove unqualified variation information, for example by the depth and sequence-cutting methods. For a variation of 50 bp or longer, a variant sequence is first constructed and the sequencing sequences are aligned to it; if the alignment depth of the variant sequence conforms to the theoretically expected distribution, the variation passes verification, otherwise it is removed. For a variation shorter than 50 bp, a variant sequence is first constructed, and sequence alignment software such as BWA is then used to perform a gapped alignment between the short sequencing sequences and the variant sequence with the alignment parameters "-e 50 -o 1 -i 5"; if the alignment result conforms to the theoretically expected alignment result, the variation passes verification, otherwise it is removed.
  • That the depth of the variant sequence conforms to the theoretically expected distribution means the following: if the target sequence is consistent with the reference sequence, the depth at each point in the region should be relatively high and the depths of the points should be close to one another; otherwise the depth is relatively low.
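  • The depth criterion just described (depth relatively high and close from point to point) can be expressed as a simple check on per-base depths. The sketch below is only one possible reading: the function name and both thresholds (`min_mean_depth`, `max_cv`) are hypothetical, since the source gives no numeric cut-offs, and uniformity is measured here by the coefficient of variation.

```python
from statistics import mean, pstdev


def depth_supports_variant(depths, min_mean_depth=10.0, max_cv=0.5):
    """Return True if per-base depths over the variant sequence are high and uniform.

    min_mean_depth and max_cv are illustrative thresholds, not values from the source.
    """
    if not depths:
        return False
    average = mean(depths)
    if average < min_mean_depth:
        return False  # depth too low: reads do not support the variant sequence
    coefficient_of_variation = pstdev(depths) / average
    return coefficient_of_variation <= max_cv


print(depth_supports_variant([30, 32, 29, 31]))  # high and even depth
print(depth_supports_variant([30, 0, 1, 29]))    # depth collapses inside the region
```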
  • The optimization step and the verification step are optional; an embodiment of the present invention may include either or both of them.
  • Optimizing the sequencing sequences improves the accuracy of the subsequent processing.
  • Verifying the genome-wide candidate structural variation set by several methods and removing the variation information that fails verification keeps the false-positive rate of the variation information low. Experiments show that the method of the embodiment of the present invention can obtain a structural variation set with a false-positive rate below 10%.
  • Fig. 3 is a flow chart showing still another embodiment of the method for detecting genomic structural variation of the present invention.
  • In step 301, BWA alignment is performed.
  • The sequencing sequences are aligned to the reference genome with the BWA software to obtain aligned sequences.
  • BWA deduplication is performed: highly repetitive sequences are eliminated with the BWA software.
  • The erroneously aligned bases are replaced with reference sequence bases, and the sequences are filtered according to quality value. All mismatched bases in the sequences aligned to the reference genome are replaced with the bases of the reference genome, and sequences having an average quality value below a predetermined value X are removed.
  • An assembly de Bruijn graph is generated.
  • The contigs and the heterozygous sequences are output according to the de Bruijn graph.
  • In step 306, a contig or heterozygous sequence is obtained.
  • Steps 307 through 309 process the contigs and the heterozygous sequences, respectively.
  • The reference sequence and the assembled result sequence are segmented, where the result sequence refers to the contigs and the heterozygous sequences.
  • The segments are then aligned pairwise.
  • The reference sequence and the result sequence are each split into multiple pieces, and each small piece from the reference sequence is aligned with a small piece from the result sequence until all small pieces have been aligned.
  • In step 309, the alignment results are filtered for logical errors, and the variation information is output.
  • In step 310, the variation information is obtained.
  • In step 311, it is determined whether the length of the variation is greater than or equal to 50 bp. If yes, the flow proceeds to step 312; otherwise, it continues to step 317.
  • Sequence repeatability is calculated. The information of a given region of the sequence is compared with the information in a repetitive sequence library to determine whether they are consistent; if so, the region is determined to be a repeat region. It is also possible that the whole sequence is a repeat region. Sequence repeatability is calculated as the ratio of the length of the repeat regions to the length of the entire sequence.
  • In step 313, it is determined whether the repeatability is less than 10%; if so, the flow continues to step 316, otherwise to step 314.
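  • The repeatability ratio and the branching in steps 311 to 313 can be written down compactly. The sketch below is illustrative: the function names and returned labels are hypothetical, repeat regions are assumed to be given as half-open (start, end) intervals already found by comparison with a repeat library, and overlapping intervals are merged so that no base is counted twice.

```python
def repeat_fraction(seq_length, repeat_intervals):
    """Fraction of the sequence covered by repeat regions (half-open intervals, overlaps merged)."""
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    covered = 0
    last_end = 0
    for start, end in sorted(repeat_intervals):
        start = max(start, last_end, 0)   # skip bases already counted
        end = min(end, seq_length)
        if end > start:
            covered += end - start
            last_end = end
    return covered / seq_length


def choose_verification(variant_length, repeatability):
    """Pick the verification route following the 50 bp and 10% thresholds in the text."""
    if variant_length < 50:
        return "gapped_alignment"      # short variation: gapped BWA alignment
    if repeatability < 0.10:
        return "variant_depth"         # long, low-repeat: depth over the variant sequence
    return "extension_depth"           # long, repetitive: depth over the extension sequence


fraction = repeat_fraction(100, [(0, 10), (5, 20)])
print(fraction, choose_verification(60, fraction))
```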
  • In step 314, it is determined whether the extension sequence of the variation site is non-repetitive; if so, the flow proceeds to step 315.
  • A variant sequence is obtained, and the sequencing sequences are aligned to it.
  • The variation is verified according to the depth feature of the extension sequence, and the variation result is output.
  • A variant sequence is obtained, and the sequencing sequences are aligned to it. If the variant sequence is correct, its alignment depth will be higher and more uniform. The variation is verified based on the depth, and the variation result is output.
  • A single-end or paired-end gapped BWA alignment result is obtained.
  • There are two types of sequencing sequences, single-end and paired-end, and BWA aligns them with different methods. For details, please see: http://bio-bwa.sourceforge.net/bwa.shtml.
  • Each variation site carries positional information. This position is located in the reference sequence, and a sequence of a certain length before and after the position is extracted and joined with the variant sequence of the variation site to form a new sequence.
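  • Constructing the new sequence from flanking reference sequence and the variant sequence is a simple string operation, sketched below. The function name and all parameters are hypothetical; in particular `ref_allele_length` (the number of reference bases the variation replaces, 0 for a pure insertion) and the default flank of 100 bases are assumptions, since the source only says "a certain length".

```python
def build_variant_sequence(reference, position, ref_allele_length, variant_seq, flank=100):
    """Join the left flank, the variant sequence, and the right flank around a variation site.

    position is 0-based; ref_allele_length reference bases starting there are replaced.
    """
    if not 0 <= position <= len(reference) - ref_allele_length:
        raise ValueError("variation site lies outside the reference sequence")
    left = reference[max(0, position - flank):position]
    right_start = position + ref_allele_length
    right = reference[right_start:right_start + flank]
    return left + variant_seq + right


reference = "AAAACCCCGGGGTTTT"
print(build_variant_sequence(reference, 8, 0, "TA", flank=4))   # insertion of TA
print(build_variant_sequence(reference, 4, 4, "", flank=4))     # deletion of CCCC
```

  • The sequencing sequences would then be aligned to this constructed sequence instead of the original reference, so that reads supporting the variation align without large gaps.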
  • Gapped BWA alignment is performed.
  • BWA is run with the "-o 1" parameter, which allows a gap when the target sequence is aligned to the reference sequence; alignments without gaps are also accepted.
  • In step 322, the verified variation is obtained according to the gap condition and the depth distribution of the alignment result, and the variation result is output.
  • Application Example 1 human exon capture sequencing.
  • This example uses the exon sequencing of sample NA12156 from the International HapMap Project (Sample No.: NA12156; download address ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX005/SRX005923).
  • The raw data comprise a total of 11346285 short sequences.
  • The sequencing results of human exon sample NA12156 were filtered and optimized against the reference genome using the BWA tool and filtering software; the filtered, optimized sequences were assembled with SOAPdenovo; and the assembly results were globally pairwise aligned to the reference genome using LASTZ.
  • The comparison results were filtered by the structural variation extraction software, and the abnormal results were removed.
  • The structural variation verification software then verified the variations by the depth and sequence-cutting methods. For variations of 50 bp or longer, it is determined whether the repeatability is less than 10%. If so, a variant sequence is constructed and the sequencing sequences are aligned to it; if the alignment depth of the variant sequence conforms to the theoretically expected distribution, the variation passes verification, otherwise it is removed. If the repeatability is greater than or equal to 10%, it is determined whether the extension sequence of the variation site is free of repeats; if so, the variant sequence is constructed and the sequencing sequences are aligned to it, and the variation passes verification if the alignment depth of the extension sequence conforms to the theoretically expected distribution, otherwise it is removed.
  • For variations shorter than 50 bp, a variant sequence is constructed and BWA is used for gapped alignment with the alignment parameters "-e 50 -o 1 -i 5"; if the alignment result conforms to the theoretically expected alignment result, the variation passes verification, otherwise it is removed. Finally, the two sets are merged to obtain the final result. The specific steps are as follows:
  • The optimized short sequences were assembled. The assembled genome size was 218030396 bp, with 3941732 assembled sequences; the longest assembled sequence was 9042 bp, N50 was 298 bp, and N90 was 122 bp.
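  • The N50 and N90 figures reported for the assemblies follow the usual definition: the length L such that sequences of length L or longer together cover at least 50% (or 90%) of the total assembled length. A minimal sketch, with a hypothetical function name and a toy list of lengths rather than the data from the examples:

```python
def nxx(lengths, fraction):
    """Return the Nxx statistic: the length at which the running total of
    lengths, taken from longest to shortest, first reaches fraction * total."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length


assembled_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(nxx(assembled_lengths, 0.5), nxx(assembled_lengths, 0.9))
```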
  • This application example uses the exon sequencing of colon cancer cells (sample number: Yv 090508).
  • The original data comprise a total of 105,972,839 short sequences (sequencing sequences).
  • The optimized short sequences were assembled.
  • The assembled genome size was 118938172 bp, and there were 253868 assembled sequences.
  • The longest assembled sequence was 16885 bp, N50 was 793 bp, and N90 was 170 bp.
  • This application example uses Vibrio parahaemolyticus (sample number: VIBydvDlOpoolingIAAPEI-9-l).
  • The original data comprise a total of 5,631,982 short sequences.
  • The assembled result has a genome size of 5056512 bp and 684 assembled sequences.
  • The longest assembled sequence is 94989 bp, N50 is 23988 bp, and N90 is 5603 bp.
  • The alignment of the assembled sequences with the reference genome contained 1442 alignment results.
  • Table 3
  • Figure 4 is a block diagram showing one embodiment of the genomic structural variation detecting system of the present invention.
  • the structural variation detecting system 400 of this embodiment includes an assembling device 41, a comparing device 42, and an extracting device 43.
  • the assembly device 41 assembles the sequencing sequence into a skeleton sequence (scaffold), and outputs a skeleton sequence;
  • the comparison device 42 performs a global pairwise alignment of the skeleton sequence output from the assembly device 41 on the reference genome to obtain a comparison result containing the variation information;
  • the extracting means 43 extracts the variation information from the comparison result containing the variation information.
  • The whole genome sequencing result is assembled by the assembly device to obtain the skeleton sequence, and the skeleton sequence and the reference genome are globally compared by the comparison device, so that an individual-specific genome that does not depend on the reference genome is obtained with high accuracy.
  • Figure 5 is a block diagram showing another embodiment of the genomic structural variation detecting system of the present invention.
  • the structural variation detecting system 400 of the embodiment further includes an optimizing device 50 and a verifying device 54.
  • The optimization device 50 optimizes the sequencing sequences by aligning them to the reference genome to obtain optimized sequencing sequences, and transmits the optimized sequencing sequences to the assembly device 41.
  • The assembly device 41 assembles the optimized sequencing sequences into a scaffold sequence.
  • The optimization device 50 aligns the sequencing sequences to the reference genome with short-sequence alignment software to obtain aligned sequences, and then performs optimization processing such as deduplication, substitution, and filtering on them to obtain the optimized sequencing sequences.
  • The verification device 54 verifies the extracted variation information and removes the variation information that fails verification.
  • The verification device 54 can verify the candidate variation information by various calculation methods to remove the variation information that fails verification, for example by the depth and sequence-cutting methods.
  • For a variation of 50 bp or longer in the variation information, the verification device determines whether its repeatability is less than 10%. If so, it constructs a variant sequence and aligns the sequencing sequences to it; if the alignment depth of the variant sequence conforms to the theoretically expected distribution, the variation passes verification, otherwise it fails verification and is removed. If the repeatability is greater than or equal to 10%, the device determines whether the extension sequence of the variation site is free of repeats; if so, it constructs the variant sequence and aligns the sequencing sequences to it, and the variation passes verification if the alignment depth of the extension sequence conforms to the theoretically expected distribution, otherwise it is removed.
  • For a variation shorter than 50 bp in the variation information, the device constructs a variant sequence and performs a gapped alignment between the sequencing sequences and the variant sequence with a short-sequence alignment tool. If the alignment result conforms to the theoretically expected alignment result, the variation passes verification; otherwise it fails verification and is removed.
  • Optimizing the sequencing sequences with the optimization device improves the accuracy of the subsequent processing.
  • The verification device verifies the genome-wide candidate structural variation set by several methods and removes the unqualified variation information, so that the false-positive rate of the variation information is low. Experiments show that the method of the embodiment of the present invention can obtain a structural variation set with a false-positive rate below 10%.
  • Fig. 6 is a view showing the configuration of still another embodiment of the genomic structural mutation detecting system of the present invention.
  • the optimizing means 50 includes a comparing unit 501, a filtering unit 502, and an error base replacement unit 503.
  • the assembly device 41 includes a map construction unit 411, a cutting unit 412, and a skeleton construction unit 413.
  • the extracting means 43 includes a variation information filtering unit 431 and a variation information extracting unit 432.
  • The comparing unit 501 aligns the sequencing sequences to the reference genome to obtain aligned sequences; the filtering unit 502 filters the aligned sequences and removes the sequences whose average quality is lower than a predetermined value; the error base replacement unit 503 replaces all mismatched bases in the sequences aligned to the reference genome with the bases of the reference genome.
  • The graph construction unit 411 constructs a de Bruijn graph after cutting the optimized sequencing sequences into N-mers; the cutting unit 412 outputs the partial ring structures in the de Bruijn graph and cuts the de Bruijn graph into a plurality of contigs; the skeleton construction unit 413 constructs a skeleton sequence by using the paired-end relationships obtained by sequencing, and fills the gaps in the skeleton sequence to obtain the final skeleton sequence.
  • The variation information filtering unit 431 filters or re-runs the abnormal results in the comparison result containing the variation information; and/or filters the logical error results; and/or removes the common result incomplete, and outputs the filtered comparison result; the variation information extraction unit 432 extracts the variation information from the filtered alignment result output by the variation information filtering unit.
  • a code can represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, or any combination of instructions, data structures, or program statements.
  • the code can be located on a computer readable medium.
  • the computer readable medium can include one or more storage devices including, for example, RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, removable hard disk, CD-ROM, or any other form of storage medium known in the art.
  • the computer readable medium can also include a carrier wave that encodes the data signal.
  • The whole-genome sequencing results are assembled into scaffolds and compared with the reference genome, yielding with high accuracy the individual-specific genome that is independent of the reference genome.
  • Experimental data show that the method of the embodiments of the present invention achieves excellent accuracy for genome sizes from 1M to 3G.
  • A set of candidate structural variations is obtained by analyzing the assembly of the whole-genome sequencing results, which makes the results more comprehensive.
  • This candidate structural variation set can be passed on to the next analysis step.
  • The present invention applies several further methods to verify the genome-wide candidate structural variation set and obtains a structural variation set with a false-positive rate below 10%.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Description

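Method and system for detecting genomic structural variation
Technical Field
The present invention relates to the field of bioinformatics, and in particular to a method and system for detecting genomic structural variation (Structure Variation, SV).
Background Art
Structural variation occupies an important place in the genome and may alter the coding and the function of an individual's genes. With the successful completion of the Human Genome Project and the International HapMap Project, biologists have used genetic linkage or association analysis to locate a large number of candidate genomic regions associated with human disease. Identifying the causative genes or mutations in these regions, however, requires resequencing them. Existing whole-genome resequencing analysis is costly, and for some research and for individual medical guidance the information it yields contains a great deal of redundancy. To obtain useful information more efficiently, concentrating existing genetic analysis techniques on high-value regions is of great significance for scientific research and medical guidance. Conventional PCR (Polymerase Chain Reaction) based sequencing of candidate regions is too time-consuming and labor-intensive to meet researchers' needs, while gene-chip-based SNP (Single Nucleotide Polymorphism) genotyping cannot find rare variants in the genome.
With the advent of next-generation high-throughput sequencing technologies such as Solexa sequencing, and with falling sequencing costs, there is an urgent need for a technique that can sequence regions of interest in the genome and thereby identify the various mutations in those regions.
Summary of the Invention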
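One technical problem addressed by one aspect of the present disclosure is to provide a method for detecting genomic structural variation with higher accuracy.
One aspect of the present disclosure provides a method for detecting genomic structural variation, comprising:
an assembly step of assembling sequencing reads into scaffold sequences (scaffolds);
an alignment step of performing global pairwise alignment of the scaffolds against a reference genome to obtain alignment results containing variation information;
an extraction step of extracting variation information from the alignment results containing variation information.
According to one aspect of the present disclosure, the method further comprises, before the assembly step:
an optimization step of optimizing the sequencing reads by aligning them to the reference genome, to obtain optimized sequencing reads;
the assembly step comprising: assembling the optimized sequencing reads into scaffolds.
According to one aspect of the present disclosure, the method further comprises, after the extraction step:
a verification step of verifying the extracted variation information so as to remove variation information that fails verification.
According to one aspect of the present disclosure, the verification step comprises:
for variations of 50 bp or longer in the variation information, determining whether the repetitiveness is less than 10%; if so, constructing a variant sequence and aligning the sequencing reads to it, the variation passing verification if the depth of the variant sequence conforms to the theoretically expected distribution and otherwise failing verification and being removed; if the repetitiveness is 10% or more, determining whether the extended sequence at the variation site is free of repeats and, if so, constructing a variant sequence and aligning the sequencing reads to it, the variation passing verification if the alignment depth profile of the extended sequence conforms to the theoretically expected distribution and otherwise being removed;
for variations shorter than 50 bp in the variation information, constructing a variant sequence and performing gapped alignment of the sequencing reads against the variant sequence with a short-read alignment tool, the variation passing verification if the alignment result matches the theoretically expected alignment result and otherwise failing verification and being removed.
According to one aspect of the present disclosure, the extraction step further comprises processing the alignment results containing variation information as follows:
filtering out or re-running abnormal results; and/or
filtering out logically erroneous results; and/or
removing common incomplete results.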
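The detailed description later explains the "common incomplete results" as alignment columns in which an N (any of A, C, G, T) is paired with a gap character "-". As a minimal sketch of that clean-up, assuming the two aligned rows are available as equal-length strings (the function name `drop_n_gap_columns` is hypothetical and not part of the disclosure):

```python
def drop_n_gap_columns(row_a, row_b):
    """Remove alignment columns where an N on one row faces a gap on the other.

    Both rows must be equal-length aligned strings using '-' for gaps.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    kept_a, kept_b = [], []
    for a, b in zip(row_a, row_b):
        # An N/gap pair carries no usable variation signal, so skip the column.
        if {a.upper(), b.upper()} == {"N", "-"}:
            continue
        kept_a.append(a)
        kept_b.append(b)
    return "".join(kept_a), "".join(kept_b)
```

Columns where N faces a real base are left alone, since the text only names the N/gap pairing as something to remove.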
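According to one aspect of the present disclosure, the optimization step comprises:
aligning the sequencing reads to the reference genome with a short-read alignment tool to obtain aligned sequences;
the optimization step further comprising:
removing duplicate sequencing reads with the short-read alignment tool;
and/or
replacing every mismatched base in reads aligned to the reference genome with the base found in the reference genome;
and/or
removing from the aligned sequences those sequencing reads whose average quality is below a predetermined value.
According to one aspect of the present disclosure, the assembly step comprises:
cutting the sequencing reads into N-mers and constructing a de Bruijn graph;
outputting contigs and heterozygous sequences from the de Bruijn graph;
building scaffolds from the contigs using the paired-end relationships obtained by sequencing;
filling the gaps in the scaffolds to obtain the final scaffolds.
With the method of the embodiments of the present disclosure, the whole-genome sequencing results are assembled into scaffolds and compared with the reference genome, yielding with high accuracy the individual-specific genome that is independent of the reference genome.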
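One technical problem addressed by another aspect of the present disclosure is to provide a system for detecting genomic structural variation with higher accuracy.
One aspect of the present disclosure provides a system for detecting genomic structural variation, comprising:
an assembly device for assembling sequencing reads into scaffold sequences (scaffolds);
an alignment device for performing global pairwise alignment of the scaffolds against a reference genome to obtain alignment results containing variation information;
an extraction device for extracting variation information from the alignment results containing variation information.
According to one aspect of the present disclosure, the system further comprises:
an optimization device for optimizing the sequencing reads by aligning them to the reference genome, to obtain optimized sequencing reads;
the assembly device being used to assemble the optimized sequencing reads into scaffolds.
According to one aspect of the present disclosure, the system further comprises:
a verification device for verifying the extracted variation information and removing variation information that fails verification.
According to one aspect of the present disclosure, for variations of 50 bp or longer in the variation information, the verification device determines whether the repetitiveness is less than 10%; if so, it constructs a variant sequence and aligns the sequencing reads to it, and the variation passes verification if the depth of the variant sequence conforms to the theoretically expected distribution, otherwise it fails verification and is removed; if the repetitiveness is 10% or more, the device determines whether the extended sequence at the variation site is free of repeats and, if so, constructs a variant sequence and aligns the sequencing reads to it, and the variation passes verification if the alignment depth profile of the extended sequence conforms to the theoretically expected distribution, otherwise it is removed. For variations shorter than 50 bp, the device constructs a variant sequence and performs gapped alignment of the sequencing reads against the variant sequence with a short-read alignment tool; the variation passes verification if the alignment result matches the theoretically expected alignment result, otherwise it fails verification and is removed.
According to one aspect of the present disclosure, the extraction device comprises:
a variation information filtering unit for processing the alignment results containing variation information by filtering out or re-running abnormal results; and/or filtering out logically erroneous results; and/or removing common incomplete results, and for outputting the filtered alignment results;
a variation information extraction unit for extracting variation information from the filtered alignment results output by the variation information filtering unit.
According to one aspect of the present disclosure, the optimization device comprises:
an alignment unit for aligning the sequencing reads to the reference genome to obtain aligned sequences;
a filtering unit for filtering the aligned sequences, removing from the alignment results those sequences whose average quality is below a predetermined value;
an erroneous-base replacement unit for replacing every mismatched base in reads aligned to the reference genome with the base found in the reference genome.
According to one aspect of the present disclosure, the assembly device comprises:
a graph construction unit for cutting the optimized sequencing reads into N-mers and constructing a de Bruijn graph;
a cutting unit for outputting the cyclic structures in the de Bruijn graph and cutting the graph into multiple contigs and heterozygous sequences;
a scaffold construction unit for building scaffolds from the multiple contigs using the paired-end relationships obtained by sequencing, and filling the gaps in the scaffolds to obtain the final scaffolds.
In the embodiments of the genomic structural variation detection system of the present disclosure, the assembly device assembles the whole-genome sequencing results into scaffolds and the alignment device performs a global comparison of the scaffolds with the reference genome, yielding with high accuracy the individual-specific genome that is independent of the reference genome.
Brief Description of the Drawings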
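Fig. 1 shows a flowchart of one embodiment of the genomic structural variation detection method of the present invention;
Fig. 2 shows a flowchart of another embodiment of the genomic structural variation detection method of the present invention;
Fig. 3 shows a flowchart of yet another embodiment of the genomic structural variation detection method of the present invention;
Fig. 4 shows a structural diagram of one embodiment of the genomic structural variation detection system of the present invention;
Fig. 5 shows a structural diagram of another embodiment of the genomic structural variation detection system of the present invention;
Fig. 6 shows a structural diagram of yet another embodiment of the genomic structural variation detection system of the present invention.
Detailed Description of the Embodiments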
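The present invention is described more fully below with reference to the accompanying drawings, in which exemplary embodiments of the invention are illustrated.
The assembly-based method and system for detecting structural variation comprise a method for performing a series of bioinformatic analyses on genomic DNA sequence information and a tool for carrying out those analyses, and are intended to address the shortcomings of existing genome bioinformatics methods and tools.
Fig. 1 shows a flowchart of one embodiment of the genomic structural variation detection method of the present invention.
Step 102, assembly step. The sequencing reads are assembled into scaffold sequences (scaffolds). For example, the sequencing reads are cut into N-mers and a de Bruijn graph is constructed; some of the cyclic structures in the de Bruijn graph are output, and the graph is cut into multiple contigs and heterozygous sequences; the paired-end relationships obtained by sequencing are then used to process the contigs and build scaffolds. Scaffolds that contain gaps are processed by filling the gaps with the base "N", which gives the final scaffolds.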
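The first and last operations of the assembly step can be illustrated with a minimal sketch. It is not the assembler used in the disclosure; the function names `build_de_bruijn` and `pad_gap` are hypothetical, and error correction, cycle handling and paired-end scaffolding are left out.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Cut each read into k-mers and link the (k-1)-mer prefix to the suffix.

    Returns a mapping: prefix node -> list of successor nodes (one per k-mer).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def pad_gap(left_contig, right_contig, gap_len):
    """Join two contigs of a scaffold, filling the gap between them with 'N'."""
    return left_contig + "N" * gap_len + right_contig
```

For instance, the read ATCGA with k = 3 yields the k-mers ATC, TCG and CGA, and therefore the edges AT→TC, TC→CG and CG→GA.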
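Step 104, alignment step. Global pairwise alignment of the scaffolds against the reference genome is performed to obtain alignment results containing variation information. For example, the assembly produced in step 102 is aligned pairwise and globally against the reference genome using long-sequence alignment software. One such program is LASTZ; for details see the reference [Harris, R.S. Improved pairwise alignment of genomic DNA. PhD thesis, Pennsylvania State University (2007)].
Step 106, extraction step. Variation information is extracted from the alignment results containing variation information. The variation information includes the position of the variation site, the variation type, the variant sequence, and similar information.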
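As a rough illustration of the extraction step, the sketch below walks a pair of aligned rows and reports insertions and deletions with their positions and sequences. It assumes the rows have already been read from the alignment output as equal-length strings with "-" for gaps; the function name `extract_indels` and the tuple layout are hypothetical, and larger rearrangements are not handled.

```python
def extract_indels(ref_row, qry_row, ref_start=1, qry_start=1):
    """Return (type, ref_pos, qry_pos, sequence) for each gap run.

    ref_pos / qry_pos are the 1-based coordinates of the next unconsumed base
    on each sequence at the point where the gap run begins.
    """
    if len(ref_row) != len(qry_row):
        raise ValueError("aligned rows must have the same length")
    variants = []
    ref_pos, qry_pos = ref_start, qry_start
    i, n = 0, len(ref_row)
    while i < n:
        r, q = ref_row[i], qry_row[i]
        if r == "-" and q != "-":
            # Gap on the reference row: extra bases in the assembled sequence.
            j = i
            while j < n and ref_row[j] == "-":
                j += 1
            variants.append(("Insertion", ref_pos, qry_pos, qry_row[i:j]))
            qry_pos += j - i
            i = j
        elif q == "-" and r != "-":
            # Gap on the query row: reference bases missing from the assembly.
            j = i
            while j < n and qry_row[j] == "-":
                j += 1
            variants.append(("Deletion", ref_pos, qry_pos, ref_row[i:j]))
            ref_pos += j - i
            i = j
        else:
            if r != "-":
                ref_pos += 1
                qry_pos += 1
            i += 1
    return variants
```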
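In the above embodiment of the present invention, the whole-genome sequencing results are assembled into scaffolds and compared with the reference genome, yielding with high accuracy the individual-specific genome that is independent of the reference genome.
Fig. 2 shows a flowchart of another embodiment of the genomic structural variation detection method of the present invention.
As shown in Fig. 2, step 202 is the optimization step. The sequencing reads are aligned to the reference genome and then optimized to obtain optimized sequencing reads. A sequence alignment tool aligns the sequencing reads to the reference genome to produce aligned sequences, which are then optimized, for example by de-duplication, replacement of erroneous bases and filtering, and converted into optimized sequencing reads.
For example, the sequencing reads are aligned to the reference genome with the BWA software using the parameters "aln -e 0 -o 0". Here "aln" is the BWA sub-command that performs alignment; "-e" sets the upper limit on gap length for gapped alignment; "-o" sets the number of gaps allowed in gapped alignment. BWA is a short-read alignment program; for details see the reference [Heng Li, Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. Vol.25 no.14: 1754-1760 (2009)].
De-duplication of the aligned sequences means removing sequence regions that are highly repetitive. For example, a region such as ATCATCATCATCATC contains many copies of ATC, which would interfere with alignment, so such regions should be excluded. Replacement of erroneous bases means replacing every mismatched base in reads aligned to the reference genome with the base found in the reference genome. Filtering means removing sequences whose average quality value is below a predetermined value X. The parameter X is set according to the average quality value of the sequencing run; the quality value follows the formula Q = -10*lg(Pe), where Pe is the error probability and lg is the base-10 logarithm. A suggested range for X is, for example, [10-20], corresponding to an average error rate of [10%-1%]; the default is 15. Optimizing the sequencing reads in this way can improve the accuracy of the next processing step.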
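The quality formula and the average-quality filter can be sketched as follows. The sketch assumes that per-base quality values are already available as integers (decoding them from a particular FASTQ encoding is not shown), and the function names are hypothetical.

```python
import math


def phred_from_error(pe):
    """Quality value Q = -10 * log10(Pe) for an error probability Pe."""
    return -10 * math.log10(pe)


def error_from_phred(q):
    """Inverse of the formula above: Pe = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def passes_quality(quals, min_mean=15):
    """Keep a read only if its mean quality reaches the threshold X.

    min_mean=15 mirrors the default mentioned in the text.
    """
    return sum(quals) / len(quals) >= min_mean
```

With these definitions, Q = 10 corresponds to an error probability of 10% and Q = 20 to 1%, which matches the suggested range given in the text.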
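Step 204, assembly step. The optimized sequencing reads are assembled into scaffolds. For example, assembly is performed with SOAPdenovo, software developed by the Beijing Genomics Institute (BGI), using the parameter "-K 31", where "-K" sets the K-mer size. For an introduction to SOAPdenovo see the reference [Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res (2009)].
Step 206, alignment step. Global pairwise alignment of the scaffolds against the reference genome is performed to obtain alignment results containing variation information. For example, LASTZ is used to align the scaffolds pairwise and globally against the reference genome with the parameters "--strand=both --chain --ambiguousn --gapped --ydrop=20000 --gap=1000,1 --noentropy --format=axt", and seeds on the reference genome are built with the pattern 12of19. The parameters are defined in the LASTZ documentation: "--strand=both" aligns both the forward and reverse strands; "--chain" performs chaining; "--ambiguousn" treats N as standing for any of several bases; "--gapped" performs gapped alignment; "--ydrop=20000" sets the gapped-alignment penalty threshold to 20000; "--gap=1000,1" sets a penalty of 1000 for opening a gap and 1 for each base by which a gap is extended; "--noentropy" means that entropy is not used to filter high-precision results; "--format=axt" writes the results in axt format.
"12of19" means that the seed pattern is 12of19. A seed is a 19-base sequence selected from the reference sequence according to rules set by the software. Whether the target sequence matches a seed is decided using only the 12 base positions within the seed that the software designates. If the seed region matches, the alignment is extended from the seed region in both directions until it is complete, and the alignment result is output.
Step 208, extraction step. The alignment results containing variation information are filtered, and the variation information is extracted from the filtered alignment results. The filtering includes: (1) filtering out or re-running abnormal results, (2) filtering out logically erroneous results, and (3) removing common incomplete results. (1) Filtering out or re-running abnormal results: results from lastz runs that did not execute normally are filtered out, meaningless parts of the comments in the lastz output are filtered out, and lastz jobs whose output lacks the normal end-of-output marker are re-run. (2) Filtering out logically erroneous results: these include one assembled sequence aligning to two or more chromosomes, and the same position on a chromosome being aligned by two or more assembled sequences; among such results the ones of better quality are kept. (3) In alignment results an N (which may be any of A, C, G, T) frequently appears paired with "-" (an alignment gap); such pairs should be removed.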
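Step 210, verification step. The extracted variation information is verified so as to remove variation information that fails verification. Candidate variation information can be verified by various computational methods, for example by depth and sequence-cutting methods. For variations of 50 bp or longer, a variant sequence is first constructed and the sequencing reads are aligned to it; the variation passes verification if the depth of the variant sequence conforms to the theoretically expected distribution, otherwise it is removed. For variations shorter than 50 bp, a variant sequence is first constructed, and sequence alignment software such as BWA is then used to perform gapped alignment of the short sequencing reads against the variant sequence with the parameters "-e 50 -o 1 -i 5"; the variation passes verification if the alignment result matches the theoretically expected alignment result, otherwise it is removed. Finally the two sets are merged to give the final result. The statement that the depth of the variant sequence conforms to the theoretically expected distribution means that, if the target sequence agrees with the reference sequence, the depth at every point in the region should be relatively high and the depths at the individual points should be close to one another; otherwise the values are low.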
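The depth criterion described above (relatively high and roughly uniform depth across the variant sequence) can be sketched as follows. The text gives no numeric cut-offs, so the thresholds `min_mean` and `max_cv` are illustrative assumptions only, as are the function names.

```python
from statistics import mean, pstdev


def per_base_depth(length, alignments):
    """Count how many aligned reads cover each position.

    alignments: iterable of (start, end) pairs, 0-based and half-open.
    """
    depth = [0] * length
    for start, end in alignments:
        for i in range(max(0, start), min(length, end)):
            depth[i] += 1
    return depth


def depth_supports_variant(depth, min_mean=5.0, max_cv=0.5):
    """True if depth is high enough and even enough across the region.

    min_mean and max_cv (coefficient of variation) are assumed thresholds.
    """
    if not depth:
        return False
    m = mean(depth)
    if m < min_mean:
        return False
    return pstdev(depth) / m <= max_cv
```

A real pipeline would derive the expected depth from the sequencing coverage of the sample rather than from fixed constants.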
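It should be noted that the optimization step and the verification step are optional steps of the embodiments of the present invention; either one or both may be included.
In the above embodiment, optimizing the sequencing reads can improve the accuracy of the next processing step. The genome-wide candidate structural variation set is verified by several methods and variation information that fails verification is removed, so that the false-positive rate of the variation information is low. Experiments show that the method of the embodiments of the present invention can produce a structural variation set with a false-positive rate below 10%.
Fig. 3 shows a flowchart of yet another embodiment of the genomic structural variation detection method of the present invention.
As shown in Fig. 3, step 301 is BWA alignment. The sequencing reads are aligned to the reference genome with the BWA software to obtain aligned sequences.
In step 302, BWA de-duplication. Highly repetitive sequences are removed with the BWA software.
In step 303, mismatched bases are replaced with the reference bases and the sequences are filtered by quality value. Every mismatched base in reads aligned to the reference genome is replaced with the base found in the reference genome, and sequences whose average quality value is below the predetermined value X are removed.
In step 304, the assembled de Bruijn graph is generated.
In step 305, contigs and heterozygous sequences are output from the de Bruijn graph.
In step 306, the contigs or heterozygous sequences are obtained. The subsequent steps 307 to 309 process the contigs and the heterozygous sequences separately.
In step 307, the reference sequence and the assembled result sequences are split; the result sequences here are the contigs and heterozygous sequences.
In step 308, pairwise alignment is performed on the split pieces. The reference sequence and the result sequences are each split into multiple pieces, and each small piece from the reference sequence is aligned against the small pieces from the result sequences until all pieces have been aligned.
In step 309, alignment errors are corrected, logical errors are removed, and the variation information is output.
In step 310, the variation information is obtained.
In step 311, it is determined whether the variation length is 50 bp or more; if so, the method continues with step 312, otherwise with step 317.
In step 312, the sequence repetitiveness is calculated. The information in a region of the sequence is compared with the information in a repeat-sequence library to determine whether they agree; if they do, that region is judged to be a repeat region. The whole sequence may also be a repeat region. The repetitiveness of the sequence is obtained as the ratio of the length of the repeat regions to the length of the whole sequence.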
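The ratio computed in step 312 can be sketched as below, assuming the repeat regions have already been identified against a repeat library and are given as intervals on the sequence. The function name `repeat_fraction` is hypothetical; overlapping intervals are counted once.

```python
def repeat_fraction(seq_len, repeat_intervals):
    """Fraction of a sequence covered by repeat regions.

    repeat_intervals: (start, end) pairs, 0-based and half-open; they may
    overlap or extend past the sequence ends.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(repeat_intervals):
        start = max(start, last_end, 0)   # skip the part already counted
        end = min(end, seq_len)
        if end > start:
            covered += end - start
            last_end = end
    return covered / seq_len
```

A variant sequence would then take the low-repeat branch of the flowchart when `repeat_fraction(...) < 0.10`.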
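In step 313, it is determined whether the repetitiveness is less than 10%; if so, the method continues with step 316, otherwise with step 314.
In step 314, it is determined whether the extended sequence at the variation site is free of repeats; if so, the method continues with step 315.
In step 315, the variant sequence is derived and aligned with the reference sequence. Verified variants are obtained from the alignment depth profile of the extended sequence, and the variation result is output.
In step 316, the variant sequence is derived and aligned with the reference sequence. If the variant sequence is correct, its alignment depth will be relatively high and relatively even. Verified variants are obtained from the depth comparison, and the variation result is output.
In step 317, the variant sequence is derived.
In step 318, gapped single-end or paired-end BWA alignment results are obtained. Sequencing reads are of two kinds, single-end and paired-end, and BWA uses different methods to align each kind. For details see: http://bio-bwa.sourceforge.net/bwa.shtml.
In step 319, the sequence around the variation site is extracted. Each variation site has position information; this position is located in the reference sequence, and sequence of a given length before and after the position is cut out and joined to the variant sequence of the variation site, forming a new sequence.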
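Step 319 can be sketched as a small helper that joins the reference flanks to the variant sequence. The coordinate convention (0-based, half-open) and the default flank length are assumptions made for illustration; the text only says that sequence of "a given length" is taken on each side.

```python
def build_variant_sequence(reference, start, end, alt, flank=100):
    """Build the sequence expected at a variation site.

    reference[start:end] is replaced by alt and surrounded by up to `flank`
    reference bases on each side.
    Insertion: start == end and alt holds the inserted bases.
    Deletion:  alt == "" and reference[start:end] is the deleted segment.
    """
    left = reference[max(0, start - flank):start]
    right = reference[end:end + flank]
    return left + alt + right
```

The sequencing reads would then be aligned against this constructed sequence, as in the following step.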
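In step 320, gapped BWA alignment. BWA is run with the parameter -o 1, which allows the alignment of the target sequence to the reference sequence to contain a gap or to contain none.
In step 322, verified variants are obtained from the gap pattern and the depth distribution of the alignment results, and the variation result is output.
Several application examples of the method of the present invention are described below.
Application example 1: human exome capture sequencing.
The exome sequencing of individual NA12156 from the International HapMap Project is taken as an example (sample number: NA12156; download address: ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX005/SRX005923). The raw data comprise 11346285 short reads.
The sequencing results of the human exome NA12156 are filtered and optimized against the reference genome using the BWA tool and a filtering program; the filtered and optimized reads are assembled with SOAPdenovo; the assembly is aligned pairwise against the reference genome with LASTZ; the alignment results are filtered and abnormal results removed by the structural-variation extraction software; finally the structural-variation verification software verifies the variations by depth and sequence-cutting methods. For variations of 50 bp or longer, it is determined whether the repetitiveness is less than 10%; if so, a variant sequence is constructed and the sequencing reads are aligned to it, and the variation passes verification if the depth of the variant sequence conforms to the theoretically expected distribution, otherwise it is removed. If the repetitiveness is 10% or more, it is determined whether the extended sequence at the variation site is free of repeats; if so, a variant sequence is constructed and the sequencing reads are aligned to it, and the variation passes verification if the alignment depth profile of the extended sequence conforms to the theoretically expected distribution, otherwise it is removed. For variations shorter than 50 bp, a variant sequence is constructed and gapped alignment is performed with BWA using the parameters -e 50 -o 1 -i 5; the variation passes verification if the alignment result matches the theoretically expected alignment result, otherwise it is removed. Finally the two sets are merged to give the final result. The specific steps are as follows:
Step 1, optimization step
After optimization of the short reads (alignment, de-duplication, replacement, filtering), 9303954 short reads remain.
Step 2, assembly step
The optimized short reads are assembled. The assembled genome size is 218030396 bp, with 3941732 assembled sequences; the longest assembled sequence is 9042 bp, N50 is 298 bp and N90 is 122 bp.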
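The N50 and N90 figures quoted for the assembly follow the usual definition: the length L such that sequences of length at least L account for at least 50% (or 90%) of the total assembled bases. A minimal sketch, with the hypothetical helper name `nx`:

```python
def nx(lengths, fraction):
    """Return the Nx statistic, e.g. fraction=0.5 for N50, 0.9 for N90."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    return 0  # only reached for an empty input


# Usage on a toy set of assembled-sequence lengths (not data from the text):
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
n50 = nx(toy_lengths, 0.5)
n90 = nx(toy_lengths, 0.9)
```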
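Step 3, alignment step
The alignment of the assembled sequences against the reference genome contains 64696911 pairwise alignment results.
Step 4, extraction step
There are 37014 candidate SVs, of which 5695 are longer than 50 bp and 31253 are shorter than 50 bp.
Step 5, verification step
3294 genomic variations are verified, of which 425 lie in the capture region. The first 9 SVs are shown in Table 1 below:
Figure imgf000014_0001
Table 1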
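Application example 2: human exome capture sequencing.
This example uses exome sequencing of colon cancer cells (sample number: yv 090508). The raw data comprise 105972839 short reads (sequencing reads).
Step 1, optimization step
After optimization of the short reads (alignment, de-duplication, replacement, filtering), 69549590 short reads remain.
Step 2, assembly step
The optimized short reads are assembled. The assembled genome size is 118938172 bp, with 253868 assembled sequences; the longest assembled sequence is 16885 bp, N50 is 793 bp and N90 is 170 bp.
Step 3, alignment step
The alignment of the assembled sequences against the reference genome contains 11882543 pairwise alignment results.
Step 4, extraction step
There are 57433 candidate SVs, of which 12056 are longer than 50 bp and 45377 are shorter than 50 bp.
Step 5, verification step
9377 SVs are verified, of which 91 lie in the capture region. The first 13 SVs are shown in Table 2 below: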
Assembled seq. ID  Variation type  Assembled start  Assembled end  Ref. chromosome  Chr. start   Chr. end     Variant genotype
1811143            Insertion       36               38             chr7             65479979     65479979     AT
1833167            Insertion       261              264            chr3             126129396    126129396    AAG
1837575            Deletion        142              142            chr17            71800160     71800163     TGA
1848441            Insertion       17               20             chr3             46476289     46476289     CTT
1850771            Insertion       338              341            chr21            46546414     46546414     TGG
1852777            Deletion        343              343            chr7             15692332     15692335     TGG
1874031            Insertion       83               86             chr16            88444381     88444381     GAG
1874671            Deletion        410              410            chr7             143288092    143288093    G
1881421            Insertion       368              371            chr17            15284250     15284250     CTT
1883581            Deletion        215              215            chr6             160480888    160480896    TGGTAAGT
1887101            Deletion        146              146            chr19            1776928      1776931      CTC
1891753            Deletion        139              139            chr1             52078652     52078655     TCT
1896823            Deletion        363              363            chr19            59367581     59367584     CCC
Table 2
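Application example 3: microbial sequencing.
This example uses a strain of Vibrio parahaemolyticus (sample number: VIBydvDlOpoolingIAAPEI-9-l). The raw data comprise 5631982 short reads.
Step 1, optimization step
After optimization of the short reads (alignment, de-duplication, replacement, filtering), 5213412 short reads remain.
Step 2, assembly step
The assembled genome size is 5056512 bp, with 684 assembled sequences; the longest assembled sequence is 94989 bp, N50 is 23988 bp and N90 is 5603 bp.
Step 3, alignment step
The alignment of the assembled sequences against the reference genome contains 1442 pairwise alignment results.
Step 4, extraction step
There are 725 candidate SVs, of which 196 are longer than 50 bp and 529 are shorter than 50 bp.
Step 5, verification step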
180 SVs are verified, of which the first 19 are shown in Table 3:
Assembled seq. ID  Variation type  Assembled start  Assembled end  Ref. chromosome                  Chr. start  Chr. end  Variant genotype
S001_4988          Deletion        623              623            Vibrio_parahaemolyticus RIMD 1   281201      281202    T
S001_4998          Deletion        164              164            Vibrio_parahaemolyticus RIMD 1   1336020     1336024   ATGT
S001_5000          Insertion       231              234            Vibrio_parahaemolyticus RIMD 1   949303      949303    AAA
S001_5030          Deletion        536              536            Vibrio_parahaemolyticus RIMD 1   1090795     1090796   T
S001_5176          Insertion       2626             2627           Vibrio_parahaemolyticus RIMD 1   1499322     1499322   C
S001_5188          Deletion        2335             2335           Vibrio_parahaemolyticus RIMD 1   723095      723096    A
S001_5240          Deletion        98               98             Vibrio_parahaemolyticus RIMD 1   680139      680140    A
S001_5260          Deletion        1853             1853           Vibrio_parahaemolyticus RIMD 1   676815      676816    A
S001_5348          Insertion       855              856            Vibrio_parahaemolyticus RIMD 1   1335062     1335062   G
S001_5360          Insertion       5675             5676           Vibrio_parahaemolyticus RIMD 1   341442      341442    T
S001_5364          Deletion        35               35             Vibrio_parahaemolyticus RIMD 1   962113      962114    T
S001_5384          Insertion       5462             5463           Vibrio_parahaemolyticus RIMD 1   312520      312520    A
S001_5388          Deletion        6105             6105           Vibrio_parahaemolyticus RIMD 1   667732      667733    A
S001_5398          Deletion        6585             6585           Vibrio_parahaemolyticus RIMD 1   996693      996694    A
S001_5408          Deletion        1482             1482           Vibrio_parahaemolyticus RIMD 1   128682      128683    T
S001_5426          Insertion       5406             5407           Vibrio_parahaemolyticus RIMD 1   71294       71294     T
S001_5436          Deletion        8239             8239           Vibrio_parahaemolyticus RIMD 1   50680       50684     ACAT
S001_5436          Deletion        8185             8185           Vibrio_parahaemolyticus RIMD 1   50738       50739     A
S001_5436          Deletion        49               49             Vibrio_parahaemolyticus RIMD 1   58875       58877     AT
Table 3
Fig. 4 shows a structural diagram of one embodiment of the genomic structural variation detection system of the present invention. As shown in Fig. 4, the structural variation detection system 400 of this embodiment includes an assembly device 41, an alignment device 42 and an extraction device 43. The assembly device 41 assembles the sequencing reads into scaffold sequences (scaffolds) and outputs the scaffolds; the alignment device 42 performs global pairwise alignment of the scaffolds output by the assembly device 41 against the reference genome to obtain alignment results containing variation information; the extraction device 43 extracts variation information from the alignment results containing variation information.
In the above embodiment, the assembly device assembles the whole-genome sequencing results into scaffolds and the alignment device performs a global comparison of the scaffolds with the reference genome, yielding with high accuracy the individual-specific genome that is independent of the reference genome.
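Fig. 5 shows a structural diagram of another embodiment of the genomic structural variation detection system of the present invention. Compared with Fig. 4, the structural variation detection system 400 of this embodiment optionally further includes an optimization device 50 and a verification device 54. The optimization device 50 optimizes the sequencing reads by aligning them to the reference genome to obtain optimized sequencing reads, and sends the optimized sequencing reads to the assembly device 41. The assembly device 41 assembles the optimized sequencing reads into scaffold sequences (scaffolds). For example, the optimization device 50 aligns the sequencing reads to the reference genome with short-read alignment software to obtain aligned sequences, and then applies optimization such as de-duplication, replacement and filtering to the aligned sequences to obtain the optimized sequencing reads.
The verification device 54 verifies the extracted variation information and removes variation information that fails verification. The verification device 54 can verify the candidate variation information by various computational methods, for example by depth and sequence-cutting methods. According to one embodiment of the present invention, for variations of 50 bp or longer in the variation information, the verification device determines whether the repetitiveness is less than 10%; if so, it constructs a variant sequence and aligns the sequencing reads to it, and the variation passes verification if the depth of the variant sequence conforms to the theoretically expected distribution, otherwise it fails verification and is removed; if the repetitiveness is 10% or more, the device determines whether the extended sequence at the variation site is free of repeats and, if so, constructs a variant sequence and aligns the sequencing reads to it, and the variation passes verification if the alignment depth profile of the extended sequence conforms to the theoretically expected distribution, otherwise it is removed. For variations shorter than 50 bp, the device constructs a variant sequence and performs gapped alignment of the sequencing reads against the variant sequence with a short-read alignment tool; the variation passes verification if the alignment result matches the theoretically expected alignment result, otherwise it fails verification and is removed.
In the above embodiment, the optimization device optimizes the sequencing reads, which can improve the accuracy of the next processing step. The verification device verifies the genome-wide candidate structural variation set by several methods and removes variation information that fails verification, so that the false-positive rate of the variation information is low. Experiments show that the method of the embodiments of the present invention can produce a structural variation set with a false-positive rate below 10%.
Fig. 6 shows a structural diagram of yet another embodiment of the genomic structural variation detection system of the present invention. As shown in Fig. 6, in the structural variation detection system 600 of this embodiment, the optimization device 50 includes an alignment unit 501, a filtering unit 502 and an erroneous-base replacement unit 503. The assembly device 41 includes a graph construction unit 411, a cutting unit 412 and a scaffold construction unit 413. The extraction device 43 includes a variation information filtering unit 431 and a variation information extraction unit 432.
The alignment unit 501 aligns the sequencing reads to the reference genome to obtain aligned sequences; the filtering unit 502 filters the aligned sequences, removing those whose average quality is below a predetermined value; the erroneous-base replacement unit 503 replaces every mismatched base in reads aligned to the reference genome with the base found in the reference genome. The graph construction unit 411 cuts the optimized sequencing reads into N-mers and constructs a de Bruijn graph; the cutting unit 412 outputs some of the cyclic structures in the de Bruijn graph and cuts the graph into multiple contigs; the scaffold construction unit 413 builds scaffolds from the paired-end relationships obtained by sequencing and fills the gaps in the scaffolds to obtain the final scaffolds. The variation information filtering unit 431 processes the alignment results containing variation information by filtering out or re-running abnormal results; and/or filtering out logically erroneous results; and/or removing common incomplete results, and outputs the filtered alignment results; the variation information extraction unit 432 extracts variation information from the filtered alignment results output by the variation information filtering unit.
For the functions of the devices and units in Figs. 4 to 6, reference may be made to the corresponding parts of the method embodiments described above; for brevity they are not described again here.
Those skilled in the art will understand that the devices in Figs. 4 to 6 may each be implemented by a separate computing device or may be integrated into a single stand-alone device. They are drawn as blocks in Figs. 4 to 6 to illustrate their functions. These functional blocks may be implemented in hardware, software, firmware, middleware, microcode, a hardware description language, or any combination thereof. For example, one or both functional blocks may be implemented by code running on a microprocessor, a digital signal processor (DSP) or any other suitable computing device. The code may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, or any combination of instructions, data structures or program statements. The code may reside on a computer-readable medium. The computer-readable medium may include one or more storage devices, including, for example, RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. The computer-readable medium may also include a carrier wave that encodes a data signal.
Those skilled in the art will recognize the interchangeability of hardware, firmware and software configurations in these circumstances, and how best to implement the described functionality for each particular application.
In the above embodiments of the present invention, the whole-genome sequencing results are assembled into scaffolds and compared with the reference genome, yielding with high accuracy the individual-specific genome that is independent of the reference genome. Experimental data show that the method of the embodiments of the present invention achieves excellent accuracy for genome sizes from 1M to 3G. In addition, the candidate structural variation set is obtained by analyzing the assembly of the whole-genome sequencing results, which makes the results more comprehensive. This candidate structural variation set can be passed on to the next analysis step. The present invention applies several further methods to verify the genome-wide candidate structural variation set and obtains a structural variation set with a false-positive rate below 10%.
The description of the present invention has been given for the purposes of illustration and description and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and so design various embodiments, with various modifications, suited to particular uses.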

Claims

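1. A method for detecting genomic structural variation, characterized by comprising:
an assembly step of assembling sequencing reads into scaffold sequences (scaffolds);
an alignment step of performing global pairwise alignment of the scaffolds against a reference genome to obtain alignment results containing variation information;
an extraction step of extracting variation information from the alignment results containing variation information.
2. The method for detecting genomic structural variation according to claim 1, characterized by further comprising, before the assembly step:
an optimization step of optimizing the sequencing reads by aligning them to the reference genome, to obtain optimized sequencing reads;
the assembly step comprising:
assembling the optimized sequencing reads into scaffolds.
3. The method for detecting genomic structural variation according to claim 1, characterized by further comprising, after the extraction step:
a verification step of verifying the extracted variation information so as to remove variation information that fails verification.
4. The method for detecting genomic structural variation according to claim 3, characterized in that the verification step comprises:
for variations of 50 bp or longer in the variation information, determining whether the repetitiveness is less than 10%; if so, constructing a variant sequence and aligning the sequencing reads to the variant sequence, the variation passing verification if the depth of the variant sequence conforms to the theoretically expected distribution and otherwise failing verification and being removed; if the repetitiveness is 10% or more, determining whether the extended sequence at the variation site is free of repeats and, if so, constructing a variant sequence and aligning the sequencing reads to the variant sequence, the variation passing verification if the alignment depth profile of the extended sequence conforms to the theoretically expected distribution and otherwise being removed;
for variations shorter than 50 bp in the variation information, constructing a variant sequence and performing gapped alignment of the sequencing reads against the variant sequence with a short-read alignment tool, the variation passing verification if the alignment result matches the theoretically expected alignment result and otherwise failing verification and being removed.
5. The method for detecting genomic structural variation according to claim 1, characterized in that the extraction step further comprises:
processing the alignment results containing variation information as follows:
filtering out or re-running abnormal results; and/or
filtering out logically erroneous results; and/or
removing common incomplete results.
6. The method for detecting genomic structural variation according to claim 2, characterized in that the optimization step comprises:
aligning the sequencing reads to the reference genome with a short-read alignment tool to obtain aligned sequences;
the optimization step further comprising:
removing duplicate sequencing reads with the short-read alignment tool;
and/or
replacing every mismatched base in reads aligned to the reference genome with the base found in the reference genome;
and/or
removing from the aligned sequences those sequencing reads whose average quality is below a predetermined value.
7. The method for detecting genomic structural variation according to claim 1, characterized in that the assembly step comprises:
cutting the sequencing reads into N-mers and constructing a de Bruijn graph;
outputting contigs and heterozygous sequences from the de Bruijn graph;
building scaffolds from the contigs using the paired-end relationships obtained by sequencing;
filling the gaps in the scaffolds to obtain the final scaffolds.
8. A system for detecting genomic structural variation, characterized by comprising:
an assembly device for assembling sequencing reads into scaffold sequences (scaffolds);
an alignment device for performing global pairwise alignment of the scaffolds against a reference genome to obtain alignment results containing variation information;
an extraction device for extracting variation information from the alignment results containing variation information.
9. The system for detecting genomic structural variation according to claim 8, characterized by further comprising:
an optimization device for optimizing the sequencing reads by aligning them to the reference genome, to obtain optimized sequencing reads;
the assembly device being used to assemble the optimized sequencing reads into scaffolds.
10. The system for detecting genomic structural variation according to claim 8, characterized by further comprising:
a verification device for verifying the extracted variation information and removing variation information that fails verification.
11. The system for detecting genomic structural variation according to claim 10, characterized in that, for variations of 50 bp or longer in the variation information, the verification device determines whether the repetitiveness is less than 10%; if so, it constructs a variant sequence and aligns the sequencing reads to the variant sequence, the variation passing verification if the depth of the variant sequence conforms to the theoretically expected distribution and otherwise failing verification and being removed; if the repetitiveness is 10% or more, it determines whether the extended sequence at the variation site is free of repeats and, if so, constructs a variant sequence and aligns the sequencing reads to the variant sequence, the variation passing verification if the alignment depth profile of the extended sequence conforms to the theoretically expected distribution and otherwise being removed; for variations shorter than 50 bp in the variation information, it constructs a variant sequence and performs gapped alignment of the sequencing reads against the variant sequence with a short-read alignment tool, the variation passing verification if the alignment result matches the theoretically expected alignment result and otherwise failing verification and being removed.
12. The system for detecting genomic structural variation according to claim 8, characterized in that the extraction device comprises:
a variation information filtering unit for processing the alignment results containing variation information by filtering out or re-running abnormal results; and/or filtering out logically erroneous results; and/or removing common incomplete results, and for outputting the filtered alignment results;
a variation information extraction unit for extracting variation information from the filtered alignment results output by the variation information filtering unit.
13. The system for detecting genomic structural variation according to claim 9, characterized in that the optimization device comprises:
an alignment unit for aligning the sequencing reads to the reference genome to obtain aligned sequences;
a filtering unit for filtering the aligned sequences, removing from the alignment results those sequences whose average quality is below a predetermined value;
an erroneous-base replacement unit for replacing every mismatched base in reads aligned to the reference genome with the base found in the reference genome.
14. The system for detecting genomic structural variation according to claim 8, characterized in that the assembly device comprises:
a graph construction unit for cutting the optimized sequencing reads into N-mers and constructing a de Bruijn graph;
a cutting unit for outputting the cyclic structures in the de Bruijn graph and cutting the graph into multiple contigs and heterozygous sequences;
a scaffold construction unit for building scaffolds from the multiple contigs using the paired-end relationships obtained by sequencing, and filling the gaps in the scaffolds to obtain the final scaffolds.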
PCT/CN2010/001409 2010-09-14 2010-09-14 Method and system for detecting genomic structural variation WO2012034251A2 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201080068345.0A CN103080333B (zh) 2010-09-14 2010-09-14 Method and system for detecting genomic structural variation
PCT/CN2010/001409 WO2012034251A2 (zh) 2010-09-14 2010-09-14 Method and system for detecting genomic structural variation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2010/001409 WO2012034251A2 (zh) 2010-09-14 2010-09-14 Method and system for detecting genomic structural variation

Publications (1)

Publication Number Publication Date
WO2012034251A2 true WO2012034251A2 (zh) 2012-03-22

Family

ID=45832006

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/001409 WO2012034251A2 (zh) 2010-09-14 2010-09-14 Method and system for detecting genomic structural variation

Country Status (2)

Country Link
CN (1) CN103080333B (zh)
WO (1) WO2012034251A2 (zh)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093121A (zh) * 2012-12-28 2013-05-08 深圳先进技术研究院 双向多步deBruijn图的压缩存储和构造方法
CN103258145A (zh) * 2012-12-22 2013-08-21 中国科学院深圳先进技术研究院 一种基于De Bruijn图的并行基因拼接方法
CN103810402A (zh) * 2014-02-25 2014-05-21 北京诺禾致源生物信息科技有限公司 用于基因组的数据处理方法和装置
CN104968806A (zh) * 2013-02-01 2015-10-07 Sk电信有限公司 提供与基于基因序列的个人标记有关的信息的方法和装置
CN108140070A (zh) * 2015-02-25 2018-06-08 螺旋遗传学公司 多样品差分变异检测
CN109074429A (zh) * 2016-04-20 2018-12-21 华为技术有限公司 基因组变异检测方法、装置及终端
CN110079589A (zh) * 2019-05-21 2019-08-02 中国农业科学院农业基因组研究所 一种精准获得全基因组范围内结构变异的方法
CN112086131A (zh) * 2020-08-18 2020-12-15 西安医学院 一种高通量测序中假阳性变异位点的筛选方法
CN117153248A (zh) * 2023-09-05 2023-12-01 天津极智基因科技有限公司 一种基于泛基因组的基因区变异检测及可视化方法、***

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714263B (zh) * 2013-12-10 2017-06-13 深圳先进技术研究院 双向多步De Bruijn图的错误双向边识别与去除方法
CN104751015B (zh) * 2013-12-30 2017-08-29 中国科学院天津工业生物技术研究所 一种基因组测序数据序列组装方法
CN104164479B (zh) * 2014-04-04 2017-09-19 深圳华大基因科技服务有限公司 杂合基因组处理方法
WO2016000267A1 (zh) * 2014-07-04 2016-01-07 深圳华大基因股份有限公司 确定探针序列的方法和基因组结构变异的检测方法
CN105483244B (zh) * 2015-12-28 2019-10-22 武汉菲沙基因信息有限公司 一种基于超长基因组的变异检测方法及检测***
BR112018075407A2 (pt) * 2016-06-07 2019-03-19 Illumina, Inc. plataforma de análise genômica para executar uma segmentação de análise de sequência
CN110462063B (zh) * 2017-05-23 2023-06-23 深圳华大生命科学研究院 一种基于测序数据的变异检测方法、装置和存储介质
CN110021359B (zh) * 2017-07-24 2021-05-04 深圳华大基因科技服务有限公司 一种二代和三代序列联合组装结果去冗余的方法和装置
CN110349629B (zh) * 2019-06-20 2021-08-06 湖南赛哲医学检验所有限公司 一种利用宏基因组或宏转录组检测微生物的分析方法
CN111724858B (zh) * 2020-05-14 2024-06-07 东北林业大学 利用软件运行基因组序列比对修补gap的方法
CN111863135B (zh) * 2020-07-15 2022-06-07 西安交通大学 一种假阳性结构变异过滤方法、存储介质及计算设备
CN112289376B (zh) * 2020-10-26 2021-07-06 北京吉因加医学检验实验室有限公司 一种检测体细胞突变的方法及装置
CN112599193A (zh) * 2021-03-02 2021-04-02 北京橡鑫生物科技有限公司 结构变异检测模型、其构建方法和装置
CN115602244B (zh) * 2022-10-24 2023-04-28 哈尔滨工业大学 一种基于序列比对骨架的基因组变异检测方法

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2523490A1 (en) * 2003-04-25 2004-11-11 Sequenom, Inc. Fragmentation-based methods and systems for de novo sequencing

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258145A (zh) * 2012-12-22 2013-08-21 中国科学院深圳先进技术研究院 一种基于De Bruijn图的并行基因拼接方法
CN103093121A (zh) * 2012-12-28 2013-05-08 深圳先进技术研究院 双向多步deBruijn图的压缩存储和构造方法
CN103093121B (zh) * 2012-12-28 2016-01-27 深圳先进技术研究院 双向多步deBruijn图的压缩存储和构造方法
CN104968806A (zh) * 2013-02-01 2015-10-07 Sk电信有限公司 提供与基于基因序列的个人标记有关的信息的方法和装置
KR101770962B1 (ko) * 2013-02-01 2017-08-24 에스케이텔레콤 주식회사 유전자 서열 기반 개인 마커에 관한 정보를 제공하는 방법 및 이를 이용한 장치
CN103810402A (zh) * 2014-02-25 2014-05-21 北京诺禾致源生物信息科技有限公司 用于基因组的数据处理方法和装置
CN108140070A (zh) * 2015-02-25 2018-06-08 螺旋遗传学公司 多样品差分变异检测
CN109074429A (zh) * 2016-04-20 2018-12-21 华为技术有限公司 基因组变异检测方法、装置及终端
CN109074429B (zh) * 2016-04-20 2022-03-29 华为技术有限公司 基因组变异检测方法、装置及终端
CN110079589A (zh) * 2019-05-21 2019-08-02 中国农业科学院农业基因组研究所 一种精准获得全基因组范围内结构变异的方法
CN112086131A (zh) * 2020-08-18 2020-12-15 西安医学院 一种高通量测序中假阳性变异位点的筛选方法
CN112086131B (zh) * 2020-08-18 2024-05-24 西安医学院 一种重测序数据库中假阳性变异位点的筛选方法
CN117153248A (zh) * 2023-09-05 2023-12-01 天津极智基因科技有限公司 一种基于泛基因组的基因区变异检测及可视化方法、***
CN117153248B (zh) * 2023-09-05 2024-05-07 天津极智基因科技有限公司 一种基于泛基因组的基因区变异检测及可视化方法、***

Also Published As

Publication number Publication date
CN103080333A (zh) 2013-05-01
CN103080333B (zh) 2015-06-24

Similar Documents

Publication Publication Date Title
WO2012034251A2 (zh) Method and system for detecting genomic structural variation
JP6725481B2 (ja) 母体血漿の無侵襲的出生前分子核型分析
US10783984B2 (en) De novo diploid genome assembly and haplotype sequence reconstruction
CN113496760B (zh) 基于第三代测序的多倍体基因组组装方法和装置
CN108350495B (zh) 对分隔长片段序列进行组装的方法和装置
WO2013097257A1 (zh) 一种检验融合基因的方法及***
Wildschutte et al. Discovery and characterization of Alu repeat sequences via precise local read assembly
Zhou et al. Prevention, diagnosis and treatment of high‐throughput sequencing data pathologies
US20190244678A1 (en) Methods, systems and processes of de novo assembly of sequencing reads
CN110021355B (zh) 二倍体基因组测序片段的单倍体分型和变异检测方法和装置
CN110621785B (zh) 基于三代捕获测序对二倍体基因组单倍体分型的方法和装置
Kremer et al. Approaches for in silico finishing of microbial genome sequences
US20150169823A1 (en) String graph assembly for polyploid genomes
Ratan Assembly algorithms for next-generation sequence data
US10424395B2 (en) Computation pipeline of single-pass multiple variant calls
Te Boekhorst et al. Computational problems of analysis of short next generation sequencing reads
WO2019132010A1 (ja) 塩基配列における塩基種を推定する方法、装置及びプログラム
Isakov et al. Deep sequencing data analysis: challenges and solutions
Lapidus Genome sequence databases (overview): sequencing and assembly
WO2013097143A1 (zh) 估计基因组杂合率的方法和装置
WO2013097149A1 (zh) 估计基因组重复序列含量的方法和装置
Lonardi et al. Barcoding-free BAC pooling enables combinatorial selective sequencing of the barley gene space
Girilishena Complete computational sequence characterization of mobile element variations in the human genome using meta-personal genome data
Ning et al. Next‐Generation Sequencing Technologies and the Assembly of Short Reads into Reference Genome Sequences
Mishra et al. Genome assembly and annotation

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201080068345.0

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10857107

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10857107

Country of ref document: EP

Kind code of ref document: A2