CN107679366A - A kind of computational methods of genome mutation data - Google Patents
- Publication number
- CN107679366A
- Authority
- CN
- China
- Prior art keywords
- indel
- software
- data
- values
- rate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
Abstract
The invention belongs to the field of bioinformatics for high-throughput sequencing, and in particular relates to a method for computing genome mutation data. Samtools, GATK, Varscan, Pindel and SOAPIndel are used to detect Indels in simulated data and generate raw Indel data; the combined F value of every pair of programs is calculated, and an optimal selection rule is established from the best F values. The same programs are then used to detect Indels in the data under test, the calls are grouped by DS, RT, SS and ST, and Indels are selected according to the optimization rule. The method can improve the precision (correct rate), recall (recovery rate) and F value of the result.
Description
Technical field
The invention belongs to the field of bioinformatics for high-throughput sequencing, and in particular relates to a method for computing genome mutation data.
Background technology
Variant detection is the foundation of genome function analysis in resequencing, so the accuracy of the detection result directly affects the accuracy of the downstream analysis. To compensate for the shortcomings of any single program, several integrated algorithms based on multiple programs have been published in the variant detection field in recent years. These algorithms merge the results of several programs to improve recall, and extract the results on which the programs agree to improve precision.
Summary of the invention
The technical problem to be solved by the invention is to provide a method for computing genome mutation data.
The invention analyzes how Indel size and genome sequence features affect the precision and recall of variant detection, and proposes an optimization algorithm based on an optimal F-value strategy.
The Indel detection algorithm proposed by the invention is an optimal screening algorithm that integrates the detection results of multiple programs. Samtools, GATK (UnifiedGenotyper), Varscan, Pindel and SOAPIndel are selected to generate the raw Indel data. These five programs use four different algorithms to detect Indel variants.
(1) Samtools and GATK (UnifiedGenotyper) start from the alignment of the sequencing data to the reference genome and detect Indels by using a Bayesian statistical model to calculate the posterior probability of the genotype at each site.
(2) Pindel starts from the unmapped reads in the alignment result and detects insertions and deletions with a pattern growth algorithm.
(3) Varscan starts from the pileup data produced by Samtools and detects Indel variants with a robust heuristic algorithm; it can cope with extreme read depth, pooled sequencing data and contaminated sequencing data.
(4) SOAPIndel reassembles all unmapped reads with a De Bruijn graph algorithm and detects insertions and deletions by aligning the assembly to the reference genome.
The optimization algorithm based on the optimal F-value strategy is as follows:
1) Establish the optimization rule
Samtools, GATK (UnifiedGenotyper), Varscan, Pindel and SOAPIndel are used to detect Indels in simulated data and generate raw Indel data; the combined F value of every pair of programs is calculated, and an optimal selection rule is established from the best F values.
2) Select Indels according to the optimization rule
Samtools, GATK (UnifiedGenotyper), Varscan, Pindel and SOAPIndel are used to detect Indels in the data under test, and the calls are grouped by detection software (DS), repeat region type (RT), variant size (SS) and variant type (ST). Indels are then selected according to the optimization rule.
The Indel detection algorithm proposed by the invention can improve the precision, recall and F value of the result.
Brief description of the drawings
Fig. 1: Trend chart of F values.
Fig. 2: Schematic flow chart of genome variation data acquisition.
Embodiment
The invention analyzes how Indel size and genome sequence features affect the precision and recall of variant detection, and proposes an optimization algorithm based on an optimal F-value strategy.
I. Software selection
The Indel detection algorithm proposed by the invention is an optimal screening algorithm that integrates the detection results of multiple programs. Samtools, GATK (UnifiedGenotyper), Varscan, Pindel and SOAPIndel are selected to generate the raw Indel data. These five programs use four different algorithms to detect Indel variants.
(1) Samtools and GATK (UnifiedGenotyper) start from the alignment of the sequencing data to the reference genome and detect Indels by using a Bayesian statistical model to calculate the posterior probability of the genotype at each site.
(2) Pindel starts from the unmapped reads in the alignment result and detects insertions and deletions with a pattern growth algorithm.
(3) Varscan starts from the pileup data produced by Samtools and detects Indel variants with a robust heuristic algorithm; it can cope with extreme read depth, pooled sequencing data and contaminated sequencing data.
(4) SOAPIndel reassembles all unmapped reads with a De Bruijn graph algorithm and detects insertions and deletions by aligning the assembly to the reference genome.
II. Simulated data
To study in detail how the precision and recall of each program's Indel calls are affected by genome sequence features, the details of every variant must be known, including its position, its size and the features of the genomic region in which it lies. The invention therefore uses computer simulation to insert known variants into the reference genome and generate a new genome sequence, and then uses simulated sequencing to generate sequencing data. The simulated data are summarized in Table 1.
Table 1. Distribution of simulated variants
Variant type | Size (bp) | Count |
---|---|---|
SNP | 1 | 1/1000 ratio |
Indel | 1-50 | 2792000 |
Deletion/Insertion | 51-500 | 20000 |
Duplication | 100-500 | 1000 |
Inversion | 100-500 | 1000 |
Translocation | 100-500 | 1000 |
III. Alignment and detection
The sequencing data are aligned to the soybean reference genome (Williams 82) with BWA [Li and Durbin, 2009] to generate a sam file. The sam file is converted to a bam file with samtools view, the bam file is sorted by coordinate with samtools sort, duplicates are removed with samtools rmdup, and an index is built with samtools index. Variants are then detected with the five programs: the Varscan parameter "minimum sequencing depth" is set to 2, and the remaining programs use their default parameters. Finally, Indels of 1-50 bp are extracted from the results.
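The final extraction step can be illustrated with a short Python sketch. It assumes that every call has already been converted to a record with VCF-style `ref` and `alt` alleles; the five programs do not all emit that format natively, so the record layout, the function names and the `Gm01`/`Gm02` chromosome labels are hypothetical and only serve the illustration.

```python
def indel_size(ref: str, alt: str) -> int:
    """Length difference between the REF and ALT alleles, in bp."""
    return abs(len(alt) - len(ref))


def keep_small_indels(records, min_size=1, max_size=50):
    """Keep records whose allele length difference lies in [min_size, max_size]."""
    return [
        rec for rec in records
        if min_size <= indel_size(rec["ref"], rec["alt"]) <= max_size
    ]


# Hypothetical calls; a real pipeline would parse them from each program's output.
calls = [
    {"chrom": "Gm01", "pos": 100, "ref": "AT", "alt": "A"},           # 1 bp deletion
    {"chrom": "Gm01", "pos": 250, "ref": "G", "alt": "GTTT"},         # 3 bp insertion
    {"chrom": "Gm01", "pos": 400, "ref": "C", "alt": "T"},            # SNP, size 0
    {"chrom": "Gm02", "pos": 900, "ref": "A", "alt": "A" + "C" * 60}, # 60 bp insertion
]

small = keep_small_indels(calls)
print([rec["pos"] for rec in small])
```

The SNP is dropped because its size is 0, and the 60 bp insertion is dropped because it exceeds the 50 bp upper bound.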
IV. Criterion for consistent results
To analyze how the programs complement and validate one another, the criterion for deciding that two programs report the same result must be made explicit. Two standards have been proposed in the literature: one requires an overlap rate above 50%, the other requires an overlap of at least one base [Lam et al., 2012]. Both standards consider only whether the coordinates of the calls overlap. Because the programs use different algorithms, however, the same variant can be reported with different coordinates and even with different sizes.
To ensure the accuracy of the result, the invention stipulates that two calls are judged to be the same Indel only if their sizes are identical. Simulation experiments further show that different programs report different coordinates for the same Indel variant. The main cause of this deviation is sequence similarity: for example, when AT is deleted from the sequence ATATAT, a program may report the deletion at any one of the three AT positions. The coordinate deviation D between the results of two programs is calculated as:
D = |P1 - P2|
where P1 is the start coordinate of Indel1 and P2 is the start coordinate of Indel2.
Statistics over repeated simulation experiments show that, in the soybean genome, the coordinate deviation in non-repetitive regions lies in the range [1, 31], and that the maximum coordinate deviation in a repetitive region equals the length of the repetitive sequence.
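A minimal Python sketch of such a consistency check is given below. The text states the equal-size requirement and the observed deviation ranges, but it does not spell out how the deviation is applied as a threshold; treating 31 (or the repeat length, inside a repetitive region) as the maximum tolerated deviation is therefore an assumption, as are the `Indel` record and the function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indel:
    chrom: str
    start: int  # start coordinate P
    kind: str   # "INS" or "DEL" (assumed labels)
    size: int   # length in bp


def coordinate_deviation(a: Indel, b: Indel) -> int:
    """D = |P1 - P2|, the distance between the two start coordinates."""
    return abs(a.start - b.start)


def is_same_indel(a: Indel, b: Indel, max_deviation: int = 31) -> bool:
    """Treat two calls as one Indel if type and size match and D is small enough.

    max_deviation defaults to 31, the upper bound reported for non-repetitive
    regions; for a call inside a repeat, pass the repeat length instead.
    """
    return (
        a.chrom == b.chrom
        and a.kind == b.kind
        and a.size == b.size
        and coordinate_deviation(a, b) <= max_deviation
    )


x = Indel("Gm01", 100, "DEL", 2)
y = Indel("Gm01", 104, "DEL", 2)  # same deletion reported 4 bp downstream
z = Indel("Gm01", 104, "DEL", 3)  # different size, so never the same Indel
print(is_same_indel(x, y), is_same_indel(x, z))
```

Requiring identical size first keeps the check strict, and the deviation tolerance then absorbs the coordinate ambiguity caused by repeated motifs such as ATATAT.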
V. Analysis of the four important Indel attributes
An Indel has four important attributes: variant type (ST), variant size (SS), type of the repeat region in which it lies (RT) and detection software (DS). The detection results are grouped by these four attributes. For convenience, the invention defines G(F, S) as the result of grouping set S by attribute F. For example, G(ST, S) denotes set S grouped by variant type, and G(SS, G(ST, S)) denotes set S grouped first by variant type and then by variant size, and so on. The simulated data show that, within the grouping G(DS, G(RT, G(SS, G(ST, detection results)))), the precision and recall of different programs differ considerably for Indels of the same size in the same type of repetitive sequence. For 1 bp deletions in non-repetitive sequence, for example, GATK has the highest precision of the five programs (99.83%) but also the lowest recall (41.92%), while Varscan has the highest recall (88.42%). This shows that the software is an important factor affecting detection accuracy. The precision and recall of a single program also differ considerably across repeat types and Indel sizes. The analysis above shows that variant type, variant size, repeat region type and detection software are all important factors affecting the precision and recall of Indel detection.
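The nested grouping G(DS, G(RT, G(SS, G(ST, S)))) can be sketched in Python as repeated grouping by one attribute at a time. The dictionary record layout and the function names below are assumptions made for the illustration; only the attribute abbreviations come from the text.

```python
from collections import defaultdict


def group_by(attribute, records):
    """G(F, S): split the records S into groups keyed by attribute F."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[attribute]].append(rec)
    return dict(groups)


def nested_group(attributes, records):
    """Group by the first attribute, then by the next inside each group, and so on."""
    if not attributes:
        return list(records)
    first, rest = attributes[0], attributes[1:]
    return {
        key: nested_group(rest, members)
        for key, members in group_by(first, records).items()
    }


# Hypothetical calls annotated with the four attributes.
calls = [
    {"ST": "DEL", "SS": 1, "RT": "non-repeat", "DS": "GATK"},
    {"ST": "DEL", "SS": 1, "RT": "non-repeat", "DS": "Varscan"},
    {"ST": "DEL", "SS": 9, "RT": "SSR", "DS": "Samtools"},
    {"ST": "INS", "SS": 1, "RT": "TIR", "DS": "Pindel"},
]

# G(DS, G(RT, G(SS, G(ST, calls)))): ST is applied first, DS last.
tree = nested_group(["ST", "SS", "RT", "DS"], calls)
print(sorted(tree["DEL"][1]["non-repeat"]))
```

The innermost grouping in the G notation is applied first, which is why the attribute list passed to `nested_group` reads ST, SS, RT, DS.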
VI. Optimal screening method based on the optimal F-value strategy
The analysis above shows that, overall, extracting the results on which several programs agree improves precision. The simulated data show, however, that within G(SS, G(ST, consistent results of two programs)) some groups have both high precision and high recall, some have high precision but low recall, and some have very low or even zero precision and recall. Simply merging the consistent results of every pair of programs therefore cannot give the best accuracy. The F value is an index that measures the balance between precision and recall, and is calculated as follows:
F = 2 × p × r / (p + r)
where p is the precision (correct rate) and r is the recall (recovery rate).
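As a worked example, the formula can be evaluated in Python with the precision and recall reported above for GATK on 1 bp deletions in non-repetitive sequence. The function name and the choice to return 0 when both inputs are 0 are assumptions of this sketch, not part of the original method.

```python
def f_value(precision: float, recall: float) -> float:
    """F = 2 * p * r / (p + r), the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


# GATK on 1 bp deletions in non-repetitive sequence: p = 99.83%, r = 41.92%.
gatk_f = f_value(0.9983, 0.4192)
print(round(gatk_f, 4))
```

The result is roughly 0.59, which shows how a very high precision is pulled down by a low recall.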
Simulation experiments show that, within G(RT, G(SS, G(ST, consistent results of two programs))), the F values of the consistent results of two programs (IR) follow a stable pattern, and that for different groups the best F value appears on different IRs (Fig. 1).
In the figure, G1 shows the F value of each IR for 1 bp deletions in TIR-type regions, and G2 shows the F value of each IR for 9 bp deletions in SSR-type regions. The best F value of G1 is obtained from the consistent results of Samtools and Varscan, and the best F value of G2 from the consistent results of Samtools and SOAPIndel.
Based on the analysis above, we propose a simple, intuitive optimization strategy based on the best F value of each group:
1. Establish the optimization rule
Select the Indel detection programs, then simulate chromosomal variation and sequencing. Detect Indels with the programs and calculate the combined F value of every pair of programs. Establish an optimal selection rule from the best F values.
2. Select Indels according to the optimization rule
Detect Indels with the programs and group the calls by DS, RT, SS and ST. Select Indels according to the optimization rule.
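The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: calls are represented as hashable identifiers, a group key stands for one (ST, SS, RT) combination, consistent results are taken as the plain set intersection of two programs' calls (a real pipeline would apply the equal-size and coordinate-deviation criterion instead), and all function and variable names are hypothetical.

```python
from itertools import combinations


def f_value(p, r):
    """F = 2pr / (p + r); 0 when both are 0."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def build_rules(sim_calls, truth):
    """Step 1: for each group, pick the program pair whose consistent calls score the best F.

    sim_calls: {group: {program: set of calls}} from simulated data
    truth:     {group: set of true Indels}
    """
    rules = {}
    for group, by_program in sim_calls.items():
        known = truth.get(group, set())
        best_pair, best_f = None, -1.0
        for a, b in combinations(sorted(by_program), 2):
            consistent = by_program[a] & by_program[b]
            true_pos = len(consistent & known)
            p = true_pos / len(consistent) if consistent else 0.0
            r = true_pos / len(known) if known else 0.0
            f = f_value(p, r)
            if f > best_f:  # ties keep the first pair in sorted order
                best_pair, best_f = (a, b), f
        rules[group] = best_pair
    return rules


def select_indels(test_calls, rules):
    """Step 2: in each group keep the calls shared by that group's best pair."""
    selected = set()
    for group, by_program in test_calls.items():
        pair = rules.get(group)
        if pair is None:
            continue  # no rule was learned for this group
        a, b = pair
        selected |= by_program.get(a, set()) & by_program.get(b, set())
    return selected


group = ("DEL", 1, "TIR")  # (ST, SS, RT)
sim = {group: {"Samtools": {1, 2, 3, 9}, "Varscan": {1, 2, 3, 8}, "GATK": {1}}}
truth = {group: {1, 2, 3, 4}}
rules = build_rules(sim, truth)

test = {group: {"Samtools": {10, 11, 12}, "Varscan": {10, 11}, "GATK": {10, 12}}}
print(rules[group], sorted(select_indels(test, rules)))
```

In this toy data the Samtools and Varscan pair wins the group because its consistent calls recover three of the four true Indels without false positives, so only calls shared by those two programs are kept in the test data.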
Over all variants, the precision of this method (99.32%) is higher than that of Samtools (97.46%), Pindel (94.69%), SOAPIndel (97.24%) and Varscan (98.59%), and its recall (65.20%) is higher than that of GATK UnifiedGenotyper (25.50%) and Pindel (41.36%).
VII. Screening method based on deep learning
The optimal F-value method relies on the consistent results of the programs, so it discards Indels that are detected by only a single program. The simulated data show that such Indels account for close to 20% of the total, and discarding all of them seriously reduces recall. To make fuller use of the results of all programs, and thereby obtain higher recall while still balancing it against precision, the invention designs a deep learning method for screening the calls of all programs. All of the raw data serve as the training set. The training features are the program that detected the Indel, the Indel type, the type of repetitive sequence in which the Indel lies and the number of reads supporting the call (coverage); the training objectives are precision and recall. With this training set, a model can be trained that makes precision and recall as high as possible.
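The four training features could be turned into a numeric vector along the following lines. The text does not describe the encoding or the network, so this plain-Python one-hot scheme is only an assumed illustration; the category labels for Indel type and repeat type, and the function names, are hypothetical, and the repeat-type list is not meant to be complete.

```python
SOFTWARE = ["Samtools", "GATK", "Varscan", "Pindel", "SOAPIndel"]
INDEL_TYPES = ["insertion", "deletion"]      # assumed labels
REPEAT_TYPES = ["non-repeat", "TIR", "SSR"]  # assumed, incomplete list


def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of value."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def encode_call(software, indel_type, repeat_type, supporting_reads):
    """Concatenate the three categorical features and the read count."""
    return (
        one_hot(software, SOFTWARE)
        + one_hot(indel_type, INDEL_TYPES)
        + one_hot(repeat_type, REPEAT_TYPES)
        + [float(supporting_reads)]
    )


vector = encode_call("Varscan", "deletion", "SSR", 12)
print(len(vector), vector)
```

Each call then becomes one fixed-length input row, paired with a label saying whether the call matches a simulated variant.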
We develop the deep learning program with TensorLayer, a deep learning and reinforcement learning library built on Google's TensorFlow.
Claims (2)
1. A method for computing genome mutation data, characterized in that the process is as follows:
1) Establish the optimization rule
Samtools, GATK, Varscan, Pindel and SOAPIndel are used to detect Indels in simulated data and generate raw Indel data; the combined F value of every pair of programs is calculated, and an optimal selection rule is established from the best F values;
2) Select Indels according to the optimization rule
Samtools, GATK, Varscan, Pindel and SOAPIndel are used to detect Indels in the data under test; the calls are grouped by DS, RT, SS and ST, and Indels are selected according to the optimization rule.
2. The method according to claim 1, characterized in that F = 2 × p × r / (p + r), where p is the precision (correct rate) and r is the recall (recovery rate).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710761660.5A CN107679366A (en) | 2017-08-30 | 2017-08-30 | A kind of computational methods of genome mutation data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107679366A true CN107679366A (en) | 2018-02-09 |
Family
ID=61134390
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710761660.5A Pending CN107679366A (en) | 2017-08-30 | 2017-08-30 | A kind of computational methods of genome mutation data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107679366A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103810402A (en) * | 2014-02-25 | 2014-05-21 | 北京诺禾致源生物信息科技有限公司 | Data processing method and device for genomes |
CN104298892A (en) * | 2014-09-18 | 2015-01-21 | 天津诺禾致源生物信息科技有限公司 | Detection device and method for gene fusion |
US20160312303A1 (en) * | 2015-04-23 | 2016-10-27 | Quest Diagnostics Investments Incorporated | Methods and compositions for the detection of calr mutations in myeloproliferative diseases |
CN106446254A (en) * | 2016-10-14 | 2017-02-22 | 北京百度网讯科技有限公司 | File detection method and device |
Non-Patent Citations (2)
Title |
---|
丛培宽: "全基因组外显子测序发现X连锁显性遗传性高度近视疾病的致病基因及人类基因变异数据库LOVD的创建", 《中国博士学位论文全文数据库 信息科技辑》 * |
王春宇: "生物高通量测序片段拼接与分子标记识别算法研究", 《中国博士学位论文全文数据库 信息科技辑》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117106875A (en) * | 2023-10-23 | 2023-11-24 | 中国科学院昆明植物研究所 | Method for estimating plant genome size and/or repeatability based on low-depth sequencing |
CN117106875B (en) * | 2023-10-23 | 2024-02-06 | 中国科学院昆明植物研究所 | Method for estimating plant genome size and/or repeatability based on low-depth sequencing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180209 |