CN107679366A - A method for computing genome variation data - Google Patents

A method for computing genome variation data

Info

Publication number
CN107679366A
Authority
CN
China
Prior art keywords
indel
software
data
values
rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710761660.5A
Other languages
Chinese (zh)
Inventor
袁晓辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Ancient Gene Technology Co Ltd
Original Assignee
Wuhan Ancient Gene Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Ancient Gene Technology Co Ltd filed Critical Wuhan Ancient Gene Technology Co Ltd
Priority to CN201710761660.5A priority Critical patent/CN107679366A/en
Publication of CN107679366A publication Critical patent/CN107679366A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the bioinformatics field of high-throughput sequencing, and in particular relates to a method for computing genome variation data. Samtools, GATK, Varscan, Pindel and SOAPIndel are selected to perform Indel detection on simulated data and generate raw Indel data; the joint F value of every pair of programs is calculated, and an optimal selection rule is established from the optimal F values. Indel detection is then performed on the data under test with the same software, the results are grouped by DS, RT, SS and ST, and Indels are selected according to the optimization rule. The method can improve the accuracy, recovery rate and F value of the results.

Description

A method for computing genome variation data
Technical field
The invention belongs to the bioinformatics field of high-throughput sequencing, and in particular relates to a method for computing genome variation data.
Background
Variant detection is the basis of genome functional analysis in resequencing, so the accuracy of the detection results directly affects the accuracy of the downstream analysis. To compensate for the shortcomings of single-program results, several integration algorithms based on multiple programs have been published in the variant detection field in recent years. They merge the results of several programs to improve the recovery rate, and extract the concordant results of several programs to improve accuracy.
Summary of the invention
The technical problem to be solved by the invention is to provide a method for computing genome variation data.
The invention analyzes the influence of Indel size and genome sequence features on the accuracy and recovery rate of variant detection results, and proposes an optimization algorithm based on an optimal-F-value strategy.
The Indel detection algorithm proposed by the invention is an optimal screening algorithm that integrates the detection results of multiple programs. Samtools, GATK (UnifiedGenotyper), Varscan, Pindel and SOAPIndel are selected to generate the raw Indel data. These five programs use four different algorithms to detect Indel variants.
(1) Samtools and GATK (UnifiedGenotyper) work from the alignment of the sequencing data to the reference genome and detect Indels by computing the posterior probability of the genotype at each site with a Bayesian statistical model.
(2) Pindel works from the unmapped reads in the alignment results and detects insertion and deletion variants with a pattern growth algorithm.
(3) Varscan works from Samtools pileup data and detects Indel variants with a robust heuristic algorithm; it can handle problems such as extreme read depth, pooled sequencing data and contaminated sequencing data.
(4) SOAPIndel assembles all unmapped reads with a De Bruijn graph algorithm and detects insertion and deletion variants by aligning the assembly to the reference genome.
The optimization algorithm based on the optimal-F-value strategy is as follows:
1) Establish the optimization rule
Samtools, GATK (UnifiedGenotyper), Varscan, Pindel and SOAPIndel are selected to perform Indel detection on simulated data and generate raw Indel data. The joint F value of every pair of programs is calculated, and an optimal selection rule is established from the optimal F values.
2) Select Indels according to the optimization rule
Indel detection is performed on the data under test with Samtools, GATK (UnifiedGenotyper), Varscan, Pindel and SOAPIndel, and the results are grouped by DS, RT, SS and ST. Indels are then selected according to the optimization rule.
The Indel detection algorithm proposed by the invention can improve the accuracy, recovery rate and F value of the results.
Brief description of the drawings
Fig. 1: Trend of F values.
Fig. 2: Schematic flow chart of genome variation data acquisition.
Detailed description
The invention analyzes the influence of Indel size and genome sequence features on the accuracy and recovery rate of variant detection results, and proposes an optimization algorithm based on an optimal-F-value strategy.
1. Software selection
The Indel detection algorithm proposed by the invention is an optimal screening algorithm that integrates the detection results of multiple programs. Samtools, GATK (UnifiedGenotyper), Varscan, Pindel and SOAPIndel are selected to generate the raw Indel data. These five programs use four different algorithms to detect Indel variants.
(1) Samtools and GATK (UnifiedGenotyper) work from the alignment of the sequencing data to the reference genome and detect Indels by computing the posterior probability of the genotype at each site with a Bayesian statistical model.
(2) Pindel works from the unmapped reads in the alignment results and detects insertion and deletion variants with a pattern growth algorithm.
(3) Varscan works from Samtools pileup data and detects Indel variants with a robust heuristic algorithm; it can handle problems such as extreme read depth, pooled sequencing data and contaminated sequencing data.
(4) SOAPIndel assembles all unmapped reads with a De Bruijn graph algorithm and detects insertion and deletion variants by aligning the assembly to the reference genome.
2. Simulated data
A detailed study of how accuracy, recovery rate and genome sequence features affect the Indel detection results of each program requires full knowledge of every variant, including its position, its size and the features of the genomic region in which it lies. The invention therefore uses computer simulation to add known variants to the reference genome and generate a new genome sequence, and then uses simulated sequencing to generate sequencing data. The simulated data are shown in Table 1.
Table 1. Distribution of simulated variants
Variant type          Size (bp)   Quantity
SNP                   1           1/1000 ratio
Indel                 1-50        2792000
Deletion/Insertion    51-500      20000
Duplication           100-500     1000
Inversion             100-500     1000
Translocation         100-500     1000
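The step of adding known variants to a reference sequence can be illustrated with a minimal Python sketch. It is not the simulator used in the patent, which is not named. The tuple layout `(position, kind, payload)`, the 0-based coordinates and the function name `apply_indels` are assumptions made for illustration, and the variants are assumed not to overlap.

```python
from typing import List, Tuple, Union

# Hypothetical variant record: (position, kind, payload).
# position: 0-based index into the reference (an assumption of this sketch).
# kind: "ins" or "del".
# payload: inserted sequence for "ins", number of deleted bases for "del".
Variant = Tuple[int, str, Union[int, str]]


def apply_indels(reference: str, variants: List[Variant]) -> str:
    """Return a new sequence with the known, non-overlapping indels applied."""
    seq = reference
    # Apply from the rightmost position so that earlier coordinates stay valid.
    for pos, kind, payload in sorted(variants, key=lambda v: v[0], reverse=True):
        if kind == "ins":
            seq = seq[:pos] + str(payload) + seq[pos:]
        elif kind == "del":
            seq = seq[:pos] + seq[pos + int(payload):]
        else:
            raise ValueError(f"unsupported variant kind: {kind}")
    return seq


reference = "ACGTACGTAC"
truth = [(2, "del", 2), (8, "ins", "TT")]  # the known variants are the ground truth
simulated = apply_indels(reference, truth)
print(simulated)
```

Because the list `truth` is kept, every later call made by a detection program can be checked against a known position and size.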
3. Alignment and detection
The sequencing data are aligned to the soybean reference genome (Williams 82) with BWA [Li and Durbin, 2009] to generate SAM files. The SAM files are converted to BAM files with samtools view, the BAM files are sorted by coordinate with samtools sort, duplicates are removed with samtools rmdup, and an index is built with samtools index. Variants are then detected with the five programs; the Varscan parameter "minimum sequencing depth" is set to 2, and the remaining programs use their default parameters. Finally, Indels of 1-50 bp are extracted from the results.
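The final extraction step can be sketched as follows, assuming the calls of every program have already been normalized to VCF-style REF and ALT alleles. The dictionary keys `ref` and `alt` and both function names are hypothetical; the native output formats of Pindel and SOAPIndel differ and would need their own parsers.

```python
def indel_size(ref: str, alt: str) -> int:
    """Length difference between ALT and REF; 0 means the call is not an indel."""
    return abs(len(alt) - len(ref))


def keep_small_indels(calls, min_size=1, max_size=50):
    """Keep only calls whose indel size lies in [min_size, max_size] bp."""
    return [
        call for call in calls
        if min_size <= indel_size(call["ref"], call["alt"]) <= max_size
    ]


calls = [
    {"ref": "A", "alt": "ATG"},     # 2 bp insertion: kept
    {"ref": "A", "alt": "G"},       # SNP, size 0: dropped
    {"ref": "A" * 61, "alt": "A"},  # 60 bp deletion: dropped
]
small = keep_small_indels(calls)
print(len(small))
```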
4. Criteria for concordant results
To analyze how the programs complement and validate one another, the criterion for deciding that two programs report a concordant result must be made explicit. The literature proposes two criteria for this problem: an overlap rate above 50%, or an overlap of at least one base [Lam et al., 2012]. Both criteria consider only the overlap of the reported coordinates. However, differences between software algorithms can cause differences in the reported coordinates and even in the reported size.
To ensure the accuracy of the detection results, the invention stipulates that two calls are judged to be the same Indel only if their sizes are identical. Simulation experiments also show that different programs report different coordinates for the same Indel variant. The main cause is sequence similarity: for example, when one AT is deleted from the sequence ATATAT, a program may report any of the three AT positions. The coordinate deviation D between the results of two programs is calculated with the following formula:
D = |P1 - P2|
where P1 is the start coordinate of Indel1 and P2 is the start coordinate of Indel2.
Statistics from repeated simulation experiments show that, in the soybean genome, the coordinate deviation in non-repetitive regions lies in the range [1, 31], and the maximum coordinate deviation in repetitive regions equals the length of the repeat.
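The concordance criterion can be sketched in Python as below. The `Indel` record, the field names and the way the deviation bound is used as a tolerance are assumptions of this sketch. The source states only that sizes must be identical and reports the observed deviation ranges, so the default of 31 simply reuses the upper bound reported for non-repetitive regions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indel:
    chrom: str   # chromosome name (hypothetical field)
    start: int   # start coordinate, P in the formula
    kind: str    # "ins" or "del"
    size: int    # length in bp


def coordinate_deviation(a: Indel, b: Indel) -> int:
    """D = |P1 - P2| for the start coordinates of two calls."""
    return abs(a.start - b.start)


def is_same_indel(a: Indel, b: Indel, max_deviation: int = 31) -> bool:
    """Two calls match only if type and size are identical and D is small.

    max_deviation is an assumed tolerance; in a repeat region it could be
    set to the repeat length instead.
    """
    return (
        a.chrom == b.chrom
        and a.kind == b.kind
        and a.size == b.size
        and coordinate_deviation(a, b) <= max_deviation
    )


# Why coordinates differ: deleting any one "AT" unit of ATATAT
# yields the same sequence, so all three positions are equally valid.
seq = "ATATAT"
results = {seq[:i] + seq[i + 2:] for i in (0, 2, 4)}
print(results)
```

For instance, `is_same_indel(Indel("chr1", 100, "del", 2), Indel("chr1", 104, "del", 2))` is true, while changing the second size to 3 makes it false regardless of the coordinates.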
5. Analysis of the four important Indel attributes
Indels have four important attributes: variant type (ST), variant size (SS), type of the repeat region in which the Indel lies (RT), and detection software (DS). Detection results are grouped by these four attributes. For convenience, the invention defines G(F, S) as the result of grouping set S by attribute F; for example, G(ST, S) means that set S is grouped by variant type, and G(SS, G(ST, S)) means that set S is first grouped by variant type and then by variant size, and so on. The simulated data show that, within the grouping G(DS, G(RT, G(SS, G(ST, detection results)))), the accuracy and recovery rate of different programs differ considerably for Indels of the same size in the same type of repeat sequence. For example, for 1 bp deletions in non-repetitive sequence, GATK has the highest accuracy of the five programs (99.83%) but also the lowest recovery rate (41.92%), while Varscan has the highest recovery rate (88.42%). This shows that the software is an important factor affecting detection accuracy. The accuracy and recovery rate of a single program also differ considerably across repeat types and Indel sizes. The analysis above shows that the four attributes (variant type, variant size, repeat region type and detection software) are all important factors affecting the accuracy and recovery rate of Indel detection.
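The nested grouping G(DS, G(RT, G(SS, G(ST, S)))) can be illustrated with a short Python sketch. Nesting the groupings is equivalent to partitioning by the tuple (ST, SS, RT, DS), which is what the hypothetical helper `nested_group` does; the dictionary keys and the example values are assumptions for illustration.

```python
from collections import defaultdict


def group(attr, calls):
    """G(attr, S): partition a list of calls by the value of one attribute."""
    groups = defaultdict(list)
    for call in calls:
        groups[call[attr]].append(call)
    return dict(groups)


def nested_group(attrs, calls):
    """Apply G for each attribute in turn, innermost first.

    The result is flattened: each key is the tuple of attribute values.
    """
    groups = defaultdict(list)
    for call in calls:
        groups[tuple(call[a] for a in attrs)].append(call)
    return dict(groups)


calls = [
    {"ST": "del", "SS": 1, "RT": "none", "DS": "GATK"},
    {"ST": "del", "SS": 1, "RT": "none", "DS": "Varscan"},
    {"ST": "del", "SS": 1, "RT": "none", "DS": "GATK"},
    {"ST": "ins", "SS": 9, "RT": "SSR", "DS": "Pindel"},
]
by_type = group("ST", calls)
grouped = nested_group(("ST", "SS", "RT", "DS"), calls)
for key, members in grouped.items():
    print(key, len(members))
```

Accuracy and recovery rate can then be computed per key, which is how per-program differences within one variant type, size and repeat class become visible.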
6. Optimal screening method based on the optimal-F-value strategy
The analysis above shows that, overall, extracting the concordant results of multiple programs improves accuracy. However, the simulated data show that, among the groups in G(SS, G(ST, concordant results of two programs)), some groups have both high accuracy and high recovery rate, some have high accuracy but low recovery rate, and some have very low or even zero accuracy and recovery rate. Directly merging the concordant results of every pair of programs therefore cannot achieve optimal accuracy. The F value is an index for assessing the balance between accuracy and recovery rate, and is calculated as follows.
F = 2 × p × r / (p + r)
where p is the accuracy and r is the recovery rate.
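As a worked illustration of the formula, the following Python sketch computes the F value. It also computes p and r from counts, using the usual definitions p = TP / (TP + FP) and r = TP / (TP + FN). Those definitions are an assumption here, since the source does not spell them out, and the function names are hypothetical.

```python
def f_value(p: float, r: float) -> float:
    """F = 2 * p * r / (p + r); defined as 0 when both p and r are 0."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


def accuracy_and_recovery(tp: int, fp: int, fn: int):
    """Assumed definitions: p = TP / (TP + FP), r = TP / (TP + FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


# Values the description reports for GATK on 1 bp deletions in non-repetitive sequence.
f_gatk = f_value(0.9983, 0.4192)
print(round(f_gatk, 2))
```

With these inputs the F value is about 0.59: a very high accuracy cannot compensate for a low recovery rate, which is why the F value is used as the balancing index.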
Simulation experiments show that, within G(RT, G(SS, G(ST, concordant results of two programs))), the F values of the concordant results of two programs (IR) follow a stable pattern, and that the optimal F value appears on a different IR for different groups (Fig. 1).
In the figure, G1 is the F value of each IR for 1 bp deletions in TIR-type regions, and G2 is the F value of each IR for 9 bp deletions in SSR-type regions. The optimal F value of G1 appears in the concordant results of Samtools and Varscan, and the optimal F value of G2 appears in the concordant results of Samtools and SOAPIndel.
Based on the analysis above, we propose a simple and intuitive optimization strategy based on the optimal F value of each group:
1. Establish the optimization rule
Select the Indel detection software, then simulate chromosomal variants and sequencing. Perform Indel detection with the programs and calculate the joint F value of every pair of programs. Establish an optimal selection rule from the optimal F values.
2. Select Indels according to the optimization rule
Perform Indel detection with the software and group the results by DS, RT, SS and ST. Select Indels according to the optimization rule.
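The two steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patent's implementation. Each Indel is a hashable `(group, id)` tuple, concordance between programs is reduced to set intersection (a real pipeline would match calls by size and coordinate deviation), ties go to the first pair in sorted order, and the names `build_rule` and `select` are hypothetical.

```python
from itertools import combinations


def f_value(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


def score(called, truth):
    """F value of a set of calls against the known simulated variants."""
    tp = len(called & truth)
    p = tp / len(called) if called else 0.0
    r = tp / len(truth) if truth else 0.0
    return f_value(p, r)


def build_rule(calls_by_tool, truth, group_of):
    """Step 1: for each group, keep the pair of programs with the best joint F."""
    rule = {}
    for g in sorted({group_of(i) for i in truth}):
        truth_g = {i for i in truth if group_of(i) == g}
        best = None
        for a, b in combinations(sorted(calls_by_tool), 2):
            concordant = {i for i in calls_by_tool[a] & calls_by_tool[b]
                          if group_of(i) == g}
            f = score(concordant, truth_g)
            if best is None or f > best[0]:
                best = (f, (a, b))
        rule[g] = best[1]
    return rule


def select(calls_by_tool, rule, group_of):
    """Step 2: in each group, keep the concordant calls of the chosen pair."""
    selected = set()
    for g, (a, b) in rule.items():
        selected |= {i for i in calls_by_tool[a] & calls_by_tool[b]
                     if group_of(i) == g}
    return selected


group_of = lambda indel: indel[0]
truth = {("G1", 1), ("G1", 2), ("G2", 3)}
simulated_calls = {
    "Samtools": {("G1", 1), ("G1", 2), ("G2", 3)},
    "Varscan": {("G1", 1), ("G1", 2), ("G1", 9)},
    "SOAPIndel": {("G1", 1), ("G2", 3), ("G2", 8)},
}
rule = build_rule(simulated_calls, truth, group_of)

test_calls = {
    "Samtools": {("G1", 5), ("G2", 6), ("G2", 7)},
    "Varscan": {("G1", 5), ("G2", 7)},
    "SOAPIndel": {("G1", 5), ("G2", 6)},
}
selected = select(test_calls, rule, group_of)
print(rule, selected)
```

In this toy data the rule picks Samtools with Varscan for group G1 and SOAPIndel with Samtools for group G2, so each group is filtered by a different pair of programs.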
Across all variants, the accuracy of this method (99.32%) is higher than that of Samtools (97.46%), Pindel (94.69%), SOAPIndel (97.24%) and Varscan (98.59%), and its recovery rate (65.20%) is higher than that of GATK UnifiedGenotyper (25.50%) and Pindel (41.36%).
7. Screening method based on deep learning
The optimal-F-value method is based on the concordant results of the programs, so it discards Indels detected by only a single program. The simulated data show that Indels detected by only one program account for close to 20% of the total, so discarding all of them seriously reduces the recovery rate. To make fuller use of the results of all programs and obtain a higher recovery rate while maintaining balance, the invention designs a deep learning method to screen the detection results of all programs. All raw data are used as the training set. The training features are the software that detected the Indel, the Indel type, the type of repeat sequence in which the Indel lies, and the number of reads supporting the Indel call (coverage); the training targets are accuracy and recovery rate. With the training set, a model can be trained that makes the accuracy and recovery rate as high as possible.
The deep learning program is developed with TensorLayer, a deep learning and reinforcement learning library built on Google TensorFlow.
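The source does not describe the network or how the features are encoded. As a hedged illustration of the input side only, the four training features could be turned into a numeric vector as below. The category lists (in particular the repeat types, which reuse only the classes named in the description), the dictionary keys and the function names are assumptions; no TensorLayer or TensorFlow call is shown.

```python
TOOLS = ["Samtools", "GATK", "Varscan", "Pindel", "SOAPIndel"]
INDEL_TYPES = ["insertion", "deletion"]
REPEAT_TYPES = ["none", "TIR", "SSR"]  # assumed subset, not an exhaustive list


def one_hot(value, categories):
    """Encode a categorical value as a list with a single 1.0."""
    if value not in categories:
        raise ValueError(f"unknown category: {value}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode(call):
    """Feature vector: tool, Indel type, repeat type (one-hot) plus coverage."""
    return (
        one_hot(call["tool"], TOOLS)
        + one_hot(call["indel_type"], INDEL_TYPES)
        + one_hot(call["repeat_type"], REPEAT_TYPES)
        + [float(call["coverage"])]
    )


vec = encode({"tool": "GATK", "indel_type": "deletion",
              "repeat_type": "SSR", "coverage": 12})
print(len(vec), vec)
```

Vectors of this kind, paired with a label saying whether each call matches a known simulated variant, would be one plausible way to form a training set for such a model.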

Claims (2)

1. A method for computing genome variation data, characterized in that the process is as follows:
1) Establish the optimization rule
Samtools, GATK, Varscan, Pindel and SOAPIndel are selected to perform Indel detection on simulated data and generate raw Indel data; the joint F value of every pair of programs is calculated, and an optimal selection rule is established from the optimal F values;
2) Select Indels according to the optimization rule
Indel detection is performed on the data under test with Samtools, GATK, Varscan, Pindel and SOAPIndel, the results are grouped by DS, RT, SS and ST, and Indels are selected according to the optimization rule.
2. The method according to claim 1, characterized in that F = 2 × p × r / (p + r), where p is the accuracy and r is the recovery rate.
CN201710761660.5A 2017-08-30 2017-08-30 A kind of computational methods of genome mutation data Pending CN107679366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710761660.5A CN107679366A (en) 2017-08-30 2017-08-30 A kind of computational methods of genome mutation data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710761660.5A CN107679366A (en) 2017-08-30 2017-08-30 A kind of computational methods of genome mutation data

Publications (1)

Publication Number Publication Date
CN107679366A true CN107679366A (en) 2018-02-09

Family

ID=61134390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710761660.5A Pending CN107679366A (en) 2017-08-30 2017-08-30 A kind of computational methods of genome mutation data

Country Status (1)

Country Link
CN (1) CN107679366A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117106875A (en) * 2023-10-23 2023-11-24 中国科学院昆明植物研究所 Method for estimating plant genome size and/or repeatability based on low-depth sequencing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810402A (en) * 2014-02-25 2014-05-21 北京诺禾致源生物信息科技有限公司 Data processing method and device for genomes
CN104298892A (en) * 2014-09-18 2015-01-21 天津诺禾致源生物信息科技有限公司 Detection device and method for gene fusion
US20160312303A1 (en) * 2015-04-23 2016-10-27 Quest Diagnostics Investments Incorporated Methods and compositions for the detection of calr mutations in myeloproliferative diseases
CN106446254A (en) * 2016-10-14 2017-02-22 北京百度网讯科技有限公司 File detection method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
丛培宽: "全基因组外显子测序发现X连锁显性遗传性高度近视疾病的致病基因及人类基因变异数据库LOVD的创建", 《中国博士学位论文全文数据库 信息科技辑》 *
王春宇: "生物高通量测序片段拼接与分子标记识别算法研究", 《中国博士学位论文全文数据库 信息科技辑》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117106875A (en) * 2023-10-23 2023-11-24 中国科学院昆明植物研究所 Method for estimating plant genome size and/or repeatability based on low-depth sequencing
CN117106875B (en) * 2023-10-23 2024-02-06 中国科学院昆明植物研究所 Method for estimating plant genome size and/or repeatability based on low-depth sequencing

Similar Documents

Publication Publication Date Title
Rannala et al. Species delimitation
CN107025318A (en) Method and apparatus for exploring new material
CN106611106B (en) Genetic mutation detection method and device
CN109492765A (en) A kind of image Increment Learning Algorithm based on migration models
CN110363344A (en) Probability integral parameter prediction method based on MIV-GP algorithm optimization BP neural network
CN104331642B (en) Integrated learning method for recognizing ECM (extracellular matrix) protein
CN104809476B (en) A kind of multi-target evolution Fuzzy Rule Classification method based on decomposition
CN101866316A (en) Software defect positioning method based on relative redundant test set reduction
CN107680018A (en) A kind of college entrance will based on big data and artificial intelligence makes a report on system and method
CN107451419A (en) It is a kind of that the method for simplifying DNA methylation sequencing data is produced by computer program simulation
CN102254020A (en) Global K-means clustering method based on feature weight
US20070179917A1 (en) Intelligent design optimization method and system
CN109951468A (en) A kind of network attack detecting method and system based on the optimization of F value
CN110990784A (en) Cigarette ventilation rate prediction method based on gradient lifting regression tree
CN105138975A (en) Human body complexion area segmentation method based on deep belief network
Rabier et al. On the inference of complex phylogenetic networks by Markov Chain Monte-Carlo
CN107679366A (en) A kind of computational methods of genome mutation data
CN108320797B (en) Nasopharyngeal carcinoma database and comprehensive diagnosis and treatment decision method based on database
CN106651167A (en) Biological information engineer skill rating system
Manolopoulou et al. BPEC: An R package for Bayesian phylogeographic and ecological clustering
CN109388875A (en) A kind of design implementation method of elastic element module
Zhao et al. Potato (Solanum tuberosum L.) tuber-root modeling method based on physical properties
CN103294828A (en) Verification method and verification device of data mining model dimension
CN111370055A (en) Intron retention prediction model establishing method and prediction method thereof
CN103970973B (en) A kind of simulation RCT analysis methods based on True Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180209
