CN107679366A

CN107679366A - A computational method for genome variation data

Info

Publication number: CN107679366A
Application number: CN201710761660.5A
Authority: CN
Inventors: 袁晓辉
Original assignee: Wuhan Ancient Gene Technology Co Ltd
Current assignee: Wuhan Ancient Gene Technology Co Ltd
Priority date: 2017-08-30
Filing date: 2017-08-30
Publication date: 2018-02-09

Abstract

The invention belongs to the field of bioinformatics for high-throughput sequencing, and more particularly relates to a computational method for genome variation data. Samtools, GATK, Varscan, Pindel and SOAPIndel are used to detect Indels in simulated data and generate raw Indel data; the joint F value of every pair of programs is calculated, and an optimal selection rule is established from the optimal F values. The same programs are then used to detect Indels in the data under test, the results are grouped by DS, RT, SS and ST, and Indels are selected according to the optimization rule. The method can improve the precision, recall and F value of the results.

Description

A computational method for genome variation data

Technical field

The invention belongs to the field of bioinformatics for high-throughput sequencing, and more particularly relates to a computational method for genome variation data.

Background art

In resequencing, variant detection is the basis of functional genome analysis, so the accuracy of the detection results directly affects the accuracy of the downstream analysis. To compensate for the shortcomings of any single program, several integration algorithms based on multiple programs have been published in the variant detection field in recent years. These algorithms improve recall by merging the results of the programs, and improve precision by extracting the results on which the programs agree.

Summary of the invention

The technical problem to be solved by the invention is to provide a computational method for genome variation data.

The invention analyses how Indel size and genomic sequence features affect the precision and recall of variant detection results, and proposes an optimization algorithm based on an optimal F value strategy.

The Indel detection algorithm proposed by the invention is an optimal screening algorithm that integrates the detection results of multiple programs. Samtools, GATK (UnifiedGenotyper), Varscan, Pindel and SOAPIndel are selected to generate the raw Indel data. These five programs use four different algorithms to detect Indel variants.

(1) Samtools and GATK (UnifiedGenotyper) work from the alignment of the sequencing data to the reference genome and detect Indels by using a Bayesian statistical model to calculate the posterior probability of the genotype at each locus.

(2) Pindel works from the unmapped reads in the alignment results and detects insertion and deletion variants with a pattern growth algorithm.

(3) Varscan works from Samtools pileup data and detects Indel variants with a robust heuristic algorithm; it can handle problems such as extreme read depth, pooled sequencing data and contaminated sequencing data.

(4) SOAPIndel assembles all unmapped reads with a De Bruijn graph algorithm and detects insertion and deletion variants by aligning the assembly to the reference genome.

The optimization algorithm based on the optimal F value strategy is as follows:

1) Establish the optimization rule

Samtools, GATK (UnifiedGenotyper), Varscan, Pindel and SOAPIndel are used to detect Indels in simulated data and generate raw Indel data. The joint F value of every pair of programs is calculated, and an optimal selection rule is established from the optimal F values.

2) Select Indels according to the optimization rule

Samtools, GATK (UnifiedGenotyper), Varscan, Pindel and SOAPIndel are used to detect Indels in the data under test, and the results are grouped by DS, RT, SS and ST. Indels are then selected according to the optimization rule.

The Indel detection algorithm proposed by the invention can improve the precision, recall and F value of the results.

Brief description of the drawings

Fig. 1: F value trend chart.

Fig. 2: Flow chart of genome variation data acquisition.

Detailed description of the embodiments

I. Software selection

II. Simulated data

To study in detail the precision and recall of each program's Indel detection results, and the influence of genomic sequence features on those results, the specifics of every variant must be known, including its position, its size and the features of the genomic region in which it lies. The invention therefore uses computer simulation to add known variants to the reference genome and generate a new genome sequence, and then uses simulated sequencing to generate sequencing data. The simulated data are shown in Table 1.

Table 1. Variant distribution

Variant type	Size (bp)	Quantity
SNP	1	1/1000 ratio
Indel	1-50	2792000
Deletion/Insertion	51-500	20000
Duplication	100-500	1000
Inversion	100-500	1000
Translocation	100-500	1000
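As an illustration of adding known variants to a reference sequence, the following minimal Python sketch applies a small truth set to a toy sequence. The `(position, ref, alt)` tuple layout, the function name `apply_variants` and the toy sequence are assumptions made for this example; the patent does not name the simulation tool it used, and overlapping variants are not handled here.

```python
from typing import List, Tuple

# Hypothetical representation: (0-based position, reference allele, alternate allele),
# all given in the coordinates of the original reference.
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Return a new sequence with every known variant applied."""
    seq = reference
    # Apply from right to left so that earlier coordinates remain valid.
    for pos, ref, alt in sorted(variants, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


truth = [
    (2, "A", "T"),    # SNP
    (5, "GG", ""),    # 2 bp deletion
    (9, "", "CCC"),   # 3 bp insertion before position 9
]
simulated = apply_variants("ACATTGGAACTT", truth)
print(simulated)
```

Because the truth set is kept alongside the simulated sequence, every later detection result can be scored against variants whose position and size are known exactly.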

III. Alignment and detection

The sequencing data are aligned to the soybean reference genome (Williams 82) with BWA [Li and Durbin, 2009] to generate SAM files. The SAM files are converted to BAM files with samtools view, the BAM files are sorted by coordinate with samtools sort, duplicates are removed with samtools rmdup, and an index is built with samtools index. Variants are then detected with the five programs; the Varscan parameter "minimum sequencing depth" is set to 2, and the remaining programs use their default parameters. Finally, the 1-50 bp Indels are extracted from the results.
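The post-alignment steps named above can be sketched as a list of commands. This sketch assumes the samtools 1.x command-line syntax (`view -b -o`, `sort -o`); older samtools releases use a different `sort` syntax, and the patent does not state which version or options were used. The file names and the helper `samtools_steps` are hypothetical, and the commands are only built and printed, not executed.

```python
from typing import List


def samtools_steps(sam_path: str, prefix: str) -> List[List[str]]:
    """Build the commands for SAM -> sorted, de-duplicated, indexed BAM."""
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    dedup_bam = f"{prefix}.rmdup.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],   # SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],        # sort by coordinate
        ["samtools", "rmdup", sorted_bam, dedup_bam],       # remove duplicates
        ["samtools", "index", dedup_bam],                   # build the index
    ]


steps = samtools_steps("sample.sam", "sample")
for cmd in steps:
    # To run a step, pass the list to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```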

IV. Criterion for consistent results

To analyse how the programs complement and validate one another, the criterion for deciding that two programs report a consistent result must be made explicit. The literature proposes two standards for this problem: one is an overlap rate of more than 50%, the other is an overlap of one or more bases [Lam et al., 2012]. Both standards consider only whether the coordinates of the detection results overlap. However, differences between the algorithms can cause differences in the reported coordinates and even in the reported size.

To ensure the accuracy of the detection results, the invention specifies that two calls are judged to be the same Indel only if they have the same size. Simulation experiments further showed that different programs report different coordinates for the same Indel variant. The main cause of this deviation is sequence similarity: for example, when AT is deleted from the sequence ATATAT, a program may report any one of the three AT copies. The coordinate deviation D between the results of two programs is calculated with the following formula:

D = |P₁ - P₂|

where P₁ is the start coordinate of Indel 1 and P₂ is the start coordinate of Indel 2.

The statistics from repeated simulation experiments show that, in the soybean genome, the coordinate deviation in non-repetitive regions falls in the range [1, 31], while in repetitive regions the maximum coordinate deviation equals the length of the repetitive sequence.
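The criterion above can be sketched in Python as follows. The patent states that the sizes must be identical and reports the observed deviation ranges; using those ranges as a matching tolerance is an assumption of this sketch, as are the `Indel` record, its `repeat_len` field and the function names.

```python
from dataclasses import dataclass

# Upper bound of the [1, 31] range reported for non-repetitive soybean sequence.
NON_REPEAT_MAX_DEVIATION = 31


@dataclass(frozen=True)
class Indel:
    chrom: str
    start: int           # start coordinate P
    size: int            # length in bp
    kind: str            # "INS" or "DEL"
    repeat_len: int = 0  # hypothetical field: length of the enclosing repeat, 0 if none


def coordinate_deviation(a: Indel, b: Indel) -> int:
    """D = |P1 - P2|."""
    return abs(a.start - b.start)


def same_indel(a: Indel, b: Indel) -> bool:
    """Same chromosome, type and size, and start coordinates within the tolerance."""
    if a.chrom != b.chrom or a.kind != b.kind or a.size != b.size:
        return False
    # Assumed tolerance: repeat length inside repeats, 31 bp elsewhere.
    limit = max(a.repeat_len, b.repeat_len) or NON_REPEAT_MAX_DEVIATION
    return coordinate_deviation(a, b) <= limit


# Deleting AT from ATATAT: two programs may report different AT copies.
call_1 = Indel("chr1", 100, 2, "DEL", repeat_len=6)
call_2 = Indel("chr1", 104, 2, "DEL", repeat_len=6)
call_3 = Indel("chr1", 104, 3, "DEL", repeat_len=6)
print(same_indel(call_1, call_2), same_indel(call_1, call_3))
```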

V. Analysis of the four important Indel attributes

An Indel has four important attributes: variant type (ST), variant size (SS), type of the repeat region in which it lies (RT) and detection software (DS). The detection results are grouped by these four attributes. For convenience, the invention defines G(F, S) as the result of grouping the set S by attribute F. For example, G(ST, S) means that S is grouped by variant type, G(SS, G(ST, S)) means that S is first grouped by variant type and then by variant size, and so on.

The simulation data show that, in the grouping G(DS, G(RT, G(SS, G(ST, detection results)))), different programs differ considerably in precision and recall for Indels of the same size in the same type of repetitive sequence. For 1 bp deletions in non-repetitive sequence, GATK has the highest precision (99.83%) of the five programs and at the same time the lowest recall (41.92%), while Varscan has the highest recall (88.42%). This shows that the software is an important factor in detection accuracy. The precision and recall of a single program also differ considerably across repetitive sequence types and Indel sizes. The analysis above shows that variant type, variant size, repeat region type and detection software are all important factors affecting the precision and recall of Indel detection.
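The nested grouping notation G(F, S) can be illustrated with a short Python sketch. The dictionary-based records and the sample attribute values are hypothetical; only the attribute names ST, SS, RT and DS come from the text.

```python
from collections import defaultdict


def G(attr, s):
    """Group s by attribute attr; if s is already grouped, group each subgroup."""
    if isinstance(s, dict):
        return {key: G(attr, sub) for key, sub in s.items()}
    groups = defaultdict(list)
    for record in s:
        groups[record[attr]].append(record)
    return dict(groups)


calls = [
    {"ST": "DEL", "SS": 1, "RT": "non-repeat", "DS": "GATK"},
    {"ST": "DEL", "SS": 1, "RT": "non-repeat", "DS": "Varscan"},
    {"ST": "INS", "SS": 9, "RT": "SSR", "DS": "Samtools"},
]

# G(DS, G(RT, G(SS, G(ST, calls)))): the outermost key is ST, the innermost is DS.
nested = G("DS", G("RT", G("SS", G("ST", calls))))
print(nested["DEL"][1]["non-repeat"].keys())
```

Precision and recall can then be computed per leaf group, which is what makes the per-program differences described above visible.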

VI. Optimal screening method based on the optimal F value strategy

The analysis above shows that, at the macroscopic level, extracting the consistent results of multiple programs improves precision. However, the simulated data show that among the groups in G(SS, G(ST, consistent results of two programs)), some have high precision and high recall, some have high precision but low recall, and some have very low or even zero precision and recall. Directly merging the consistent results of every pair of programs therefore cannot yield optimal precision. The F value is an index for assessing the balance between precision and recall, and is calculated as follows.

F = 2 × p × r / (p + r)

where p is the precision and r is the recall.
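As a worked example of the formula, the following Python sketch computes the F value from a precision and recall pair, using the GATK figures quoted earlier for 1 bp deletions in non-repetitive sequence (99.83% and 41.92%). The helper names and the count-based `precision_recall` function are assumptions added for illustration.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision p = TP / (TP + FP); recall r = TP / (TP + FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def f_value(p: float, r: float) -> float:
    """F = 2 * p * r / (p + r), defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0


f_gatk = f_value(0.9983, 0.4192)
print(round(f_gatk, 4))
```

Here the very high precision is pulled down by the low recall, giving an F value of roughly 0.59, which is why the F value is useful for comparing groups with unbalanced precision and recall.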

Through simulation experiments, the invention found that within G(RT, G(SS, G(ST, consistent results of two programs))) the F values of the consistent results (IR) of two programs vary in a stable pattern, and that the optimal F value occurs at a different IR for different groups (Fig. 1).

In the figure, G1 shows the F value of each IR for 1 bp deletions in TIR-type regions, and G2 shows the F value of each IR for 9 bp deletions in SSR-type regions. The optimal F value of G1 occurs in the consistent results of Samtools and Varscan; the optimal F value of G2 occurs in the consistent results of Samtools and SOAPIndel.

Based on the analysis above, we propose a simple and intuitive optimization strategy based on the optimal F value of each group:

1. Establish the optimization rule

Select the Indel detection programs, and simulate chromosomal variants and sequencing. Detect Indels with the programs and calculate the joint F value of every pair of programs. Establish an optimal selection rule from the optimal F values.

2. Select Indels according to the optimization rule

Detect Indels with the programs and group the results by DS, RT, SS and ST. Select Indels according to the optimization rule.
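The two steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: Indels are hypothetical tuples `(chrom, pos, size, variant_type, repeat_type)`, two calls are treated as consistent only when the tuples are identical (the coordinate tolerance is omitted), and the function names and toy data are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def group_key(indel):
    """Group by (ST, SS, RT); the software pair plays the role of DS."""
    _, _, size, st, rt = indel
    return (st, size, rt)


def by_group(indels):
    groups = defaultdict(set)
    for indel in indels:
        groups[group_key(indel)].add(indel)
    return groups


def f_value(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def build_rules(sim_calls, truth):
    """Step 1: per group, keep the software pair whose consistent results have the best F."""
    truth_groups = by_group(truth)
    best = {}
    for a, b in combinations(sorted(sim_calls), 2):
        consistent = by_group(sim_calls[a] & sim_calls[b])
        for key, true_set in truth_groups.items():
            called = consistent.get(key, set())
            tp = len(called & true_set)
            f = f_value(tp, len(called) - tp, len(true_set) - tp)
            if f > best.get(key, (0.0, None))[0]:
                best[key] = (f, (a, b))
    return {key: pair for key, (_, pair) in best.items()}


def select_indels(calls, rules):
    """Step 2: per group, keep the consistent results of the pair chosen in step 1."""
    selected = set()
    for key, (a, b) in rules.items():
        selected |= {i for i in calls[a] & calls[b] if group_key(i) == key}
    return selected


t1, t2 = ("chr1", 100, 1, "DEL", "TIR"), ("chr1", 200, 1, "DEL", "TIR")
t3 = ("chr1", 300, 9, "DEL", "SSR")
sim_calls = {"Samtools": {t1, t2, t3}, "Varscan": {t1, t2}, "SOAPIndel": {t1, t3}}
rules = build_rules(sim_calls, truth={t1, t2, t3})

x1, x2 = ("chr2", 50, 1, "DEL", "TIR"), ("chr2", 80, 9, "DEL", "SSR")
x3 = ("chr2", 90, 1, "DEL", "TIR")
test_calls = {"Samtools": {x1, x2}, "Varscan": {x1}, "SOAPIndel": {x2, x3}}
selected = select_indels(test_calls, rules)
print(rules, selected)
```

In this toy data the rule picks one pair for the 1 bp TIR group and a different pair for the 9 bp SSR group, mirroring the observation that the optimal F value occurs at a different pair of programs for different groups.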

Across all variants, the precision of this method (99.32%) is higher than that of Samtools (97.46%), Pindel (94.69%), SOAPIndel (97.24%) and Varscan (98.59%), and its recall (65.20%) is higher than that of GATK UnifiedGenotyper (25.50%) and Pindel (41.36%).

VII. Screening method based on deep learning

The optimal F value method is based on results that are consistent between programs, so it discards Indels detected by only one program. The simulated data show that Indels detected by only one program account for nearly 20% of the total, so discarding all of them seriously reduces recall. To make fuller use of the results of all programs, and so obtain higher recall while preserving the balance between precision and recall, the invention designs a method based on deep learning to screen the detection results of all programs. All raw data are used as the training set. The training features are the program that detected the Indel, the Indel type, the type of repetitive sequence in which the Indel lies, and the number of reads supporting the Indel call (coverage); the training targets are precision and recall. With this training set, a model can be trained that makes precision and recall as high as possible.

The deep learning program is developed with TensorLayer, a deep learning and reinforcement learning library built on Google's TensorFlow.
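The four training features listed above could be turned into a numeric input vector along the following lines. This is a plain-Python sketch of one possible encoding (one-hot categories plus raw coverage), which the patent does not specify; the vocabularies, in particular the repeat-type list, and the function names are assumptions, and no TensorLayer calls are shown.

```python
SOFTWARE = ["Samtools", "GATK", "Varscan", "Pindel", "SOAPIndel"]
INDEL_TYPES = ["INS", "DEL"]
REPEAT_TYPES = ["non-repeat", "TIR", "SSR"]  # illustrative subset, not exhaustive


def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of value."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value}")
    return [1.0 if value == v else 0.0 for v in vocabulary]


def encode(software, indel_type, repeat_type, coverage):
    """Concatenate the categorical features with the supporting read count."""
    return (one_hot(software, SOFTWARE)
            + one_hot(indel_type, INDEL_TYPES)
            + one_hot(repeat_type, REPEAT_TYPES)
            + [float(coverage)])


x = encode("Varscan", "DEL", "SSR", 12)
print(x)
```

Vectors of this kind, one per raw call, would form the input of whichever network is trained; Indels reported by a single program are retained as ordinary rows rather than being discarded.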

Claims

1. A computational method for genome variation data, characterised in that the process is as follows:

1) Establish the optimization rule

Samtools, GATK, Varscan, Pindel and SOAPIndel are used to detect Indels in simulated data and generate raw Indel data; the joint F value of every pair of programs is calculated, and an optimal selection rule is established from the optimal F values;

2) Select Indels according to the optimization rule

Samtools, GATK, Varscan, Pindel and SOAPIndel are used to detect Indels in the data under test, the results are grouped by DS, RT, SS and ST, and Indels are selected according to the optimization rule.

2. The computational method according to claim 1, characterised in that F = 2 × p × r / (p + r), where p is the precision and r is the recall.