CN109935274A

CN109935274A - A kind of long reading overlay region detection method based on k-mer distribution characteristics

Info

Publication number: CN109935274A
Application number: CN201910156934.7A
Authority: CN
Inventors: 阎朝坤; 罗军伟; 张戈; 王建林; 和婧
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2019-03-01
Filing date: 2019-03-01
Publication date: 2019-06-25
Anticipated expiration: 2039-03-01
Also published as: CN109935274B

Abstract

The long reading overlay region detection method based on k-mer distribution characteristics that the present invention relates to a kind of, extracts all k-mer in long reading set, and be deposited into Hash table first；For two specific long readings, their shared every a pair of k-mer are extracted, wherein each shared k-mer is indicated with a binary group；Then a shared k-mer figure, one shared k-mer of each node on behalf in the figure are constructed；Then scheme to determine starting and the final position between the two length readings with the presence or absence of overlay region and overlay region on the two length readings using the shared k-mer.The present invention is easy to use, and accuracy is good；Good overlay region testing result is shown on real data set Escherichia coli, more other overlay region detection methods have in multinomial evaluation index more preferably to be showed.

Description

A kind of long reading overlay region detection method based on k-mer distribution characteristics

Technical field

The present invention relates to the sequence assembling field of bioinformatics, especially a kind of long reading based on k-mer distribution characteristics Number overlay region detection method.

Background technique

In bioinformatics, the groundwork of genome research is that the gene of magnanimity is sequenced, the purpose is to for The hereditary information in organism in genome is obtained, is then applied to gene diagnosis, it is numerous that gene and disease association are analyzed etc. Field.Genome refers to that intracellular all hereditary information, this hereditary information are stored in the form of nucleotide sequence, wherein deoxidation Nucleotide DNA molecule is made of four kinds of nitrogenous bases: adenine (A), thymidine (T), cytimidine (C) and guanine (G).Two Nucleotide constitutes DNA sequence dna according to base pair complementarity, and base A and T, G and C constitute base-pair.Pass through gene order-checking skill Art can obtain base sequence segment in lots of genes group sequence (reading or read).Sequence assembling is obtained by these The process of sequence fragment reduction whole gene group DNA sequence dna.

The chain termination method and Maxam and Gilbert that first generation DNA sequencing technology mainly has Sanger and Coulson to start The chemical method (chain degradation) of invention, Sanger in 1977 determine first genome sequence --- bacteriophage phiX-174, from This mankind obtains the ability for exploring life quintessence, and has and thus really stepped into the genomics epoch for the beginning.Although the Generation sequencing technologies accuracy can reach 99.999%, but its cost is very high, and flux is very low, and large-scale application is faced with It is difficult.Skill is sequenced using 454 technologies of Roche company, the Solexa/HiSeq technology of illumina company as the second generation of representative Art uses polymerase chain amplification (PCR, the Polymerase Chain Reaction) technology and first being sequenced in synthesis It is compared for sequencing technologies, although largely improving sequencing speed, reduces sequencing cost cost, and maintain relatively high Accuracy, but the disadvantage is that introduced PCR process can increase the error rate of sequencing to a certain extent, and there is system to be biased to Property, while reading is also shorter, in order to effectively solve defect present in second generation sequencing technologies, with Pacific Ocean bioscience public affairs Take charge of the real-time single-molecule DNA sequencing (Single Molecule Real Time, SMRT) of (Pacific Biosciences, PB) The Oxford of technology and Oxford Nanotec Solution nanometer single molecule techniques (Oxford Nanopore Technology, ONT) be The third generation sequencing technologies of representative come into being, its essential characteristics are single-molecule sequencings, do not need the process of any PCR, have System mistake caused by GC skewed popularity caused by effect is avoided because of PCR, while reading length is greatly improved, and maintained for two generations The high throughput of sequencing technologies, inexpensive advantage.Only drawback is that third generation sequencing technologies error rate is higher, such as SMRT Single-molecule sequencing technology can generate 13~18% insertion, replacement and missing errors, and need to provide longer overlay length just can be high Effect correction mistake.

Overlay region (overlap) detection is the first step of long reading assembling OLC algorithm, refers to and compares each read Compared with calculating the overlap length two-by-two between reads, pick out the reads of the condition of satisfaction, the standard selected includes overlapping region The minimum value for the specific gravity that minimum length, various sizes of k-mers (base sequence that length is k) set, identical overlapping region account for Deng.The obtained reads for having overlap is generated into an overlay chart, wherein each node is read, each side is connection The part of their overlapping traverses the topology layout that this figure just generates read, i.e. layout in OLC algorithm, and then using should Layout constructs consistent sequence, i.e. the final step consensus of OLC algorithm.Large-scale genome is being assembled using OLC algorithm When, overlap detection is the bottleneck step of sequence assembling process.Overlap detects not the still important step of OLC algorithm, and And it is also often required to and is carried out between reads in from the beginning error correction that long reading assembles (De novo assembly) Overlap detection.The reads of error correction has higher confidence level in overlap step due to the reduction of error rate With simpler effect.Therefore, initial overlap detection is the basis of subsequent assembling step, gene can be assembled correctly and Analysis is vital.

The long reading that third generation sequencing technologies generate can reach tens of thousands of bases, although long reading can be across most of weight Multiple area, but its sequencing error rate is too high, existing long reading overlap detection algorithm is main comprising the following three steps: (1) Find seed sequence (k-mer sequence) common between reads.Since PB and ONT sequencing technologies error rate is relatively high, seed sequence The length of column is generally all shorter；(2) candidate reads is determined.Usually compare the identical seed sequence position of two reads, and And there is consistent distance with other seeds.(3) the identical seed region for detecting candidate reads, determines between two reads Overlapping region, finally determine whether to complete detection by overlap.

MHAP is a kind of MinHash algorithm based on k-mer similitude to realize that the overlap to two reads detects work Tool, MHAP first hash k-mer set using multiple hash functions, establish sketch column with the minimum value of hash function Table, so that it is determined that the median position of overlapping region, then reuses shorter k-mer sequential value and carry out MinHash Algorithm determines final overlapping region, so that it is determined that relatively accurately overlap testing result.Minimap tool is one Overlaper/mapping tool, it combines many algorithms.The function of DALIGNER caching is such as combined, it is first Its minimum program listing is stored in an array, step then is merged to seed again and is ranked up, and Minimap automatically will Data set is divided into several batches, to reduce use of the large data sets on memory.The minimum device of MHAP is used, for minimum Change the effect for repeating k-mer, when selecting min-k-mers, Minimap uses a reversible hash function, prevents hashed value Corresponding to a Sequences of Low Complexity.Minimap finally applies the conllinear chain function of cluster of the matching seed of GraphMap, receives Collect approximate conllinear shared k-mer in all cachings, so that it is determined that last testing result.For each of detecting The overlay information of overlap is with the output of PAF (Pairing mapping format) format.

Currently, the overlap detecting step in the long reading sequence assembling based on third generation sequencing technologies has well The method of solution, still, there are still some shortcomings in some problems, are worth further and explore and study:

(1) sequencing error of genetic fragment, due to the problem of tool itself is sequenced, as SMRT and ONT can generate 15% Random error, it is possible that part is really filtered in the region overlap, is easy last when finding overlapping region Occur a large amount of gap (one section of all N sequence) region when assembling, the subsequent analysis and performance study of gene order are likely to There is very big deviation.

(2) overlay region detection is the maximum step of calculation amount during gene assembling, can effectively reduce calculating cost, It improves computational efficiency and accuracy rate is always the ultimate aim of overlay region detection.

The presence of these problems is the major obstacles that the existing gene order assembling tool of limitation further develops.

Summary of the invention

In view of the above shortcomings of the prior art, the present invention provides a kind of long reading overlay region based on k-mer distribution characteristics Detection method, easy to use, accuracy is high.

In order to solve the above technical problems, the technical scheme adopted by the invention is that: a kind of length based on k-mer distribution characteristics Read overlay region detection method, comprising the following steps:

1) k-mer Hash table is created；

It extracts each length in long reading set and reads included k-mer, wherein k-mer is the substring that length is k； Assuming that the length of a long reading is L, then the quantity of k-mer is (L-k+1) in length reading；Then k-mer Hash is created Table, wherein the key of the table is the cryptographic Hash of k-mer, and value value includes: that the serial number of long reading, k-mer reads in the length Beginning position and compare direction；Then the shared k- that two different long readings have can be quickly found by k-mer Hash table Mer set；

2) a pair of long reading processed not yet of selection from long reading set, and carry out overlay region detection；

2.1) shared k-mer set is calculated；

Assuming that the long reading of this pair are as follows: R1 and R2；If the direction that a k-mer occurs in R1 and R2 is consistent, claim The forward direction that this k-mer is R1 and R2 shares k-mer；If the direction that this k-mer occurs in R1 and R2 is inconsistent, claim The reversed shared k-mer that this k-mer is R1 and R2；The forward direction that they have is found first and shares k-mer, and with count0 table Then the number for showing their positive shared k-mer finds the reversed shared k-mer that they have again, and indicates it with count1 Reversed shared k-mer number；If count0 > count1 and count0 > count, only retain positive shared k- Mer only retains reversely shared k-mer, wherein count=5, remains if count1 > count0 and count1 > count Remaining situation, then it is assumed that the two length readings do not have overlay region, and restart from step 2)；

2.2) shared k-mer figure is constructed；

Using the common k-mer of each of R1 and R2 as a node, for two nodes, if between they are on R1 Away from meeting some requirements with their spacing on R2, a line is added, it, then can be with when having handled all nodes two-by-two The shared k-mer figure of building one；

2.3) overlapping region and the digital simulation degree between k-mer node are obtained；

It is shared in k-mer figure at this, obtains maximal connected subgraphs, one shared k- of each of subgraph node on behalf Mer calculates spacing of any two adjacent node on R1 in the subgraph, between then calculating any two node on R2 Away from by comparing the difference between the spacing in the spacing and R2 on R1, the fitting angle value of the subgraph being calculated, if the value Greater than S (S=0.9), then continue subsequent processing, otherwise it is assumed that the two length readings do not have overlay region, and again from step 2) Start；

2.4) starting and the final position for whether having overlay region and overlay region between R1 and R2 determined；

Position of the minimum node and maximum node of subgraph node on R1 first, a corresponding region；Subgraph node The position of minimum node and maximum node on R2 also corresponds to a region；Then to the length in the two regions and they Starting and final position judged, it is final to determine between R1 and R2 whether there is overlay region, if there is overlay region, really Fixed starting and final position of the overlay region on R1 and R2；

3) step 2) is returned to, until all readings long two-by-two have detected.

On that basis of the above technical scheme, the step 2.2) specifically:

2.2.1) each shared k-mer is expressed as a binary group；I-th of shared k-mer is indicated between R1 and R2 For (P_i, Q_i), Pi is the shared initial position of the k-mer on R1, and Qi is the shared initial position of the k-mer on R2；Assuming that R1 and R2 has m shared k-mer, and this m binary group (P as follows₁, Q₁), (P₂, Q₂), (P₃, Q₃) ..., (P_m, Q_m)；

2.2.2) the shared k-mer between R1 and R2 is ranked up；Compare the Index2 of the Index1 and R2 of R1 first Size, if Index1 > Index2, all shared k-mer carry out incremental arrangement in the initial position of R1 according to them, if Index1 < Index2, then the initial position according to them on R2 carries out incremental arrangement；

2.2.3 a node) is established to each shared k-mer between the two length readings, i.e., each node is one Share k-mer and binary group；For sorted sequence node, first from the node (P of left end first₁, Q₁) conduct Start node allows it to be judged one by one with its subsequent all node, if meeting the condition a) that side is added below or b) Any one, then a line is added, side is not otherwise added；After having handled all subsequent nodes of the start node, if side Number be greater than count/2, count is the number of all nodes, then go to step 2.3) carry out subsequent processing；Otherwise, it selects Node (the P of left end second₂, Q₂) as new start node, the condition for allowing it to be added side with its all subsequent node is sentenced It is disconnected；And so on, until finding a start node, until the number on the side that it is connected with its subsequent node is greater than count/2, And it goes to step 2.3) and carries out subsequent processing；If can not find such start node, then it is assumed that the two length readings do not have Overlay region, and restart from step 2)；

Wherein, the condition for judging whether two nodes add side is as follows:

For two node (P_i, Q_i) and (P_j, Q_j):

If (a) they are positive shared k-mer, meet the following conditions, then the edged between the two nodes:

First condition: P_i< P_jAnd Q_i< Q_j；Second condition: abs ((P_j-P_i)-(Q_j-Q_i)) < t；Wherein t= 100+min((P₂-P₁),(Q₂-Q₁))/5, if t > 500, t=500；

If (b) they are reversely to share k-mer, meet the following conditions, then the edged between the two nodes:

First condition: P₁< P₂And Q₁> Q₂；Second condition: abs ((P_j-P_i)-(Q_i-Q_j)) < t；Wherein t= 100+min((P_j-P_i),(Q_i-Q_j))/5, if t > 500, t=500.

On that basis of the above technical scheme, the step 2.3) specifically:

2.3.1 a start node, and the node connected with it can) be obtained according to above-mentioned steps；Total conode Number is m, sorted m available in this way shared k-mer；

2.3.2) each shared corresponding position of the k-mer on R1 is respectively P1, P2, P3 ..., and Pm corresponds to position on R2 Set respectively Q1, Q2, Q3 ..., Qm；On R1, calculate between i-th of shared k-mer (i ≠ 1) and its previous shared k-mer Every note X_i=P_i-P_i-1, after having handled all shared k-mer, available m-1 interval, i.e. X₁=P₂-P₁, X₂=P₃- P₂..., X_m-1=P_m-P_m-1；On R2, the interval of i-th shared k-mer (i ≠ 1) and its previous shared k-mer, note are calculated Y_i=Q_i-Q_i-1, after having handled all shared k-mer, available m-1 interval, i.e. Y₁=Q₂-Q₁, Y₂=Q₃-Q₂..., Y_m-1=Q_m-Q_m-1；Calculation formula according to the above interval information, digital simulation degree, the value is as follows:

The result illustrates that a possibility that k-mer node in the subgraph corresponds to overlay region is higher, wherein n=m- closer to 1 1；If the value is greater than S, wherein S=0.9, then continue subsequent processing, otherwise it is assumed that the two length readings do not have overlay region, and Restart from step 2).

On that basis of the above technical scheme, the step 2.4) specifically:

Start node and terminal node are calculated according to subgraph node, it can be in R₁On obtain a section [P₁, P_m+ k], it can In R₂On also obtain a section [Q₁, Q_m+ k], the two sections are exactly R₁And R₂Corresponding overlay region；Since mistake is sequenced Presence, the starting of true overlay region and final position may shift with above-mentioned two section, it is assumed that in R₁On it is true Area is [SP₁, EP₁], in R₂On true overlay region be [SP₂, EP₂]；R₁And R₂Length be respectively Len₁And Len₂；

True overlay region starting and final position are obtained using following condition and formula；

If P₁> Q₁And Len₁-P_m<=Len₂-Q_m, obtain overlay region SP on R1₁=P₁-Q₁, EP₁=Len₁, It is SP on R2₂=1, EP₂=Q_m+Len₁-P_m；

If P₁<=Q₁And Len₁-P_m<=Len₂-Q_m, obtain overlay region SP on R1₁=1, EP₁=Len₁, in R2 Upper is SP₂=Q₁-P₁, EP₂=Q_m+Len₁-P_m；

If P₁> Q₁And Len₁-P_m> Len₂-Q_m, overlay region is obtained on R1, EP₁=P_m+Len₂-Q_m, SP₁=P₁- Q₁, it is SP on R2₂=1, EP₂=Len₂；

If P₁<=Q₁And Len₁-P_m> Len₂-Q_m, obtain overlay region EP on R1₁=P_m+Len₂-Q_m, SP₁=1, It is SP on R2₂=Q₁-P₁, EP₂=Len₂；

By handling to obtain true starting and final position of the overlay region on R1 and R2 above；

From the foregoing, it will be observed that the comparison length on R1 is R1OverlapLen=EP₁-SP₁；Comparison length on R2 is R2OverlapLen=EP₂-SP₂；In the Maximum overlap length that two compare on section, it is respectively with minimum overlay length: MinOverlapLen=min (R1OverlapLen, R2OverlapLen),

MaxOverlapLen=max

(R1OverlapLen, R2OverlapLen)；

a)(EP₁-SP₁)-(P_m+k-P₁) < δ, and (EP2-SP2)-(Q_m+ k-Q1) < δ；I.e. final initial position and true Fullsized cannot be greater than δ (δ=500) to initial position difference；

B) (ε=500) MinOverlapLen > ε；

c)That is corresponding comparison siding-to-siding block length cannot differ too big on two long readings；

When R1 and R2 meets above three condition, then claim R1 and R2 that there is overlapping relation, i.e. section on R1 [SP₁, EP₁] and R2 on section [SP₂, EP₂] overlap；Otherwise it is assumed that R1 and R2 does not have overlapping relation, and from step 2) Restart；

If shared k-mer is positive shared k-mer, this two long reading has positive overlay region；If shared k- Mer is reversely to share k-mer, then two long reading has reversed overlay region side.

The utility model has the advantages that the long reading overlay region detection method provided by the invention based on k-mer distribution characteristics, simple easy With high-efficient and accuracy is high.

Detailed description of the invention

Fig. 1 is long reading k-mer position comparison chart of one embodiment of the invention a pair with overlap；

Fig. 2 is one embodiment of the invention overlay region exemplary diagram.

Specific embodiment

A kind of long reading overlay region detection method based on k-mer distribution characteristics provided by the invention, specific steps process It is as follows:

One, k-mer Hash table is created.

It extracts each length in long reading set and reads included k-mer.Assuming that the length of a long reading is L, Then the quantity of k-mer is (L-k+1) in length reading.Then k-mer Hash table is created, wherein the key of the table is k-mer Cryptographic Hash, value value include: the serial number of long reading, and k-mer is in the initial position that the length is read and compares direction.Can then it lead to K-mer Hash table is crossed, the shared k-mer set that two different long readings have is quickly found.

Two, a pair of long reading processed not yet of selection from long reading set, and carry out overlay region detection.

2.1) shared k-mer set is calculated.Assuming that the long reading of this pair are as follows: R1 and R2.Finding them first has Forward direction share k-mer, i.e., each positive shared k-mer can be compared in the same direction on this two long readings, and with Count0 indicates the number of their positive shared k-mer.Then find the reversed shared k-mer that they have again, i.e., it is each total The direction for having k-mer to compare onto the two length readings is inconsistent, and of their reversed shared k-mer is indicated with count1 Number.If count0 > count1 and count0 > count, only retain positive shared k-mer, if count1 > Count0 and count1 > count then only retains reversely shared k-mer, wherein count=5.Remaining situation, then it is assumed that The two length readings do not have overlay region, and restart from step 2).

2.2) shared k-mer figure is constructed.Using the common k-mer of each of R1 and R2 as a node.It is saved for two Point adds a line if their spacing on R1 and their spacing on R2 meet some requirements.When having handled All nodes can then construct a shared k-mer figure.

2.2.1) each shared k-mer is expressed as a binary group.I-th of shared k-mer is indicated between R1 and R2 For (P_i, Q_i), Pi is the shared initial position of the k-mer on R1, and Qi is the shared initial position of the k-mer on R2.Assuming that R1 and R2 has m shared k-mer, and this m binary group (P as follows₁, Q₁), (P₂, Q₂), (P₃, Q₃) ..., (P_m, Q_m), as shown in Figure 1.

2.2.2) the shared k-mer between R1 and R2 is ranked up.Compare the Index2 of the Index1 and R2 of R1 first Size, if Index1 > Index2, all shared k-mer carry out incremental arrangement in the initial position of R1 according to them, If Index1 < Index2, incremental arrangement is carried out according to their initial positions on R2.

2.2.3 a node) is established to each shared k-mer between the two length readings, i.e., each node is one Share k-mer and binary group.For sorted sequence node, first from the node (P of left end first₁, Q₁) conduct Start node allows it to be judged one by one with its subsequent all node, if meeting the condition a) that side is added below or b) Any one, then a line is added, side is not otherwise added.After having handled all subsequent nodes of the start node, if side Number be greater than count/2, count is the number of all nodes, then go to step 2.3) carry out subsequent processing.Otherwise, it selects Node (the P of left end second₂, Q₂) as new start node, the condition for allowing it to be added side with its all subsequent node is sentenced It is disconnected.And so on, until finding a start node, until the number on the side that it is connected with its subsequent node is greater than count/2, And it goes to step 2.3) and carries out subsequent processing.If can not find such start node, then it is assumed that the two length readings do not have Overlay region, and restart from step 2).

Wherein, the condition for judging whether two nodes add side is as follows:

If a) they are positive shared k-mer, meet the following conditions, then the two nodes are increased into a line.First A condition: P_i< P_jAnd Q_i< Q_j.Second condition: abs ((P_j-P_i)-(Q_j-Q_i)) < t wherein t=100+min ((P₂- P₁),(Q₂-Q₁))/5, if t > 500, t=500, because there are 20% or so errors for third generation sequencing technologies, remove With 5.

If b) they are reversely to share k-mer, meet the following conditions, then the two nodes is increased into a line.First A condition: P₁< P₂And Q₁> Q₂.Second condition: abs ((P_j-P_i)-(Q_i-Q_j)) < t, wherein t=100+min ((P_j- P_i),(Q_i-Q_j))/5, if t > 500, t=500.

2.3) overlapping region and the digital simulation degree between k-mer node are obtained.It shares in k-mer figure, obtains at this The most subgraph of node, one shared k-mer of each of subgraph node on behalf.It is adjacent to calculate any two in the subgraph Then spacing of the node on R1 calculates spacing of any two node on R2.By comparing in the spacing and R2 on R1 The fitting angle value of the subgraph is calculated in difference between spacing, if the value is greater than S (S=0.9), continues subsequent processing, Otherwise it is assumed that the two length readings do not have overlay region, and restart from step 2).

2.3.1 a start node, and the node connected with it can) be obtained according to above-mentioned steps.Total conode Number is m, sorted m available in this way shared k-mer.

2.3.2) by 2.3.1) this is known to including that m relevance shares more by force k-mer on reads, as shown in Figure 1, k- Corresponding position of the mer on R1 is respectively P₁, P₂, P₃..., P_m, corresponding position is respectively Q on R2₁, Q₂, Q₃..., Q_m.In R1 On, the interval of i-th shared k-mer (i ≠ 1) and its previous shared k-mer is calculated, remembers X_i=P_i-P_i-1, all when having handled After shared k-mer, available m-1 interval, i.e. X₁=P₂-P₁, X₂=P₃-P₂..., X_m-1=P_m-P_m-1；On R2, calculate Y is remembered at the interval of i-th shared k-mer (i ≠ 1) and its previous shared k-mer_i=Q_i-Q_i-1, when having handled all shared k- After mer, available m-1 interval, i.e.,

Y₁=Q₂-Q₁, Y₂=Q₃-Q₂..., Y_m-1=Q_m-Q_m-1.According to the above interval information, the power of the subgraph node is calculated Value, the calculation formula of the weight are as follows:

The result illustrates that a possibility that k-mer node in the subgraph corresponds to overlay region is higher closer to 1, wherein n=m- 1.If the value is greater than S, wherein S=0.9, then continue subsequent processing, otherwise it is assumed that the two length readings do not have overlay region, and Restart from step 2).

2.4) starting and the final position for whether having overlay region and overlay region between R1 and R2 determined.Subgraph section first Position of the minimum node and maximum node of point on R1, a corresponding region.The minimum node and maximum node of subgraph node Position on R2 also corresponds to a region.Then the length to the two regions and their starting and final position into Row judgement, it is final to determine between R1 and R2 whether there is overlay region, if having overlay region, it is determined that the overlay region is in R1 and R2 On starting and final position.

Start node and terminal node are calculated according to node, as shown in Fig. 2, a section [P can be obtained on R1₁, P_m+ K], a section [Q can be also obtained on R2₁, Q_m+k].The two sections are exactly overlay region corresponding to R1 and R2.Due to surveying The presence of sequence mistake, true overlay region starting and final position may shift with above-mentioned two section, it is assumed that on R1 True area be [SP₁, EP₁], the true overlay region on R2 is [SP₂, EP₂].Assuming that the length of R1 and R2 be respectively Len1 and Len2。

We obtain true overlay region starting and final position using following condition and formula.

If P₁> Q₁And Len₁-P_m<=Len₂-Q_m, obtain overlay region SP on R1₁=P₁-Q₁, EP₁=Len₁, It is SP on R2₂=1, EP₂=Q_m+Len₁-P_m。

If P₁<=Q₁And Len₁-P_m<=Len₂-Q_m, obtain overlay region SP on R1₁=1, EP₁=Len₁, in R2 Upper is SP₂=Q₁-P₁, EP₂=Q_m+Len₁-P_m。

If P₁> Q₁And Len₁-P_m> Len₂-Q_m, overlay region is obtained on R1, SP₁=P₁-Q₁, EP₁=P_m+Len₂- Q_m, it is SP on R2₂=1, EP₂=Len₂。

If P₁<=Q₁And Len₁-P_m> Len₂-Q_m, obtain overlay region EP on R1₁=P_m+Len₂-Q_m, SP₁=1, It is SP on R2₂=Q₁-P₁, EP₂=Len₂。

By handling to obtain true starting and final position of the overlay region on R1 and R2 above.

(if EP1-SP1)-(P_m+k-P₁) < δ, and (EP2-SP2)-(Q_m+ k-Q1) < δ.I.e. final initial position with The true initial position difference that compares cannot be greater than δ, it is assumed that δ=500.

From the foregoing, it will be observed that the comparison length on R1 is R1OverlapLen=EP₁-SP₁；Comparison length on R2 is R2OverlapLen=EP₂-SP₂.In the Maximum overlap length that two compare on section, it is respectively with minimum overlay length MinOverlapLen=min (R1OverlapLen, R2OverlapLen), MaxOverlapLen=max (R1OverlapLen, R2OverlapLen).The reads for meeting overlay region also needs to meet the following conditions first:

A) minimum value, that is, MinOverlapLen the > ε, ε for really comparing length is an overlapping region threshold value of setting, is write from memory Recognizing ε=500 can also be other values.

B) the correspondence section of two reads cannot differ too big: i.e. MinOverlapLen*3 > MaxOverlapLen*2.

Compare if it is forward direction: the direction of two read is consistent, compares that lap position is end to end or a read quilt Another read is all covered.Compare if it is reversed: two read's is contrary, i.e., wherein a read has been carried out reversely Complementary operation, the lap position of comparison are all covered in the stem of read or tail portion or a read by another read Lid.

Three, it is reads pairs remaining to continue cycling through detection, until all being compared two-by-two between all reads.

Four, experimental verification

4.1 data sets and evaluation index

In order to verify the validity of this method, this method is at species Escherichia coli (Escherichia coli/E.coli) On long readings collection on tested, and be compared analysis with currently a popular other two kinds of overlay region detection methods. It is that two different third generation sequencing technologies (survey in real time by big flat foreign sequencing company unimolecule using two kinds long readings collection Sequence technology and Oxford nano-pore sequencing technology) it obtains, be respectively: ReadSet1 and ReadSet2.The detailed letter of long reading set Breath is shown in Table 1.

The long readings collection of table 1

2 evaluation index of table

Note: True Positive, TP: true classification is positive example, and prediction classification is positive example；False Positive, FP: True classification is negative example, and prediction classification is positive example；False Negative, FN: true classification is positive example, and prediction classification is negative Example；True Negative, TN: true classification is negative example, and prediction classification is negative example.

In order to evaluate the accuracy of overlay region detection method, Blasr tool is used first, and the comparison of reads data set is arrived With reference on gene, according to comparison result, read-around ratio is calculated to initial position and end position on reference gene, to calculate reading Overlapping region between number and reading, determines the overlap result of standard.And result and other methods and this method are compared Compared with comparison result is evaluated from many aspects, wherein main evaluation index is shown in Table 2.

4.2, the comparison between Overlap detection method

This method is detected as standard and other two popular sides overlap with the Overlap that Blasr tool generates Method compares, both overlap detection methods include: MHAP and minimap, and this method is named as FLRO.Specific ratio 3 and table 4 are shown in Table compared with evaluation result.

Table 3 detects evaluation result based on the overlap of the long reading set of the real-time sequencing technologies of unimolecule

4.2.1, the overlap based on the long reading set of the real-time sequencing technologies of unimolecule detects evaluation result

Overlap detection, evaluation knot are carried out first with the long reading set obtained by the real-time sequencing technologies of unimolecule Fruit is shown in Table 3.It will be seen that our method accuracy rate and other methods is equally matched on all data sets, commenting Obvious high and other two methods in valence index recall rate illustrate that this method can more find out correct matched overlap, But detection number is more, and weak tendency is just in precision, but on overall target F-Measure, the index highest of this method, Effect is best.

4.2.2, the overlap based on Oxford nano-pore sequencing Chief Technology Officer reading set detects evaluation result

We carry out overlap detection followed by the long reading set obtained by Oxford nano-pore sequencing technology, comment Valence the results are shown in Table 4.It will be seen that our method will be substantially better than other in many index and refer on data set Mark, especially compares result with MHAP and exists, be higher by 24% and 32% respectively in recall rate and overall target F-Measure.

Table 4 detects evaluation result based on the overlap of Oxford nano-pore sequencing Chief Technology Officer reading set

Claims

1. a kind of long reading overlay region detection method based on k-mer distribution characteristics, which comprises the following steps:

1) k-mer Hash table is created；

It extracts each length in long reading set and reads included k-mer, wherein k-mer is the substring that length is k；Assuming that The length of one long reading is L, then the quantity of k-mer is (L-k+1) in length reading；Then k-mer Hash table is created, In the table key be k-mer cryptographic Hash, value value includes: the serial number of long reading, k-mer in the initial position that the length is read With comparison direction；Then the shared k-mer set that two different long readings have can be quickly found by k-mer Hash table；

2.1) shared k-mer set is calculated；

Assuming that the long reading of this pair are as follows: R1 and R2；If the direction that a k-mer occurs in R1 and R2 is consistent, claim this The forward direction that k-mer is R1 and R2 shares k-mer；If the direction that this k-mer occurs in R1 and R2 is inconsistent, claim this The reversed shared k-mer that k-mer is R1 and R2；The forward direction that they have is found first and shares k-mer, and it is indicated with count0 Positive shared k-mer number, then find the reversed shared k-mer that they have again, and indicate them instead with count1 To the number of shared k-mer；If count0 > count1 and count0 > count, only retain positive shared k-mer, If count1 > count0 and count1 > count, only retain reversely shared k-mer, wherein count=5, remaining Situation, then it is assumed that the two length readings do not have overlay region, and restart from step 2)；

2.2) shared k-mer figure is constructed；

Using the common k-mer of each of R1 and R2 as a node, for two nodes, if their spacing on R1 and Their spacing on R2 meet some requirements, and add a line, when having handled all nodes two-by-two, then can construct One shared k-mer figure；

It shares in k-mer figure at this, obtains maximal connected subgraphs, one shared k-mer of each of subgraph node on behalf, Spacing of any two adjacent node on R1 in the subgraph is calculated, spacing of any two node on R2 is then calculated, leads to The difference compared between the spacing in spacing and R2 on R1 is crossed, the fitting angle value of the subgraph is calculated, if the value is greater than S (S=0.9), then continue subsequent processing, otherwise it is assumed that the two length readings do not have overlay region, and restart from step 2)；

Position of the minimum node and maximum node of subgraph node on R1 first, a corresponding region；The minimum of subgraph node The position of node and maximum node on R2 also corresponds to a region；Then the length to the two regions and their Begin and final position is judged, it is final to determine between R1 and R2 whether there is overlay region, if having overlay region, it is determined that should Starting and final position of the overlay region on R1 and R2；

3) step 2) is returned to, until all readings long two-by-two have detected.

2. the long reading overlay region detection method according to claim 1 based on k-mer distribution characteristics, which is characterized in that The step 2.2) specifically:

2.2.1) each shared k-mer is expressed as a binary group；I-th of shared k-mer is expressed as (P between R1 and R2_i, Q_i), Pi is the shared initial position of the k-mer on R1, and Qi is the shared initial position of the k-mer on R2；Assuming that R1 and R2 has m shared k-mer, and this m binary group (P as follows₁, Q₁), (P₂, Q₂), (P₃, Q₃) ..., (P_m, Q_m)；

2.2.2) the shared k-mer between R1 and R2 is ranked up；Compare the Index2 size of the Index1 and R2 of R1 first, If Index1 > Index2, all shared k-mer carry out incremental arrangement in the initial position of R1 according to them, if Index1 < Index2, then the initial position according to them on R2 carries out incremental arrangement；

2.2.3 a node) is established to each shared k-mer between the two length readings, i.e., each node is one shared K-mer and binary group；For sorted sequence node, first from the node (P of left end first₁, Q₁) as starting Node allows it to be judged one by one with its subsequent all node, if meeting the condition a) that side is added below or b) any One, then a line is added, side is not otherwise added；After having handled all subsequent nodes of the start node, if on side Number is greater than count/2, and count is the number of all nodes, then goes to step 2.3) and carry out subsequent processing；Otherwise, left end is selected Second node (P₂, Q₂) as new start node, allow it to be added the condition judgement on side with its all subsequent node；With This analogizes, and until finding a start node, until the number on the side that it is connected with its subsequent node is greater than count/2, and turns Subsequent processing is carried out to step 2.3)；If can not find such start node, then it is assumed that the two length readings do not have overlapping Area, and restart from step 2)；

Wherein, the condition for judging whether two nodes add side is as follows:

For two node (P_i, Q_i) and (P_j, Q_j):

First condition: P_i< P_jAnd Q_i< Q_j；Second condition: abs ((P_j-P_i)-(Q_j-Q_i)) < t；Wherein t=100+ min((P₂-P₁),(Q₂-Q₁))/5, if t > 500, t=500；

First condition: P₁< P₂And Q₁> Q₂；Second condition: abs ((P_j-P_i)-(Q_i-Q_j)) < t；Wherein t=100+ min((P_j-P_i),(Q_i-Q_j))/5, if t > 500, t=500.

3. the long reading overlay region detection method according to claim 1 based on k-mer distribution characteristics, which is characterized in that The step 2.3) specifically:

2.3.1 a start node, and the node connected with it can) be obtained according to above-mentioned steps；The number of total conode For m, sorted m available in this way shared k-mer；

2.3.2) each shared corresponding position of the k-mer on R1 is respectively P1, P2, P3 ..., Pm, the corresponding position point on R2 Not Wei Q1, Q2, Q3 ..., Qm；On R1, the interval of i-th shared k-mer (i ≠ 1) and its previous shared k-mer is calculated, Remember X_i=P_i-P_i-1, after having handled all shared k-mer, available m-1 interval, i.e. X₁=P₂-P₁, X₂=P₃-P₂..., X_m-1=P_m-P_m-1；On R2, the interval of i-th shared k-mer (i ≠ 1) and its previous shared k-mer is calculated, remembers Y_i=Q_i- Q_i-1, after having handled all shared k-mer, available m-1 interval, i.e. Y₁=Q₂-Q₁, Y₂=Q₃-Q₂..., Y_m-1=Q_m- Q_m-1；Calculation formula according to the above interval information, digital simulation degree, the value is as follows:

The result illustrates that a possibility that k-mer node in the subgraph corresponds to overlay region is higher, wherein n=m-1 closer to 1；Such as The fruit value is greater than S, and wherein S=0.9, then continue subsequent processing, otherwise it is assumed that the two length readings do not have overlay region, and from step It is rapid 2) to restart.

4. the long reading overlay region detection method according to claim 1 based on k-mer distribution characteristics, which is characterized in that The step 2.4) specifically:

Start node and terminal node are calculated according to subgraph node, it can be in R₁On obtain a section [P₁, P_m+ k], it can be in R₂ On also obtain a section [Q₁, Q_m+ k], the two sections are exactly R₁And R₂Corresponding overlay region；Due to depositing for sequencing mistake The starting of true overlay region and final position may shift with above-mentioned two section, it is assumed that in R₁On true area be [SP₁, EP₁], in R₂On true overlay region be [SP₂, EP₂]；R₁And R₂Length be respectively Len₁And Len₂；

If P₁> Q₁And Len₁-P_m<=Len₂-Q_m, obtain overlay region SP on R1₁=P₁-Q₁, EP₁=Len₁, on R2 For SP₂=1, EP₂=Q_m+Len₁-P_m；

If P₁<=Q₁And Len₁-P_m<=Len₂-Q_m, obtain overlay region SP on R1₁=1, EP₁=Len₁, it is on R2 SP₂=Q₁-P₁, EP₂=Q_m+Len₁-P_m；

If P₁> Q₁And Len₁-P_m> Len₂-Q_m, overlay region is obtained on R1, EP₁=P_m+Len₂-Q_m, SP₁=P₁-Q₁, It is SP on R2₂=1, EP₂=Len₂；

If P₁<=Q₁And Len₁-P_m> Len₂-Q_m, obtain overlay region EP on R1₁=P_m+Len₂-Q_m, SP₁=1, in R2 Upper is SP₂=Q₁-P₁, EP₂=Len₂；

From the foregoing, it will be observed that the comparison length on R1 is R1OverlapLen=EP₁-SP₁；Comparison length on R2 is R2OverlapLen=EP₂-SP₂；In the Maximum overlap length that two compare on section, it is respectively with minimum overlay length: MinOverlapLen=min (R1OverlapLen, R2OverlapLen), MaxOverlapLen=max (R1OverlapLen, R2OverlapLen)；

a)(EP₁-SP₁)-(P_m+k-P₁) < δ, and (EP2-SP2)-(Q_m+ k-Q1) < δ；I.e. final initial position and true ratio δ (δ=500) cannot be greater than to initial position difference；

B) (ε=500) MinOverlapLen > ε；

When R1 and R2 meets above three condition, then claim R1 and R2 that there is overlapping relation, i.e. section [SP on R1₁, EP₁] and R2 on section [SP₂, EP₂] overlap；Otherwise it is assumed that R1 and R2 does not have overlapping relation, and again from step 2) Start；

If shared k-mer is positive shared k-mer, this two long reading has positive overlay region；If shared k-mer is Reversely shared k-mer, then two long reading has reversed overlay region side.