CN109935274A - A kind of long reading overlay region detection method based on k-mer distribution characteristics - Google Patents

A kind of long reading overlay region detection method based on k-mer distribution characteristics Download PDF

Info

Publication number
CN109935274A
CN109935274A CN201910156934.7A CN201910156934A CN109935274A CN 109935274 A CN109935274 A CN 109935274A CN 201910156934 A CN201910156934 A CN 201910156934A CN 109935274 A CN109935274 A CN 109935274A
Authority
CN
China
Prior art keywords
mer
shared
node
overlay region
len
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910156934.7A
Other languages
Chinese (zh)
Other versions
CN109935274B (en
Inventor
阎朝坤
罗军伟
张戈
王建林
和婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN201910156934.7A priority Critical patent/CN109935274B/en
Publication of CN109935274A publication Critical patent/CN109935274A/en
Application granted granted Critical
Publication of CN109935274B publication Critical patent/CN109935274B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The long reading overlay region detection method based on k-mer distribution characteristics that the present invention relates to a kind of, extracts all k-mer in long reading set, and be deposited into Hash table first;For two specific long readings, their shared every a pair of k-mer are extracted, wherein each shared k-mer is indicated with a binary group;Then a shared k-mer figure, one shared k-mer of each node on behalf in the figure are constructed;Then scheme to determine starting and the final position between the two length readings with the presence or absence of overlay region and overlay region on the two length readings using the shared k-mer.The present invention is easy to use, and accuracy is good;Good overlay region testing result is shown on real data set Escherichia coli, more other overlay region detection methods have in multinomial evaluation index more preferably to be showed.

Description

A kind of long reading overlay region detection method based on k-mer distribution characteristics
Technical field
The present invention relates to the sequence assembling field of bioinformatics, especially a kind of long reading based on k-mer distribution characteristics Number overlay region detection method.
Background technique
In bioinformatics, the groundwork of genome research is that the gene of magnanimity is sequenced, the purpose is to for The hereditary information in organism in genome is obtained, is then applied to gene diagnosis, it is numerous that gene and disease association are analyzed etc. Field.Genome refers to that intracellular all hereditary information, this hereditary information are stored in the form of nucleotide sequence, wherein deoxidation Nucleotide DNA molecule is made of four kinds of nitrogenous bases: adenine (A), thymidine (T), cytimidine (C) and guanine (G).Two Nucleotide constitutes DNA sequence dna according to base pair complementarity, and base A and T, G and C constitute base-pair.Pass through gene order-checking skill Art can obtain base sequence segment in lots of genes group sequence (reading or read).Sequence assembling is obtained by these The process of sequence fragment reduction whole gene group DNA sequence dna.
The chain termination method and Maxam and Gilbert that first generation DNA sequencing technology mainly has Sanger and Coulson to start The chemical method (chain degradation) of invention, Sanger in 1977 determine first genome sequence --- bacteriophage phiX-174, from This mankind obtains the ability for exploring life quintessence, and has and thus really stepped into the genomics epoch for the beginning.Although the Generation sequencing technologies accuracy can reach 99.999%, but its cost is very high, and flux is very low, and large-scale application is faced with It is difficult.Skill is sequenced using 454 technologies of Roche company, the Solexa/HiSeq technology of illumina company as the second generation of representative Art uses polymerase chain amplification (PCR, the Polymerase Chain Reaction) technology and first being sequenced in synthesis It is compared for sequencing technologies, although largely improving sequencing speed, reduces sequencing cost cost, and maintain relatively high Accuracy, but the disadvantage is that introduced PCR process can increase the error rate of sequencing to a certain extent, and there is system to be biased to Property, while reading is also shorter, in order to effectively solve defect present in second generation sequencing technologies, with Pacific Ocean bioscience public affairs Take charge of the real-time single-molecule DNA sequencing (Single Molecule Real Time, SMRT) of (Pacific Biosciences, PB) The Oxford of technology and Oxford Nanotec Solution nanometer single molecule techniques (Oxford Nanopore Technology, ONT) be The third generation sequencing technologies of representative come into being, its essential characteristics are single-molecule sequencings, do not need the process of any PCR, have System mistake caused by GC skewed popularity caused by effect is avoided because of PCR, while reading length is greatly improved, and maintained for two generations The high throughput of sequencing technologies, inexpensive advantage.Only drawback is that third generation sequencing technologies error rate is higher, such as SMRT Single-molecule sequencing technology can generate 13~18% insertion, replacement and missing errors, and need to provide longer overlay length just can be high Effect correction mistake.
Overlay region (overlap) detection is the first step of long reading assembling OLC algorithm, refers to and compares each read Compared with calculating the overlap length two-by-two between reads, pick out the reads of the condition of satisfaction, the standard selected includes overlapping region The minimum value for the specific gravity that minimum length, various sizes of k-mers (base sequence that length is k) set, identical overlapping region account for Deng.The obtained reads for having overlap is generated into an overlay chart, wherein each node is read, each side is connection The part of their overlapping traverses the topology layout that this figure just generates read, i.e. layout in OLC algorithm, and then using should Layout constructs consistent sequence, i.e. the final step consensus of OLC algorithm.Large-scale genome is being assembled using OLC algorithm When, overlap detection is the bottleneck step of sequence assembling process.Overlap detects not the still important step of OLC algorithm, and And it is also often required to and is carried out between reads in from the beginning error correction that long reading assembles (De novo assembly) Overlap detection.The reads of error correction has higher confidence level in overlap step due to the reduction of error rate With simpler effect.Therefore, initial overlap detection is the basis of subsequent assembling step, gene can be assembled correctly and Analysis is vital.
The long reading that third generation sequencing technologies generate can reach tens of thousands of bases, although long reading can be across most of weight Multiple area, but its sequencing error rate is too high, existing long reading overlap detection algorithm is main comprising the following three steps: (1) Find seed sequence (k-mer sequence) common between reads.Since PB and ONT sequencing technologies error rate is relatively high, seed sequence The length of column is generally all shorter;(2) candidate reads is determined.Usually compare the identical seed sequence position of two reads, and And there is consistent distance with other seeds.(3) the identical seed region for detecting candidate reads, determines between two reads Overlapping region, finally determine whether to complete detection by overlap.
MHAP is a kind of MinHash algorithm based on k-mer similitude to realize that the overlap to two reads detects work Tool, MHAP first hash k-mer set using multiple hash functions, establish sketch column with the minimum value of hash function Table, so that it is determined that the median position of overlapping region, then reuses shorter k-mer sequential value and carry out MinHash Algorithm determines final overlapping region, so that it is determined that relatively accurately overlap testing result.Minimap tool is one Overlaper/mapping tool, it combines many algorithms.The function of DALIGNER caching is such as combined, it is first Its minimum program listing is stored in an array, step then is merged to seed again and is ranked up, and Minimap automatically will Data set is divided into several batches, to reduce use of the large data sets on memory.The minimum device of MHAP is used, for minimum Change the effect for repeating k-mer, when selecting min-k-mers, Minimap uses a reversible hash function, prevents hashed value Corresponding to a Sequences of Low Complexity.Minimap finally applies the conllinear chain function of cluster of the matching seed of GraphMap, receives Collect approximate conllinear shared k-mer in all cachings, so that it is determined that last testing result.For each of detecting The overlay information of overlap is with the output of PAF (Pairing mapping format) format.
Currently, the overlap detecting step in the long reading sequence assembling based on third generation sequencing technologies has well The method of solution, still, there are still some shortcomings in some problems, are worth further and explore and study:
(1) sequencing error of genetic fragment, due to the problem of tool itself is sequenced, as SMRT and ONT can generate 15% Random error, it is possible that part is really filtered in the region overlap, is easy last when finding overlapping region Occur a large amount of gap (one section of all N sequence) region when assembling, the subsequent analysis and performance study of gene order are likely to There is very big deviation.
(2) overlay region detection is the maximum step of calculation amount during gene assembling, can effectively reduce calculating cost, It improves computational efficiency and accuracy rate is always the ultimate aim of overlay region detection.
The presence of these problems is the major obstacles that the existing gene order assembling tool of limitation further develops.
Summary of the invention
In view of the above shortcomings of the prior art, the present invention provides a kind of long reading overlay region based on k-mer distribution characteristics Detection method, easy to use, accuracy is high.
In order to solve the above technical problems, the technical scheme adopted by the invention is that: a kind of length based on k-mer distribution characteristics Read overlay region detection method, comprising the following steps:
1) k-mer Hash table is created;
It extracts each length in long reading set and reads included k-mer, wherein k-mer is the substring that length is k; Assuming that the length of a long reading is L, then the quantity of k-mer is (L-k+1) in length reading;Then k-mer Hash is created Table, wherein the key of the table is the cryptographic Hash of k-mer, and value value includes: that the serial number of long reading, k-mer reads in the length Beginning position and compare direction;Then the shared k- that two different long readings have can be quickly found by k-mer Hash table Mer set;
2) a pair of long reading processed not yet of selection from long reading set, and carry out overlay region detection;
2.1) shared k-mer set is calculated;
Assuming that the long reading of this pair are as follows: R1 and R2;If the direction that a k-mer occurs in R1 and R2 is consistent, claim The forward direction that this k-mer is R1 and R2 shares k-mer;If the direction that this k-mer occurs in R1 and R2 is inconsistent, claim The reversed shared k-mer that this k-mer is R1 and R2;The forward direction that they have is found first and shares k-mer, and with count0 table Then the number for showing their positive shared k-mer finds the reversed shared k-mer that they have again, and indicates it with count1 Reversed shared k-mer number;If count0 > count1 and count0 > count, only retain positive shared k- Mer only retains reversely shared k-mer, wherein count=5, remains if count1 > count0 and count1 > count Remaining situation, then it is assumed that the two length readings do not have overlay region, and restart from step 2);
2.2) shared k-mer figure is constructed;
Using the common k-mer of each of R1 and R2 as a node, for two nodes, if between they are on R1 Away from meeting some requirements with their spacing on R2, a line is added, it, then can be with when having handled all nodes two-by-two The shared k-mer figure of building one;
2.3) overlapping region and the digital simulation degree between k-mer node are obtained;
It is shared in k-mer figure at this, obtains maximal connected subgraphs, one shared k- of each of subgraph node on behalf Mer calculates spacing of any two adjacent node on R1 in the subgraph, between then calculating any two node on R2 Away from by comparing the difference between the spacing in the spacing and R2 on R1, the fitting angle value of the subgraph being calculated, if the value Greater than S (S=0.9), then continue subsequent processing, otherwise it is assumed that the two length readings do not have overlay region, and again from step 2) Start;
2.4) starting and the final position for whether having overlay region and overlay region between R1 and R2 determined;
Position of the minimum node and maximum node of subgraph node on R1 first, a corresponding region;Subgraph node The position of minimum node and maximum node on R2 also corresponds to a region;Then to the length in the two regions and they Starting and final position judged, it is final to determine between R1 and R2 whether there is overlay region, if there is overlay region, really Fixed starting and final position of the overlay region on R1 and R2;
3) step 2) is returned to, until all readings long two-by-two have detected.
On that basis of the above technical scheme, the step 2.2) specifically:
2.2.1) each shared k-mer is expressed as a binary group;I-th of shared k-mer is indicated between R1 and R2 For (Pi, Qi), Pi is the shared initial position of the k-mer on R1, and Qi is the shared initial position of the k-mer on R2;Assuming that R1 and R2 has m shared k-mer, and this m binary group (P as follows1, Q1), (P2, Q2), (P3, Q3) ..., (Pm, Qm);
2.2.2) the shared k-mer between R1 and R2 is ranked up;Compare the Index2 of the Index1 and R2 of R1 first Size, if Index1 > Index2, all shared k-mer carry out incremental arrangement in the initial position of R1 according to them, if Index1 < Index2, then the initial position according to them on R2 carries out incremental arrangement;
2.2.3 a node) is established to each shared k-mer between the two length readings, i.e., each node is one Share k-mer and binary group;For sorted sequence node, first from the node (P of left end first1, Q1) conduct Start node allows it to be judged one by one with its subsequent all node, if meeting the condition a) that side is added below or b) Any one, then a line is added, side is not otherwise added;After having handled all subsequent nodes of the start node, if side Number be greater than count/2, count is the number of all nodes, then go to step 2.3) carry out subsequent processing;Otherwise, it selects Node (the P of left end second2, Q2) as new start node, the condition for allowing it to be added side with its all subsequent node is sentenced It is disconnected;And so on, until finding a start node, until the number on the side that it is connected with its subsequent node is greater than count/2, And it goes to step 2.3) and carries out subsequent processing;If can not find such start node, then it is assumed that the two length readings do not have Overlay region, and restart from step 2);
Wherein, the condition for judging whether two nodes add side is as follows:
For two node (Pi, Qi) and (Pj, Qj):
If (a) they are positive shared k-mer, meet the following conditions, then the edged between the two nodes:
First condition: Pi< PjAnd Qi< Qj;Second condition: abs ((Pj-Pi)-(Qj-Qi)) < t;Wherein t= 100+min((P2-P1),(Q2-Q1))/5, if t > 500, t=500;
If (b) they are reversely to share k-mer, meet the following conditions, then the edged between the two nodes:
First condition: P1< P2And Q1> Q2;Second condition: abs ((Pj-Pi)-(Qi-Qj)) < t;Wherein t= 100+min((Pj-Pi),(Qi-Qj))/5, if t > 500, t=500.
On that basis of the above technical scheme, the step 2.3) specifically:
2.3.1 a start node, and the node connected with it can) be obtained according to above-mentioned steps;Total conode Number is m, sorted m available in this way shared k-mer;
2.3.2) each shared corresponding position of the k-mer on R1 is respectively P1, P2, P3 ..., and Pm corresponds to position on R2 Set respectively Q1, Q2, Q3 ..., Qm;On R1, calculate between i-th of shared k-mer (i ≠ 1) and its previous shared k-mer Every note Xi=Pi-Pi-1, after having handled all shared k-mer, available m-1 interval, i.e. X1=P2-P1, X2=P3- P2..., Xm-1=Pm-Pm-1;On R2, the interval of i-th shared k-mer (i ≠ 1) and its previous shared k-mer, note are calculated Yi=Qi-Qi-1, after having handled all shared k-mer, available m-1 interval, i.e. Y1=Q2-Q1, Y2=Q3-Q2..., Ym-1=Qm-Qm-1;Calculation formula according to the above interval information, digital simulation degree, the value is as follows:
The result illustrates that a possibility that k-mer node in the subgraph corresponds to overlay region is higher, wherein n=m- closer to 1 1;If the value is greater than S, wherein S=0.9, then continue subsequent processing, otherwise it is assumed that the two length readings do not have overlay region, and Restart from step 2).
On that basis of the above technical scheme, the step 2.4) specifically:
Start node and terminal node are calculated according to subgraph node, it can be in R1On obtain a section [P1, Pm+ k], it can In R2On also obtain a section [Q1, Qm+ k], the two sections are exactly R1And R2Corresponding overlay region;Since mistake is sequenced Presence, the starting of true overlay region and final position may shift with above-mentioned two section, it is assumed that in R1On it is true Area is [SP1, EP1], in R2On true overlay region be [SP2, EP2];R1And R2Length be respectively Len1And Len2
True overlay region starting and final position are obtained using following condition and formula;
If P1> Q1And Len1-Pm<=Len2-Qm, obtain overlay region SP on R11=P1-Q1, EP1=Len1, It is SP on R22=1, EP2=Qm+Len1-Pm
If P1<=Q1And Len1-Pm<=Len2-Qm, obtain overlay region SP on R11=1, EP1=Len1, in R2 Upper is SP2=Q1-P1, EP2=Qm+Len1-Pm
If P1> Q1And Len1-Pm> Len2-Qm, overlay region is obtained on R1, EP1=Pm+Len2-Qm, SP1=P1- Q1, it is SP on R22=1, EP2=Len2
If P1<=Q1And Len1-Pm> Len2-Qm, obtain overlay region EP on R11=Pm+Len2-Qm, SP1=1, It is SP on R22=Q1-P1, EP2=Len2
By handling to obtain true starting and final position of the overlay region on R1 and R2 above;
From the foregoing, it will be observed that the comparison length on R1 is R1OverlapLen=EP1-SP1;Comparison length on R2 is R2OverlapLen=EP2-SP2;In the Maximum overlap length that two compare on section, it is respectively with minimum overlay length: MinOverlapLen=min (R1OverlapLen, R2OverlapLen),
MaxOverlapLen=max
(R1OverlapLen, R2OverlapLen);
a)(EP1-SP1)-(Pm+k-P1) < δ, and (EP2-SP2)-(Qm+ k-Q1) < δ;I.e. final initial position and true Fullsized cannot be greater than δ (δ=500) to initial position difference;
B) (ε=500) MinOverlapLen > ε;
c)That is corresponding comparison siding-to-siding block length cannot differ too big on two long readings;
When R1 and R2 meets above three condition, then claim R1 and R2 that there is overlapping relation, i.e. section on R1 [SP1, EP1] and R2 on section [SP2, EP2] overlap;Otherwise it is assumed that R1 and R2 does not have overlapping relation, and from step 2) Restart;
If shared k-mer is positive shared k-mer, this two long reading has positive overlay region;If shared k- Mer is reversely to share k-mer, then two long reading has reversed overlay region side.
The utility model has the advantages that the long reading overlay region detection method provided by the invention based on k-mer distribution characteristics, simple easy With high-efficient and accuracy is high.
Detailed description of the invention
Fig. 1 is long reading k-mer position comparison chart of one embodiment of the invention a pair with overlap;
Fig. 2 is one embodiment of the invention overlay region exemplary diagram.
Specific embodiment
A kind of long reading overlay region detection method based on k-mer distribution characteristics provided by the invention, specific steps process It is as follows:
One, k-mer Hash table is created.
It extracts each length in long reading set and reads included k-mer.Assuming that the length of a long reading is L, Then the quantity of k-mer is (L-k+1) in length reading.Then k-mer Hash table is created, wherein the key of the table is k-mer Cryptographic Hash, value value include: the serial number of long reading, and k-mer is in the initial position that the length is read and compares direction.Can then it lead to K-mer Hash table is crossed, the shared k-mer set that two different long readings have is quickly found.
Two, a pair of long reading processed not yet of selection from long reading set, and carry out overlay region detection.
2.1) shared k-mer set is calculated.Assuming that the long reading of this pair are as follows: R1 and R2.Finding them first has Forward direction share k-mer, i.e., each positive shared k-mer can be compared in the same direction on this two long readings, and with Count0 indicates the number of their positive shared k-mer.Then find the reversed shared k-mer that they have again, i.e., it is each total The direction for having k-mer to compare onto the two length readings is inconsistent, and of their reversed shared k-mer is indicated with count1 Number.If count0 > count1 and count0 > count, only retain positive shared k-mer, if count1 > Count0 and count1 > count then only retains reversely shared k-mer, wherein count=5.Remaining situation, then it is assumed that The two length readings do not have overlay region, and restart from step 2).
2.2) shared k-mer figure is constructed.Using the common k-mer of each of R1 and R2 as a node.It is saved for two Point adds a line if their spacing on R1 and their spacing on R2 meet some requirements.When having handled All nodes can then construct a shared k-mer figure.
2.2.1) each shared k-mer is expressed as a binary group.I-th of shared k-mer is indicated between R1 and R2 For (Pi, Qi), Pi is the shared initial position of the k-mer on R1, and Qi is the shared initial position of the k-mer on R2.Assuming that R1 and R2 has m shared k-mer, and this m binary group (P as follows1, Q1), (P2, Q2), (P3, Q3) ..., (Pm, Qm), as shown in Figure 1.
2.2.2) the shared k-mer between R1 and R2 is ranked up.Compare the Index2 of the Index1 and R2 of R1 first Size, if Index1 > Index2, all shared k-mer carry out incremental arrangement in the initial position of R1 according to them, If Index1 < Index2, incremental arrangement is carried out according to their initial positions on R2.
2.2.3 a node) is established to each shared k-mer between the two length readings, i.e., each node is one Share k-mer and binary group.For sorted sequence node, first from the node (P of left end first1, Q1) conduct Start node allows it to be judged one by one with its subsequent all node, if meeting the condition a) that side is added below or b) Any one, then a line is added, side is not otherwise added.After having handled all subsequent nodes of the start node, if side Number be greater than count/2, count is the number of all nodes, then go to step 2.3) carry out subsequent processing.Otherwise, it selects Node (the P of left end second2, Q2) as new start node, the condition for allowing it to be added side with its all subsequent node is sentenced It is disconnected.And so on, until finding a start node, until the number on the side that it is connected with its subsequent node is greater than count/2, And it goes to step 2.3) and carries out subsequent processing.If can not find such start node, then it is assumed that the two length readings do not have Overlay region, and restart from step 2).
Wherein, the condition for judging whether two nodes add side is as follows:
If a) they are positive shared k-mer, meet the following conditions, then the two nodes are increased into a line.First A condition: Pi< PjAnd Qi< Qj.Second condition: abs ((Pj-Pi)-(Qj-Qi)) < t wherein t=100+min ((P2- P1),(Q2-Q1))/5, if t > 500, t=500, because there are 20% or so errors for third generation sequencing technologies, remove With 5.
If b) they are reversely to share k-mer, meet the following conditions, then the two nodes is increased into a line.First A condition: P1< P2And Q1> Q2.Second condition: abs ((Pj-Pi)-(Qi-Qj)) < t, wherein t=100+min ((Pj- Pi),(Qi-Qj))/5, if t > 500, t=500.
2.3) overlapping region and the digital simulation degree between k-mer node are obtained.It shares in k-mer figure, obtains at this The most subgraph of node, one shared k-mer of each of subgraph node on behalf.It is adjacent to calculate any two in the subgraph Then spacing of the node on R1 calculates spacing of any two node on R2.By comparing in the spacing and R2 on R1 The fitting angle value of the subgraph is calculated in difference between spacing, if the value is greater than S (S=0.9), continues subsequent processing, Otherwise it is assumed that the two length readings do not have overlay region, and restart from step 2).
2.3.1 a start node, and the node connected with it can) be obtained according to above-mentioned steps.Total conode Number is m, sorted m available in this way shared k-mer.
2.3.2) by 2.3.1) this is known to including that m relevance shares more by force k-mer on reads, as shown in Figure 1, k- Corresponding position of the mer on R1 is respectively P1, P2, P3..., Pm, corresponding position is respectively Q on R21, Q2, Q3..., Qm.In R1 On, the interval of i-th shared k-mer (i ≠ 1) and its previous shared k-mer is calculated, remembers Xi=Pi-Pi-1, all when having handled After shared k-mer, available m-1 interval, i.e. X1=P2-P1, X2=P3-P2..., Xm-1=Pm-Pm-1;On R2, calculate Y is remembered at the interval of i-th shared k-mer (i ≠ 1) and its previous shared k-meri=Qi-Qi-1, when having handled all shared k- After mer, available m-1 interval, i.e.,
Y1=Q2-Q1, Y2=Q3-Q2..., Ym-1=Qm-Qm-1.According to the above interval information, the power of the subgraph node is calculated Value, the calculation formula of the weight are as follows:
The result illustrates that a possibility that k-mer node in the subgraph corresponds to overlay region is higher closer to 1, wherein n=m- 1.If the value is greater than S, wherein S=0.9, then continue subsequent processing, otherwise it is assumed that the two length readings do not have overlay region, and Restart from step 2).
2.4) starting and the final position for whether having overlay region and overlay region between R1 and R2 determined.Subgraph section first Position of the minimum node and maximum node of point on R1, a corresponding region.The minimum node and maximum node of subgraph node Position on R2 also corresponds to a region.Then the length to the two regions and their starting and final position into Row judgement, it is final to determine between R1 and R2 whether there is overlay region, if having overlay region, it is determined that the overlay region is in R1 and R2 On starting and final position.
Start node and terminal node are calculated according to node, as shown in Fig. 2, a section [P can be obtained on R11, Pm+ K], a section [Q can be also obtained on R21, Qm+k].The two sections are exactly overlay region corresponding to R1 and R2.Due to surveying The presence of sequence mistake, true overlay region starting and final position may shift with above-mentioned two section, it is assumed that on R1 True area be [SP1, EP1], the true overlay region on R2 is [SP2, EP2].Assuming that the length of R1 and R2 be respectively Len1 and Len2。
We obtain true overlay region starting and final position using following condition and formula.
If P1> Q1And Len1-Pm<=Len2-Qm, obtain overlay region SP on R11=P1-Q1, EP1=Len1, It is SP on R22=1, EP2=Qm+Len1-Pm
If P1<=Q1And Len1-Pm<=Len2-Qm, obtain overlay region SP on R11=1, EP1=Len1, in R2 Upper is SP2=Q1-P1, EP2=Qm+Len1-Pm
If P1> Q1And Len1-Pm> Len2-Qm, overlay region is obtained on R1, SP1=P1-Q1, EP1=Pm+Len2- Qm, it is SP on R22=1, EP2=Len2
If P1<=Q1And Len1-Pm> Len2-Qm, obtain overlay region EP on R11=Pm+Len2-Qm, SP1=1, It is SP on R22=Q1-P1, EP2=Len2
By handling to obtain true starting and final position of the overlay region on R1 and R2 above.
(if EP1-SP1)-(Pm+k-P1) < δ, and (EP2-SP2)-(Qm+ k-Q1) < δ.I.e. final initial position with The true initial position difference that compares cannot be greater than δ, it is assumed that δ=500.
From the foregoing, it will be observed that the comparison length on R1 is R1OverlapLen=EP1-SP1;Comparison length on R2 is R2OverlapLen=EP2-SP2.In the Maximum overlap length that two compare on section, it is respectively with minimum overlay length MinOverlapLen=min (R1OverlapLen, R2OverlapLen), MaxOverlapLen=max (R1OverlapLen, R2OverlapLen).The reads for meeting overlay region also needs to meet the following conditions first:
A) minimum value, that is, MinOverlapLen the > ε, ε for really comparing length is an overlapping region threshold value of setting, is write from memory Recognizing ε=500 can also be other values.
B) the correspondence section of two reads cannot differ too big: i.e. MinOverlapLen*3 > MaxOverlapLen*2.
Compare if it is forward direction: the direction of two read is consistent, compares that lap position is end to end or a read quilt Another read is all covered.Compare if it is reversed: two read's is contrary, i.e., wherein a read has been carried out reversely Complementary operation, the lap position of comparison are all covered in the stem of read or tail portion or a read by another read Lid.
Three, it is reads pairs remaining to continue cycling through detection, until all being compared two-by-two between all reads.
Four, experimental verification
4.1 data sets and evaluation index
In order to verify the validity of this method, this method is at species Escherichia coli (Escherichia coli/E.coli) On long readings collection on tested, and be compared analysis with currently a popular other two kinds of overlay region detection methods. It is that two different third generation sequencing technologies (survey in real time by big flat foreign sequencing company unimolecule using two kinds long readings collection Sequence technology and Oxford nano-pore sequencing technology) it obtains, be respectively: ReadSet1 and ReadSet2.The detailed letter of long reading set Breath is shown in Table 1.
The long readings collection of table 1
2 evaluation index of table
Note: True Positive, TP: true classification is positive example, and prediction classification is positive example;False Positive, FP: True classification is negative example, and prediction classification is positive example;False Negative, FN: true classification is positive example, and prediction classification is negative Example;True Negative, TN: true classification is negative example, and prediction classification is negative example.
In order to evaluate the accuracy of overlay region detection method, Blasr tool is used first, and the comparison of reads data set is arrived With reference on gene, according to comparison result, read-around ratio is calculated to initial position and end position on reference gene, to calculate reading Overlapping region between number and reading, determines the overlap result of standard.And result and other methods and this method are compared Compared with comparison result is evaluated from many aspects, wherein main evaluation index is shown in Table 2.
4.2, the comparison between Overlap detection method
This method is detected as standard and other two popular sides overlap with the Overlap that Blasr tool generates Method compares, both overlap detection methods include: MHAP and minimap, and this method is named as FLRO.Specific ratio 3 and table 4 are shown in Table compared with evaluation result.
Table 3 detects evaluation result based on the overlap of the long reading set of the real-time sequencing technologies of unimolecule
4.2.1, the overlap based on the long reading set of the real-time sequencing technologies of unimolecule detects evaluation result
Overlap detection, evaluation knot are carried out first with the long reading set obtained by the real-time sequencing technologies of unimolecule Fruit is shown in Table 3.It will be seen that our method accuracy rate and other methods is equally matched on all data sets, commenting Obvious high and other two methods in valence index recall rate illustrate that this method can more find out correct matched overlap, But detection number is more, and weak tendency is just in precision, but on overall target F-Measure, the index highest of this method, Effect is best.
4.2.2, the overlap based on Oxford nano-pore sequencing Chief Technology Officer reading set detects evaluation result
We carry out overlap detection followed by the long reading set obtained by Oxford nano-pore sequencing technology, comment Valence the results are shown in Table 4.It will be seen that our method will be substantially better than other in many index and refer on data set Mark, especially compares result with MHAP and exists, be higher by 24% and 32% respectively in recall rate and overall target F-Measure.
Table 4 detects evaluation result based on the overlap of Oxford nano-pore sequencing Chief Technology Officer reading set

Claims (4)

1. a kind of long reading overlay region detection method based on k-mer distribution characteristics, which comprises the following steps:
1) k-mer Hash table is created;
It extracts each length in long reading set and reads included k-mer, wherein k-mer is the substring that length is k;Assuming that The length of one long reading is L, then the quantity of k-mer is (L-k+1) in length reading;Then k-mer Hash table is created, In the table key be k-mer cryptographic Hash, value value includes: the serial number of long reading, k-mer in the initial position that the length is read With comparison direction;Then the shared k-mer set that two different long readings have can be quickly found by k-mer Hash table;
2) a pair of long reading processed not yet of selection from long reading set, and carry out overlay region detection;
2.1) shared k-mer set is calculated;
Assuming that the long reading of this pair are as follows: R1 and R2;If the direction that a k-mer occurs in R1 and R2 is consistent, claim this The forward direction that k-mer is R1 and R2 shares k-mer;If the direction that this k-mer occurs in R1 and R2 is inconsistent, claim this The reversed shared k-mer that k-mer is R1 and R2;The forward direction that they have is found first and shares k-mer, and it is indicated with count0 Positive shared k-mer number, then find the reversed shared k-mer that they have again, and indicate them instead with count1 To the number of shared k-mer;If count0 > count1 and count0 > count, only retain positive shared k-mer, If count1 > count0 and count1 > count, only retain reversely shared k-mer, wherein count=5, remaining Situation, then it is assumed that the two length readings do not have overlay region, and restart from step 2);
2.2) shared k-mer figure is constructed;
Using the common k-mer of each of R1 and R2 as a node, for two nodes, if their spacing on R1 and Their spacing on R2 meet some requirements, and add a line, when having handled all nodes two-by-two, then can construct One shared k-mer figure;
2.3) overlapping region and the digital simulation degree between k-mer node are obtained;
It shares in k-mer figure at this, obtains maximal connected subgraphs, one shared k-mer of each of subgraph node on behalf, Spacing of any two adjacent node on R1 in the subgraph is calculated, spacing of any two node on R2 is then calculated, leads to The difference compared between the spacing in spacing and R2 on R1 is crossed, the fitting angle value of the subgraph is calculated, if the value is greater than S (S=0.9), then continue subsequent processing, otherwise it is assumed that the two length readings do not have overlay region, and restart from step 2);
2.4) starting and the final position for whether having overlay region and overlay region between R1 and R2 determined;
Position of the minimum node and maximum node of subgraph node on R1 first, a corresponding region;The minimum of subgraph node The position of node and maximum node on R2 also corresponds to a region;Then the length to the two regions and their Begin and final position is judged, it is final to determine between R1 and R2 whether there is overlay region, if having overlay region, it is determined that should Starting and final position of the overlay region on R1 and R2;
3) step 2) is returned to, until all readings long two-by-two have detected.
2. the long reading overlay region detection method according to claim 1 based on k-mer distribution characteristics, which is characterized in that The step 2.2) specifically:
2.2.1) each shared k-mer is expressed as a binary group;I-th of shared k-mer is expressed as (P between R1 and R2i, Qi), Pi is the shared initial position of the k-mer on R1, and Qi is the shared initial position of the k-mer on R2;Assuming that R1 and R2 has m shared k-mer, and this m binary group (P as follows1, Q1), (P2, Q2), (P3, Q3) ..., (Pm, Qm);
2.2.2) the shared k-mer between R1 and R2 is ranked up;Compare the Index2 size of the Index1 and R2 of R1 first, If Index1 > Index2, all shared k-mer carry out incremental arrangement in the initial position of R1 according to them, if Index1 < Index2, then the initial position according to them on R2 carries out incremental arrangement;
2.2.3 a node) is established to each shared k-mer between the two length readings, i.e., each node is one shared K-mer and binary group;For sorted sequence node, first from the node (P of left end first1, Q1) as starting Node allows it to be judged one by one with its subsequent all node, if meeting the condition a) that side is added below or b) any One, then a line is added, side is not otherwise added;After having handled all subsequent nodes of the start node, if on side Number is greater than count/2, and count is the number of all nodes, then goes to step 2.3) and carry out subsequent processing;Otherwise, left end is selected Second node (P2, Q2) as new start node, allow it to be added the condition judgement on side with its all subsequent node;With This analogizes, and until finding a start node, until the number on the side that it is connected with its subsequent node is greater than count/2, and turns Subsequent processing is carried out to step 2.3);If can not find such start node, then it is assumed that the two length readings do not have overlapping Area, and restart from step 2);
Wherein, the condition for judging whether two nodes add side is as follows:
For two node (Pi, Qi) and (Pj, Qj):
If (a) they are positive shared k-mer, meet the following conditions, then the edged between the two nodes:
First condition: Pi< PjAnd Qi< Qj;Second condition: abs ((Pj-Pi)-(Qj-Qi)) < t;Wherein t=100+ min((P2-P1),(Q2-Q1))/5, if t > 500, t=500;
If (b) they are reversely to share k-mer, meet the following conditions, then the edged between the two nodes:
First condition: P1< P2And Q1> Q2;Second condition: abs ((Pj-Pi)-(Qi-Qj)) < t;Wherein t=100+ min((Pj-Pi),(Qi-Qj))/5, if t > 500, t=500.
3. the long reading overlay region detection method according to claim 1 based on k-mer distribution characteristics, which is characterized in that The step 2.3) specifically:
2.3.1 a start node, and the node connected with it can) be obtained according to above-mentioned steps;The number of total conode For m, sorted m available in this way shared k-mer;
2.3.2) each shared corresponding position of the k-mer on R1 is respectively P1, P2, P3 ..., Pm, the corresponding position point on R2 Not Wei Q1, Q2, Q3 ..., Qm;On R1, the interval of i-th shared k-mer (i ≠ 1) and its previous shared k-mer is calculated, Remember Xi=Pi-Pi-1, after having handled all shared k-mer, available m-1 interval, i.e. X1=P2-P1, X2=P3-P2..., Xm-1=Pm-Pm-1;On R2, the interval of i-th shared k-mer (i ≠ 1) and its previous shared k-mer is calculated, remembers Yi=Qi- Qi-1, after having handled all shared k-mer, available m-1 interval, i.e. Y1=Q2-Q1, Y2=Q3-Q2..., Ym-1=Qm- Qm-1;Calculation formula according to the above interval information, digital simulation degree, the value is as follows:
The result illustrates that a possibility that k-mer node in the subgraph corresponds to overlay region is higher, wherein n=m-1 closer to 1;Such as The fruit value is greater than S, and wherein S=0.9, then continue subsequent processing, otherwise it is assumed that the two length readings do not have overlay region, and from step It is rapid 2) to restart.
4. the long reading overlay region detection method according to claim 1 based on k-mer distribution characteristics, which is characterized in that The step 2.4) specifically:
Start node and terminal node are calculated according to subgraph node, it can be in R1On obtain a section [P1, Pm+ k], it can be in R2 On also obtain a section [Q1, Qm+ k], the two sections are exactly R1And R2Corresponding overlay region;Due to depositing for sequencing mistake The starting of true overlay region and final position may shift with above-mentioned two section, it is assumed that in R1On true area be [SP1, EP1], in R2On true overlay region be [SP2, EP2];R1And R2Length be respectively Len1And Len2
True overlay region starting and final position are obtained using following condition and formula;
If P1> Q1And Len1-Pm<=Len2-Qm, obtain overlay region SP on R11=P1-Q1, EP1=Len1, on R2 For SP2=1, EP2=Qm+Len1-Pm
If P1<=Q1And Len1-Pm<=Len2-Qm, obtain overlay region SP on R11=1, EP1=Len1, it is on R2 SP2=Q1-P1, EP2=Qm+Len1-Pm
If P1> Q1And Len1-Pm> Len2-Qm, overlay region is obtained on R1, EP1=Pm+Len2-Qm, SP1=P1-Q1, It is SP on R22=1, EP2=Len2
If P1<=Q1And Len1-Pm> Len2-Qm, obtain overlay region EP on R11=Pm+Len2-Qm, SP1=1, in R2 Upper is SP2=Q1-P1, EP2=Len2
By handling to obtain true starting and final position of the overlay region on R1 and R2 above;
From the foregoing, it will be observed that the comparison length on R1 is R1OverlapLen=EP1-SP1;Comparison length on R2 is R2OverlapLen=EP2-SP2;In the Maximum overlap length that two compare on section, it is respectively with minimum overlay length: MinOverlapLen=min (R1OverlapLen, R2OverlapLen), MaxOverlapLen=max (R1OverlapLen, R2OverlapLen);
a)(EP1-SP1)-(Pm+k-P1) < δ, and (EP2-SP2)-(Qm+ k-Q1) < δ;I.e. final initial position and true ratio δ (δ=500) cannot be greater than to initial position difference;
B) (ε=500) MinOverlapLen > ε;
c)That is corresponding comparison siding-to-siding block length cannot differ too big on two long readings;
When R1 and R2 meets above three condition, then claim R1 and R2 that there is overlapping relation, i.e. section [SP on R11, EP1] and R2 on section [SP2, EP2] overlap;Otherwise it is assumed that R1 and R2 does not have overlapping relation, and again from step 2) Start;
If shared k-mer is positive shared k-mer, this two long reading has positive overlay region;If shared k-mer is Reversely shared k-mer, then two long reading has reversed overlay region side.
CN201910156934.7A 2019-03-01 2019-03-01 Long reading overlap region detection method based on k-mer distribution characteristics Active CN109935274B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910156934.7A CN109935274B (en) 2019-03-01 2019-03-01 Long reading overlap region detection method based on k-mer distribution characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910156934.7A CN109935274B (en) 2019-03-01 2019-03-01 Long reading overlap region detection method based on k-mer distribution characteristics

Publications (2)

Publication Number Publication Date
CN109935274A true CN109935274A (en) 2019-06-25
CN109935274B CN109935274B (en) 2021-04-30

Family

ID=66986156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910156934.7A Active CN109935274B (en) 2019-03-01 2019-03-01 Long reading overlap region detection method based on k-mer distribution characteristics

Country Status (1)

Country Link
CN (1) CN109935274B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111326210A (en) * 2020-03-11 2020-06-23 中国科学院生态环境研究中心 Primer design method and system based on k-mer algorithm

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110257889A1 (en) * 2010-02-24 2011-10-20 Pacific Biosciences Of California, Inc. Sequence assembly and consensus sequence determination
CN107841542A (en) * 2016-09-19 2018-03-27 深圳华大基因科技服务有限公司 A kind of generation sequence assemble method of genome contig two and system
CN108830047A (en) * 2018-06-21 2018-11-16 河南理工大学 A kind of scaffolding method based on long reading and contig classification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110257889A1 (en) * 2010-02-24 2011-10-20 Pacific Biosciences Of California, Inc. Sequence assembly and consensus sequence determination
CN107841542A (en) * 2016-09-19 2018-03-27 深圳华大基因科技服务有限公司 A kind of generation sequence assemble method of genome contig two and system
CN108830047A (en) * 2018-06-21 2018-11-16 河南理工大学 A kind of scaffolding method based on long reading and contig classification

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BILAL WAJID,ERCHIN SERPEDIN: ""Review of General Algorithmic Features for Genome Assemblers for Next Generation Sequencers"", 《GENOMICS, PROTEOMICS & BIOINFORMATICS 》 *
JUSTIN CHU,HAMID MOHAMADI: ""Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art"", 《BIOINFORMATICS》 *
陆赢: ""芥菜型油菜A9染色体物理图谱构建及重要基因研究"", 《中国博士学位论文全文数据库农业科技辑》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111326210A (en) * 2020-03-11 2020-06-23 中国科学院生态环境研究中心 Primer design method and system based on k-mer algorithm
CN111326210B (en) * 2020-03-11 2023-07-14 中国科学院生态环境研究中心 Primer design method and system based on k-mer algorithm

Also Published As

Publication number Publication date
CN109935274B (en) 2021-04-30

Similar Documents

Publication Publication Date Title
US20210217490A1 (en) Method, computer-accessible medium and system for base-calling and alignment
US10839940B2 (en) Method, computer-accessible medium and systems for score-driven whole-genome shotgun sequence assemble
Myers Whole-genome DNA sequencing
US20150302144A1 (en) Hierarchical genome assembly method using single long insert library
CA2424031C (en) System and process for validating, aligning and reordering genetic sequence maps using ordered restriction map
US20150178446A1 (en) Iterative clustering of sequence reads for error correction
Luo et al. Computational approaches for transcriptome assembly based on sequencing technologies
CN109935274A (en) A kind of long reading overlay region detection method based on k-mer distribution characteristics
US20230044570A1 (en) Method for determining a measure correlated to the probability that two mutated sequence reads derive from the same sequence comprising mutations
US20220243267A1 (en) Compositions and methods related to quantitative reduced representation sequencing
Rachappanavar et al. Analytical Pipelines for the GBS Analysis
Sundararajan et al. Chaining algorithms for alignment of draft sequence
RU2799778C9 (en) Method for determining indicator correlated with probability that two mutated sequence readings are from the same sequence containing mutation
RU2799778C1 (en) Method for determining indicator correlated with probability that two mutated sequence readings are from the same sequence containing mutation
CN115602246B (en) Sequence alignment method based on group genome
Rescheneder Fast, accurate and user-friendly alignment of short and long read data with high mismatch rates
Indumathy et al. Nature inspired algorithms to solve DNA fragment assembly problem: a survey
Espinosa García et al. Advancements in long-read genome sequencing technologies and algorithms
Weese et al. DNA-Seq Error Correction Based on Substring Indices
Tanaka et al. cycle_finder: de novo analysis of tandem and interspersed repeats based on cycle-finding
Sommer Focus: A Graph Approach for Data-Mining and Domain-Specific Assembly of Next Generation Sequencing Data
Baaijens De novo approaches to haplotype-aware genome assembly
Chandramohan et al. A Multi-Platform de novo Genome Assembly and Comparative Analysis for the Vibrio cholerae-HC-63A
Durai Novel graph based algorithms for transcriptome sequence analysis
Bolognini Unraveling tandem repeat variation in personal genomes with long reads

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant