US20170199960A1

US20170199960A1 - Systems and methods for adaptive local alignment for graph genomes

Info

Publication number: US20170199960A1
Application number: US14/990,323
Authority: US
Inventors: Kaushik Ghose; Wan-Ping Lee
Original assignee: Seven Bridges Genomics Inc
Current assignee: Seven Bridges Genomics Inc
Priority date: 2016-01-07
Filing date: 2016-01-07
Publication date: 2017-07-13
Also published as: US20240096450A1; WO2017120128A1; US20200058374A1; US11810648B2

Abstract

Systems and methods for analyzing genomic information can include obtaining a sequence read including genetic information; identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment can include a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the mapped position of the sequence read can be identified within the genome.

Description

FIELD OF THE INVENTION

The invention relates to systems and methods for adaptive local alignment for graph genomes.

BACKGROUND

A person's genetic information has the potential to reveal much about their health and life. A risk of cancer or a genetic disease may be revealed by the sequences of the person's genes, as well as the possibility that his or her children could inherit a genetic disorder. Genetic information can also be used to identify an unknown organism, such as potentially infectious agents discovered in samples from public food or water supplies. Next-generation sequencing (NGS) technologies are available that can sequence entire genomes quickly. Some approaches to analyzing sequence reads involve mapping the sequence reads to a reference genome. Mapping reads to a reference can be done by aligning each read to a relevant portion of the reference.

SUMMARY

The described methods and systems provide a method for automatically analyzing sequence data using a graph, such as a reference directed acyclic graph (DAG), and accounting for the various difficulties (e.g., complexities) of sequence alignment and genomic variation without requiring intermediate input or guidance from a user. Essentially, the described methods and systems simplify and optimize the process for sequence alignment between a sample sequence and a reference genome because the systems and methods function autonomously from the user (after the sequence information is received). A processor can, for example, identify and align the sequence information in a manner that meets a user's requirements for accuracy and expediency. In some cases, the sequence information can be directly transmitted to the systems described herein, which helps to provide a vastly simplified sequence alignment experience for the user.
Representing genomes as graphs, such as directed acyclic graphs (DAGs), is a powerful tool for coping with the complexities of sequence alignment and genomic variation. In contrast to a single linear reference sequence, a DAG can incorporate myriad types of information about genomic variations, such as single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and larger structural variants (SVs). Each of these variations can be represented by following different branches through the DAG. Additionally, using DAGs can help to increase the probability of identifying small variations near large known structural variants because the presence of the known structural variants in the graph improves the quality of the alignment. Thus, known variations can be reliably accounted for and identified by aligning reads containing the known variations to a sequence path that includes those variations.
However, the complex nature of DAGs can result in a combinatorial explosion when used with conventional linear alignment tools thereby rendering alignment to a DAG computationally expensive and cost prohibitive. The methods and systems described throughout this document can automatically and efficiently analyze sequence data using the reference DAG even when doing so would conventionally be undesirable (e.g., computationally expensive, time consuming, imprecise, and/or cost prohibitive). The systems and methods can analyze the difficulty of aligning the sequence data with the reference DAG based on a predicted alignment difficulty (e.g., due to the DAG complexity, sequence length, total processing time, remaining processing time, or any combination thereof) between the reference DAG and the sequence data. By choosing and employing a selected alignment algorithm based on this predicted alignment difficulty, the methods and systems can efficiently analyze sequence data.
In one aspect of the invention, a method includes identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph including nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment includes a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the method further includes identifying the mapped position of the sequence read within the reference genome.
In another aspect of the invention, the system for determining a subject's genetic information, the system includes a computer system comprising a processor coupled to memory and operable to: receive identities of a plurality of nucleotides at known locations on a reference genome; receive sequence reads from a sample from a subject; and map the sequence reads to the reference genome, thereby identifying a corresponding location on the reference genome. The mapping includes identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment includes a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the system identifies the mapped position of the sequence read within the reference genome.
Implementations can include one or more of the following features.
In some implementations, the method further includes performing, by means of the computer system, a global alignment for the sequence read based on whether the global alignment is advanced or basic. The advanced global alignment includes a first-global-alignment algorithm, and the basic global alignment includes alignment includes a second-global-alignment algorithm. The global alignment provides information indicative of the plurality of candidate mapping positions.
In certain implementations, the first-local-alignment algorithm is different from the second-local-alignment algorithm.
In some implementations, the first-global-alignment algorithm is different from the second-global-alignment algorithm.
In certain implementations, at least one of the first-local-alignment algorithm and the second-local-alignment algorithm is different from at least one of the first-global-alignment algorithm and the second-global-alignment algorithm.
In some implementations, the method further includes determining whether the local alignment is advanced or basic based at least in part on at least one of: the complexity of the graph, a length of the graph, a total processing time, a remaining processing time, and a number of repeating elements.
In certain implementations, a complex graph includes 10 or more nodes.
In some implementations, a simple graph includes 5 or fewer nodes.
In certain implementations, the first-local-alignment algorithm includes a pattern matching algorithm.
In some implementations, the pattern matching algorithm includes at least one of the following: a Boyer-Moore algorithm, a Horspool algorithm, and a Tarhio-Ukkonen algorithm.
In certain implementations, the second-local-alignment algorithm includes a linear alignment algorithm.
In some implementations, each node and edge stores a list of one or more adjacent objects, and wherein identifying the plurality of candidate mapping positions comprises finding related alignments between the sequence read and the reference genome.
In certain implementations, candidate mapping positions relate to the genetic information if a predetermined alignment quality is met between the candidate mapping position and the sequence read.
In some implementations, the method further includes ranking each of the candidate mapping positions based on the quality of the local alignment.
In certain implementations, each of the plurality of candidate mapping positions defines a selected path through the graph.
In some implementations, the processor is further operable to: perform a global alignment, by means of the computer system, for the sequence read based on whether the global alignment is advanced or basic. The advanced global alignment includes a first-global-alignment algorithm, and the basic global alignment includes a second-global-alignment algorithm. The global alignment provides information indicative of the plurality of candidate mapping positions.
In certain implementations, each node and edge stores a list of one or more adjacent objects, and wherein identifying the plurality of candidate mapping positions comprises finding related alignments between the sequence read and the candidate mapping position.
In some implementations, the complexity of the graph varies with a remaining processing time estimate to reduce a total processing time.
In certain implementations, the first-local-alignment algorithm is different from the second-local-alignment algorithm.
In some implementations, the first-global-alignment algorithm is different from the second-global-alignment algorithm.
In certain implementations, at least one of the first-local-alignment algorithm and the second-local-alignment algorithm is different from at least one of the first-global-alignment algorithm and the second-global-alignment algorithm.
In some implementations, the mapping further includes determining whether the local alignment is advanced or basic based at least in part on at least one of: the complexity of the graph, a length of the graph, a total processing time, a remaining processing time, and a number of repeating elements.
In certain implementations, a complex graph includes 10 or more nodes.
In some implementations, a simple graph includes 5 or fewer nodes.
In certain implementations, the first-local-alignment algorithm includes a pattern matching algorithm.
In some implementations, the pattern matching algorithm includes at least one of the following: a Boyer-Moore algorithm and a Tarhio-Ukkonen algorithm.
In certain implementations, the second-local-alignment algorithm includes a linear alignment algorithm.
In some implementations, the candidate mapping positions relate to the genetic information if a predetermined alignment quality is met between the candidate mapping position and the sequence read.
In certain implementations, the mapping further includes ranking each of the candidate mapping positions based on the quality of the local alignment.
In some implementations, each of the plurality of candidate mapping positions defines a selected path through the graph.
Implementations can include one or more of the following advantages.
Many of the methods described herein can account for the complexity of a reference graph by adjusting the alignment methods based on the complexity. Regions of a reference graph with a large number of nodes will typically result in an exponential number of calculations, which leads to excessive cost and processing time during the alignment process. Other sequence alignment methods, such as the Graph Smith-Waterman algorithm, have been adapted to accommodate this complexity and to work with DAGs (e.g., as disclosed in U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613, both incorporated by reference); however, these methods have a high initial computational cost and therefore are not amenable for use with large data sets. By adjusting the alignment methods based on the complexity of the reference DAG, an optimal alignment method can be used. For example, if genomic information is being aligned with a relatively simple reference graph, a linear alignment method can be used to save resources (e.g., time and computational costs). In other cases, if the reference graph is relatively complex, other sequence alignment methods, such as the modified Graph Smith-Waterman algorithm may be an alignment tool well-suited to the task despite the high initial computational costs.
In some implementations, the predicted alignment difficulty can vary during the alignment process. The alignment techniques described herein can continually assess the alignment difficulty and dynamically adapt the alignment tools in response to any changes. For example, two or more alignment tools can be used during the same alignment process.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 diagrams a method for aligning sequence reads to a reference genome.

FIG. 2 diagrams a method for performing a global alignment of a sequence read to a reference genome.

FIG. 3 shows the matrices that represent the comparison.

FIG. 4 illustrates finding alignments between a sequence and a reference DAG.

FIG. 5 shows the use of an adjacency list for each vertex and edge in a DAG.

FIG. 6 shows an exemplary DAG and sequence read.

FIG. 7 shows an exemplary DAG representing a single nucleotide polymorphism (C or D) and an insertion (EG or EFG).

FIG. 8 shows an exemplary DAG representing a highly variable region of the genome that includes information regarding various structural variations, SNPs, and small insertions and deletions.

FIG. 9 illustrates the relationship between processing time and DAG complexity for a pattern matching algorithm (such as Tarhio-Ukkonen) and a graph aware alignment algorithm (such as a Graph Smith-Waterman alignment method).

FIG. 10 illustrates a system for implementing the methods diagramed in FIGS. 1 and 2.

DETAILED DESCRIPTION

In general, the methods and systems described herein relate to determining a subject's genetic information. Information representing at least a portion of the subject's DNA (e.g., a sequence read) is globally aligned with a graph, such as a reference genomic directed acyclic graph (DAG), to identify candidate paths or mapping positions through the genomic DAG. The genomic information can then be compared against the candidate paths with a higher degree of accuracy as compared to the original global alignment. This two-step alignment analysis allows for the selection of an appropriate alignment algorithm based on the alignment difficulty and an acceptable alignment tolerance, thereby reaping considerable computational savings resources because resources are not needlessly expended.
Additionally, the present disclosure recognizes, in certain embodiments, that the problem of combinatorial explosion associated with aligning sequence data to genomic DAGs is solved by an alignment method that analyzes the complexity of a DAG and chooses an alignment algorithm based on the complexity. For example, sequence reads can be quickly aligned to portions of a DAG that have low complexity using a pattern matching algorithm or a linear alignment algorithm. However, for complex portions of the DAG, the number of paths that must be analyzed by a pattern matching or linear alignment algorithm grows exponentially, leading to a corresponding exponential decrease in performance of the pattern matching or linear alignment tool. In these cases, a graph-aware algorithm may be substituted. The graph-aware algorithm may have a high initial overhead cost, but its decrease in performance scales linearly as the complexity of the graph increases, making it a preferred choice for complex regions of a DAG.
The following describes methods for indexing collections of data strings represented in a form of DAGs and performing efficient string search including exact and approximate matching techniques. Among other applications, these methods can be used to quickly and efficiently find genomic data sequences (e.g., reads produced by DNA sequencing machines) in genome graphs representing genome data variations.
FIG. 1 diagrams a method 101 for analyzing genomic sequence information. At step 102 sequence reads from a sample from a subject are received. Sequence reads can be received from a nucleic acid sequencing instrument, such as an Illumina® HiSeq2500, a Roche 454 GS FLX+, an Ion Torrent PGM™, and the like. A processor coupled to the tangible memory device is used to globally align 104 the sequence read against the reference genome to identify 106 one or more candidate mapping positions. These candidate mapping positions represent acceptable alignments between the received sequence reads and the paths through the reference DAG (e.g., a location within the reference genome that could be related to the sequence read). In some cases, a report can be provided that identifies a one or more of the candidate mapping positions.
Next-generation sequencing and alignment can be computationally expensive. For example, over 900 million paired-end 100 bp sequence reads are needed to obtain 30× coverage of the human genome. As will be described in further detail below, certain algorithms, such as various pattern matching and linear alignment algorithms, scan each sequence read over a reference sequence to identify an exact, optimal, or best-fit match. But this approach is typically computationally infeasible given the scales associated with next-generation sequencing. To solve this problem, a heuristic approach may be used to reduce the total number of computations required. For example, one approach is to separate an alignment into “coarse” and “fine” stages. In the coarse stage, a fast, yet often incomplete or inexact alignment is performed to identify a number of candidate regions to which a particular sequence read aligns. This is referred to as a global alignment or a global search because the read has aligned considering the entirety of the reference sequence. In the second, fine stage, each candidate region is subsequently analyzed using a more computationally expensive, yet exact algorithm. This is referred to as a local alignment or local search because the alignment is restricted to the local space identified during the coarse stage. In certain examples, this may be referred to as a “seed and extend” strategy; the global alignment yields a number of seed positions, which are then extended during the local alignment stage. The candidate positions may then be scored and ranked, yielding the best aligned position for each sequence read.
For example, one approach for performing a global alignment 104 is to place many small sections, or k-mers, of the reference DAG into a computer hash function. The hash function calculates an index as a function of the k-mer. The index identifies an entry in a hash table, and the positions of the k-mer within the reference DAG are stored in that entry, i.e., in the entry indexed by the hash of the k-mer. A sequence read is analyzed by hashing some or all of its k-mers and the corresponding entries of the DAG hash table are accessed to read positions within the DAG where those sequence read k-mers can be found. Sections of the DAG where a threshold number of k-mer positions are found are identified as candidate regions within the represented genomes for alignment or mapping of the sequence read. By the described implementation, a sequence read can be mapped to a large number of “good fit” positions within a reference genome very rapidly.
A sequence read can be quickly mapped to a reference using a global alignment algorithm because hash values can be retrieved rapidly (i.e., without iterating over entries in an array). Since a DAG can store a great amount of genetic variation, a sequence read can be mapped to an appropriate portion of one or more certain genomes from among numerous candidate genomes very rapidly. Since sequence reads can be mapped to biologically relevant reference genomes very rapidly, a patient's genetic information can be discerned, or genetic sequences can be identified quickly even against a background of all genetic variety represented in the system. For example, aligning a patient's sequence reads to a DAG representing variations in cancer genomes (such as those from The Cancer Genome Atlas) can be used to quickly identify particular mutations associated with cancer present in the patient.
Once a plurality of candidate mapping positions has been identified 106 by a global searching algorithm 104, a subsection of the graph at each candidate mapping position is identified. In some examples, the subsection of the graph includes at least the number of base pairs in the sequence read for searching (e.g., 50, 100, 200 bp). However, longer subsections can be preferable for local search. Using longer subsections is particularly advantageous when using paired-end or mate-pair sequence reads, which are pairs of sequence reads describing the 5′ and 3′ ends of a single DNA molecule, often with an inferred insert size distance of 300-600 (paired-end) or 2,000-5,000 (mate pair) base pairs between the 5′ and 3′ sequence reads. Accordingly, the portions or subsections of the graph for each of the candidate mapping positions can be a variety of sizes ranging from 50 to even 10,000 base pairs.
The processor can then perform 108 a local alignment for each of the candidate mapping positions. With the candidate mapping positions (e.g., “good fit” positions) identified, a processor can perform 108 a local alignment at each of the candidate mapping positions to narrow the candidate mapping positions to identify one or more mapped positions (e.g., “better fit” or “best fit” positions). A processor can start 110 the local alignment process and then determine 112 whether an alignment with the reference DAG surrounding the candidate mapping positions (e.g., the subset of the reference DAG surrounding a candidate mapping position) is advanced or basic. In some examples, an alignment is advanced if the complexity of the subset of the reference DAG exceeds 5 paths. In other examples, the alignment is considered basic if the complexity of the subset of the reference DAG is 5 paths or less. If the alignment is advanced 114, then the processor can use a first alignment tool implementing a first method 116. If the alignment is not advanced (i.e., basic), then the processor can use a second alignment tool implementing a second method 118. The processor can repeat 120 this process until all candidate mapping positions are locally aligned. The processor can then rank 122 the candidate mapping positions to identify the most likely alignment positions corresponding to the received sequence read before ending 124 the local alignment process. Based on these rankings, the processor can identify 126 the mapped position within an acceptable margin of error.
While the method 101 comprises performing 104 a global alignment against a reference DAG, in certain embodiments, a global alignment may be simply performed against a linear reference sequence. This can be advantageous in certain embodiments where using a linear reference sequence for the global alignment can help improve performance. However, the local alignment 108 may still consider a reference DAG structure. Various embodiments are considered to be within the scope of the disclosure.
In certain embodiments, a global alignment may also consider whether an alignment with the DAG is advanced or basic and select an appropriate algorithm accordingly. FIG. 2 illustrates an exemplary global alignment process 201 for identifying candidate mapping positions for a sequence read. As with the local alignment process 108, the global alignment process 201 can evaluate the reference DAG to determine if the global alignment is advanced or basic. For example, this may be performed in situations wherein the reference DAG is small enough such that pattern matching and linear alignment algorithms may be used for the global alignment stage. For example, a processor can start 202 the local alignment process and then determine 204 whether an alignment with the reference DAG and the sequence read is advanced or basic. In some examples, an alignment is advanced if the complexity of the DAG exceeds 6 paths, as discussed below. In other examples, the alignment is basic if the complexity of the DAG is 6 paths or less. If the alignment is advanced 206, then the processor can use a first alignment tool implementing a first method 208. If the alignment is not advanced (i.e., basic), then the processor can use a second alignment tool implementing a second method 210. The first method 208 and the second method 210 may be the same, different, or partially overlap. In some examples, the first method 208 and/or the second method 210 can correspond to at least one of the first method 116 or the second method 118. Using these alignment tools, the processor can identify 212 the candidate mapping positions. The processor can then repeat this process until the global alignment is complete 214. The processor can then end 216 the global alignment process 201.
If the alignment is basic 118, the second alignment tool may comprise linear alignment methods, such as optimal alignment or pattern matching algorithms. For example, the second alignment tool may comprise an optimal alignment algorithm, such as a Smith-Waterman algorithm, a Needleman-Wunsch algorithm, and the like. These algorithms use a dynamic programming approach in order to find an optimal local alignment between two sequences, but are computationally expensive. To improve performance, pattern matching or string matching algorithms, such as a Boyer-Moore algorithm, a Horspool algorithm, and a Tarhio-Ukkonen algorithm may be used. These algorithms are fast, but typically do not handle variations (such as indels and SNPs) as well as an optimal alignment algorithm. However, pattern matching and string matching algorithms are particularly well suited for aligning sequence reads against graph genomes, such as DAGs. As known variations are already present in the graph, the pattern matching algorithm does not need to consider the presence of indels or other variations within a sequence read; it need only map the sequence read to the most likely branch within the DAG having that variation.
While pattern matching and linear alignment algorithms can function as a fast local alignment tool, each of these methods is easily susceptible to efficiency loss when faced with a high number of comparisons when, for example, a linearized graph genome represents a significant number of combinations. Accordingly, if the alignment is advanced 114, the first alignment tool may comprise a graph-aware or multi-dimensional algorithm, such as Graph Smith-Waterman, as explained in further detail below.
Optimal Alignment Algorithms
Pairwise alignment generally involves placing one sequence along part of target, introducing gaps according to an algorithm, scoring how well the two sequences match, and preferably repeating for various positions along the reference. The best-scoring match is deemed to be the alignment and represents an inference of homology between alignment portions of the sequences. In some embodiments, scoring an alignment of a pair of nucleic acid sequences involves setting values for the probabilities of substitutions and indels. When individual bases are aligned, a match or mismatch contributes to the alignment score by a substitution score, which could be, for example, 1 for a match and −0.33 for a mismatch. An indel deducts from an alignment score by a gap penalty, which could be, for example, −1. Gap penalties and substitution probabilities can be based on empirical knowledge or a priori assumptions about how sequences evolve. Their values affect the resulting alignment. Particularly, the relationship between the gap penalties and substitution probabilities influences whether substitutions or indels will be favored in the resulting alignment.
Stated formally, an alignment represents an inferred relationship between two sequences, x and y. For example, in some embodiments, an alignment A of sequences x and y maps x and y respectively to another two strings x′ and y′ that may contain spaces such that: (i) |x′|=|y′|; (ii) removing spaces from x′ and y′ should get back x and y, respectively; and (iii) for any i, x′[i] and y′[i] cannot be both spaces.
A gap is a maximal substring of contiguous spaces in either x′ or y′. An alignment A can include the following three kinds of regions: (i) matched pair (e.g., x′[i]=y′[i]; (ii) mismatched pair, (e.g., x′[i]—y′[i] and both are not spaces); or (iii) gap (e.g., either x′[i . . . j] or y′[i . . . j] is a gap). In certain embodiments, only a matched pair has a high positive score a. In some embodiments, a mismatched pair generally has a negative score b and a gap of length r also has a negative score g+rs where g, s<0. For DNA, one common scoring scheme (e.g. used by BLAST) makes score a=1, score b=−3, g=−5 and s=−2. The score of the alignment A is the sum of the scores for all matched pairs, mismatched pairs and gaps. The alignment score of x and y can be defined as the maximum score among all possible alignments of x and y.
Any pair may have a score a defined by a 4×4 matrix B of substitution probabilities. For example, B(i,i)=1 and 0<B(i,j)<1 [for i≠j], is one possible scoring system. For instance, where a transition is thought to be more biologically probable than a transversion, matrix B could include B(C,T)=0.7 and B(A,T)=0.3, or other values desired or determined by methods known in the art.
A pairwise alignment, generally, involves—for sequence Q (query) having m characters and a reference genome T (target) of n characters—finding and evaluating possible local alignments between Q and T. For any 1≦i≦n and 1≦j≦m, the largest possible alignment score of T[h . . . i] and Q[k . . . j], where h≦i and k≦j, is computed (i.e. the best alignment score of any substring of T ending at position i and any substring of Q ending at position j). This can include examining all substrings with cm characters, where c is a constant depending on a similarity model, and aligning each substring separately with Q. Each alignment is scored, and the alignment with the preferred score is accepted as the alignment. One of skill in the art will appreciate that there are exact and approximate algorithms for sequence alignment. Exact algorithms will find the highest scoring alignment, but can be computationally expensive. Two well-known exact algorithms are Needleman-Wunsch (J Mol Biol, 48(3):443-453, 1970) and Smith-Waterman (J Mol Biol, 147(1):195-197, 1981; Adv. in Math. 20(3), 367-387, 1976). A further improvement to Smith-Waterman by Gotoh (J Mol Biol, 162(3), 705-708, 1982) reduces the calculation time from O(m²n) to O(mn) where m and n are the sequence sizes being compared and is more amendable to parallel processing. In the field of bioinformatics, it is Gotoh's modified algorithm that is often referred to as the Smith-Waterman algorithm. Smith-Waterman approaches are being used to align larger sequence sets against larger reference sequences as parallel computing resources become more widely and cheaply available. See, e.g., Amazon's cloud computing resources. All of the journal articles referenced herein are incorporated by reference in their entireties.
The original Smith-Waterman (SW) algorithm is a linear alignment algorithm that aligns linear sequences by rewarding overlap between bases in the sequences, and penalizing gaps between the sequences. Smith-Waterman also differs from Needleman-Wunsch, in that SW does not require the shorter sequence to span the string of letters describing the longer sequence. That is, SW does not assume that one sequence is a read of the entirety of the other sequence. Furthermore, because SW is not obligated to find an alignment that stretches across the entire length of the strings, a local alignment can begin and end anywhere within the two sequences.
The original SW algorithm is expressed for an n×m matrix H, representing the two strings of length n and m, in terms of equation (1):
H_k0=H_01=0 (for 0≦k≦n and 0≦l≦m)
H_ij=max{H_(i−1,j−1)+s(a_i,b_j),H_(i−1,j)−W_in,H_(i,j−1)−W_del,0} (1)
(for 1≦i≦n and 1≦j≦m)
In the equations above, s(ai,bj) represents either a match bonus (when ai=bj) or a mismatch penalty (when ai≠bj), and insertions and deletions are given the penalties Win and Wdel, respectively. In most instances, the resulting matrix has many elements that are zero. This representation makes it easier to backtrace from high-to-low, right-to-left in the matrix, thus identifying the alignment.
Once the matrix has been fully populated with scores, the SW algorithm performs a backtrack to determine the alignment. Starting with the maximum value in the matrix, the algorithm will backtrack based on which of the three values (Hi−1,j−1, Hi−1,j, or Hi,j−1) was used to compute the final maximum value for each cell. The backtracking stops when a zero is reached. The optimal-scoring alignment may contain greater than the minimum possible number of insertions and deletions, while containing far fewer than the maximum possible number of substitutions.
SW or SW-Gotoh may be implemented using dynamic programming to perform local sequence alignment of the two strings, S and A, of sizes m and n, respectively. This dynamic programming employs tables or matrices to preserve match scores and avoid re-computation for successive cells. Each element of the string can be indexed with respect to a letter of the sequence, that is, if S is the string ATCGAA, S[1]=A.
Instead of representing the optimum alignment as Hi,j (above), the optimum alignment can be represented as B[j,k] in equation (2) below:
B[j,k]=max(p[j,k],i[j,k],d[j,k],0) (for 0<j≦m,0<k≦n) (2)
The arguments of the maximum function, B[j,k], are outlined in equations (3)-(5) below, wherein MISMATCH_PEN, MATCH_BONUS, INSERTION_PEN, DELETION_PEN, and OPENING_PEN are all constants, and all negative except for MATCH_BONUS (PEN is short for PENALTY). The match argument, p[j,k], is given by equation (3), below:
p[j,k]=max(p[j−1,k−1],i[j−1,k−1],d[j−1,k−1])+MISMATCH_PEN, if S[j]A[k]=max(p[j−1,k−1],i[j−1,k−1],d[j−1,k−1])+MATCH_BONUS, if S[j]=A[k] (3)
the insertion argument i[j,k], is given by equation (4), below:
i[j,k]=max(p[j−1,k]+OPENING_PEN,i[j−1,k],d[j−1,k]+OPENING_PEN)+INSERTION_PEN (4)
and the deletion argument d[j,k], is given by equation (5), below:
d[j,k]=max(p[j,k−1]+OPENING_PEN,i[j,k−1]+OPENING_PEN,d[j,k−1])+DELETION_PEN (5)
For all three arguments, the [0,0] element is set to zero to assure that the backtrack goes to completion, i.e., p[0,0]=i[0,0]=d[0,0]=0.
The scoring parameters are somewhat arbitrary, and can be adjusted to achieve the behavior of the computations. One example of the scoring parameter settings (Huang, Chapter 3: Bio-Sequence Comparison and Alignment, ser. Curr Top Comp Mol Biol. Cambridge, Mass.: The MIT Press, 2002) for DNA would be:
MATCH_BONUS: 10
MISMATCH_PEN: −20
INSERTION_PEN: −40
OPENING_PEN: −10
DELETION_PEN: −5
The relationship between the gap penalties (INSERTION_PEN, OPENING_PEN) above help limit the number of gap openings, i.e., favor grouping gaps together, by setting the gap insertion penalty higher than the gap opening cost. Of course, alternative relationships between MISMATCH_PEN, MATCH_BONUS, INSERTION_PEN, OPENING_PEN and DELETION_PEN are possible.
Smith-Waterman and Needleman-Wunsch are exact alignment algorithms and identify optimal alignments, including those with short insertions and deletions. However, it can be as computationally expensive as naïve pattern matching algorithms, and has a processing time typically on the order of O(mn). As a result, the use of the Smith-Waterman algorithm is primarily reserved for local alignment. However, when applied to a multi-dimensional data structure such as a reference DAG, these algorithms can take an excessive amount of time to complete. In these cases, more efficient pattern matching algorithms may be substituted.
Pattern Matching Algorithms
Pattern matching algorithms use mathematical techniques to reduce the total number of pairwise comparisons in order to improve efficiency. In a short read alignment, a short read P having n nucleotides is aligned with the left-most portion of a reference sequence T, creating a window of length n. After a match or mismatch determination, the window is shifted to the right for additional comparisons. Pattern matching algorithms increase local alignment efficiency by shifting the window farther to the right that a single character, i.e., by skipping a certain number of comparisons which would result in a mismatch. In particular, some pattern matching algorithms process a sequence read, or pattern, in order to create a “shift table” that indicates how many comparisons can safely be skipped depending on the mapping of the short read at the current position.
To create a shift table, many pattern matching algorithms operate in two phases: a pre-processing phase and a search phase. The pre-processing phase typically processes the pattern itself to generate information that can aid the search phase by providing a table of shift values after each comparison. Improvement in efficiency of a search can be achieved by altering the order in which the characters are compared at each attempt, and by choosing a shift value that permits the skipping of a predefined number of characters in the text after each attempt. The search phase then uses this information to ignore certain character comparisons, reducing the total number of comparisons and thus the overall execution time.
The Boyer-Moore algorithm is one of the most efficient. See Boyer, R. S., Moore, J. S., “A Fast String Searching Algorithm”, Comm. ACM 20(10), 762-772. Boyer-Moore preprocesses a pattern to create two tables: a Boyer-Moore bad character (bmBc) table and a Boyer-Moore good suffix (bmGs) table. For each character in the alphabet (e.g., “A”, “C”, G″, and “T”, representing the four possible nucleotides), the bad character table stores a shift value based on the occurrence of the character in the pattern (i.e., short read). The good-suffix table stores the matching shift value for each character in the read. The maximum of the shift value in either table is considered after each comparison during a search, which is then used to advance the window. Boyer-Moore does not allow any mismatches between the read and the reference sequence, and thus is a perfect-match search algorithm.
Boyer-Moore serves as the basis for several pattern matching algorithms that can be used for sequence alignment. One variation of Boyer-Moore is the Horspool algorithm. See Horspool, R. N., “Practical Fast Searching in Strings”, Software—Practice & Experience, 10, 501-506. Horspool simplifies Boyer-Moore by recognizing that the bad-character shift of the right-most character of the window can be sufficient to compute the value of the shift. These shift values are then determined in the pre-processing stage for all of the characters in the alphabet, i.e., the possible nucleotides. Horspool is particularly efficient in practical situations when the length of the pattern is small, such as for sequence reads generated by next-generation sequencing. Like Boyer-Moore, Horspool is a perfect-match search algorithm.
Another variation of Boyer-Moore that allows for a certain number of mismatches is the Tarhio-Ukkonen algorithm. See Tarhio, J., Ukkonen, E., “Approximate Boyer-Moore String Matching”, SIAM J. Comput., 22, 243-260. Because it allows for mismatches, Tarhio-Ukkonen can be used to provide approximate alignments for next generation sequencing reads to a reference sequence. During a pre-processing phase, Tarhio-Ukkonen calculates shift values for every character that take into account a desired allowable number of mismatches, k. Tarhio-Ukkonen is able to identify matches in expected time
$(kn (\frac{1}{m - k} + \frac{k}{c})),$
where c is the size of the alphabet. The use of approximation in pattern matching algorithms such as Tarhio-Ukkonen is particularly useful for next generation sequencing alignments due to inherent noise present in sequence reads generated by most next generation sequencing technologies. Furthermore, because the primary goal of most genomic sequence analyses is to identify unknown variations (primarily, SNPs) that may be present in a sample, approximate pattern matching algorithms help to identify novel polymorphisms in sequence data. In contrast, the Boyer-Moore and Horspool algorithms would reject such reads if there were any differences between the read and the reference.
When used for next generation sequencing alignment, pattern matching algorithms can be used as a fast local alignment tool. In particular, pattern matching algorithms typically have performance that exceeds that of traditional local alignment algorithms for sequence data, such as Needleman-Wunsch and Smith-Waterman. However, when applied to a graph genome, combinatorial complexity can quickly increase the number of comparisons to a point in which improvements in efficiency are lost. In other words, for a local alignment to a subset of a reference DAG, the expected time for the Boyer-Moore family is multiplied by the number of linearized paths. Additionally, despite variants that are able to address mismatches and small insertions and deletions, the Boyer-Moore family of algorithms do not perform particularly well when handling alignments involving novel variations (that are not already present in the graph). Exact alignment algorithms, such as Needleman-Wunsch and Smith-Waterman are able to accurately identify variations; however, these algorithms run in O(mn) time and similarly suffer from combinatorial explosion when applied to a DAG.
One solution to this problem is to use a multi-dimensional or graph-aware algorithm, such as if the alignment is advanced. One example of a graph-aware alignment algorithm is a modified Smith-Waterman algorithm that considers the highest scoring paths through a DAG (“Graph Smith Waterman”). Like Smith-Waterman, Graph Smith-Waterman is an exact alignment algorithm and thus is able to identify optimal local alignments that include insertions and deletions. However, when applied to a linear reference or a DAG having low complexity, Graph Smith-Waterman can be slow relative to the alignment speed of a pattern matching algorithm.
Graph-Aware Alignment Algorithms
Certain alignment algorithms have been developed that are able to provide accurate and relatively fast local alignments when used with DAGs or other graph data structures. For example, a Graph Smith-Waterman algorithm as described below calculates a Smith-Waterman score matrix between the sequence read and each node in a DAG, such that the maximum scores are propagated across the matrices for each node. In this way, the linear portions of the DAG with the highest scores in relation to the sequence read are serialized and chained together. One need only look back through the chained Smith-Waterman matrices to identify both the optimal alignment and path of the DAG.
In some embodiments, the methods and systems of the invention use a modified Smith-Waterman operation that involves a multi-dimensional look-back through the reference DAG (such as the reference DAG 331 of FIG. 3). Multi-dimensional operations of the invention provide for a “look-back” type analysis of sequence information (as in Smith-Waterman), wherein the look back is conducted through a multi-dimensional space that includes multiple pathways and multiple nodes. The multi-dimensional algorithm can be used to align sequence reads against the graph-type reference. That alignment algorithm identifies the maximum value for Ci,j by identifying the maximum score with respect to each sequence contained at a position on the graph. In fact, by looking “backwards” at the preceding positions, it is possible to identify the optimum alignment across a plurality of possible paths.
The modified Smith-Waterman operation described here, aka the multi-dimensional alignment, provides exceptional speed when performed in a genomic graph system that employs physical memory addressing (e.g., through the use of native pointers or index free adjacency as discussed above). The combination of multi-dimensional alignment to the graph of reference DAG 331 with the use of spatial memory addresses (e.g., native pointers or index-free adjacency) improves what the computer system is capable of, facilitating whole genomic scale analysis to be performed using the methods described herein.
The operation includes aligning a sequence, or string, to a graph. For the purpose of defining the algorithm, let S be the string being aligned, and let D be the directed graph to which S is being aligned. The elements of the string, S, are bracketed with indices beginning at 1. Thus, if S is the string ATCGAA, S[1]=A, S[4]=G, etc.
In certain embodiments, for the graph, each letter of the sequence of a node will be represented as a separate element, d. In a preferred embodiment, node or edge objects contain the sequences and the sequences are stored as the longest-possible string in each object. A predecessor of d is defined as:
(i) If d is not the first letter of the sequence of its object, the letter preceding d in its object is its (only) predecessor;
(ii) If d is the first letter of the sequence of its object, the last letter of the sequence of any object that is a parent of d's object is a predecessor of d.
The set of all predecessors is, in turn, represented as P[d].
In order to find the “best” alignment, the algorithm seeks the value of M[j,d], the score of the optimal alignment of the first j elements of S with the portion of the graph preceding (and including) d. This step is similar to finding Hi,j in equation 1 above. Specifically, determining M[j,d] involves finding the maximum of a, i, e, and 0, as defined below:
M[j,d]=max{a,i,e,0} (6)
where
e=max{M[j, p*]+DELETE_PEN} for p* in P[d]
i=M[j−1, d]+INSERT_PEN
a=max{M[j−1, p*]+MATCH_SCORE} for p* in P[d], if S[j]=d;
max{M[j−1, p*]+MISMATCH_PEN} for p* in P[d], if S[j]≠d
As described above, e is the highest of the alignments of the first j characters of S with the portions of the graph up to, but not including, d, plus an additional DELETE_PEN. Accordingly, if d is not the first letter of the sequence of the object, then there is only one predecessor, p, and the alignment score of the first j characters of S with the graph (up-to-and-including p) is equivalent to M[j,p]+DELETE_PEN. In the instance where d is the first letter of the sequence of its object, there can be multiple possible predecessors, and because the DELETE_PEN is constant, maximizing [M[j, p*]+DELETE_PEN] is the same as choosing the predecessor with the highest alignment score with the first j characters of S.
In equation (6), i is the alignment of the first j−1 characters of the string S with the graph up-to-and-including d, plus an INSERT_PEN, which is similar to the definition of the insertion argument in SW (see equation 1).
Additionally, a is the highest of the alignments of the first j characters of S with the portions of the graph up to, but not including d, plus either a MATCH_SCORE (if the jth character of S is the same as the character d) or a MISMATCH_PEN (if the jth character of S is not the same as the character d). As with e, this means that if d is not the first letter of the sequence of its object, then there is only one predecessor, i.e., p. That means a is the alignment score of the first j−1 characters of S with the graph (up-to-and-including p), i.e., M[j−1,p], with either a MISMATCH_PEN or MATCH_SCORE added, depending upon whether d and the jth character of S match. In the instance where d is the first letter of the sequence of its object, there can be multiple possible predecessors. In this case, maximizing {M[j,p*]+MISMATCH_PEN or MATCH_SCORE} is the same as choosing the predecessor with the highest alignment score with the first j−1 characters of S (i.e., the highest of the candidate M[j−1,p*] arguments) and adding either a MISMATCH_PEN or a MATCH_SCORE depending on whether d and the jth character of S match.
Again, as in the SW algorithm, the penalties, e.g., DELETE_PEN, INSERT_PEN, MATCH_SCORE and MISMATCH_PEN, can be adjusted to encourage alignment with fewer gaps, etc.
As described in the equations above, the operation finds the optimal (e.g., maximum) value for a sequence read 103 to the reference DAG 331 by calculating not only the insertion, deletion, and match scores for that element, but looking backward (against the direction of the graph) to any prior nodes on the graph to find a maximum score.
FIG. 3 shows matrices 301 that represent the comparison. The modified Smith-Waterman operation of the invention identifies the highest score and performs a backtrack to identify the proper alignment of the sequence. See, e.g., U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613, both incorporated by reference. Systems and methods of the invention can be used to provide a report that identifies a modified base at the position within the genome of the subject. Other information may be found in Kehr et al., 2014, Genome alignment with graph data structures: a comparison, BMC Bioinformatics 15:99 and Lee, 2003, Generating consensus sequences from partial order multiple sequence alignment graphs, Bioinformatics 19(8):999-1008, both incorporated by reference. Thus, identifying an alignment for the genomic information from the sample (e.g., the sequence read 103 of the embodiment of FIG. 4) may include aligning 129 (e.g., using the global alignment 106 or the local alignment 108) one or more of the sequence reads 103 to the reference DAG 331, as shown in FIG. 4. The branches to which the sequence reads 103 have been aligned are identified, along with the corresponding regions within the reference genome.
Graph Smith-Waterman is able to achieve similar computational speed as Smith-Waterman, yet does not suffer from the combinatorial explosion associated with linearizing complex regions of DAGs. In contrast, the performance of Graph Smith-Waterman scales linearly as the complexity of the DAG increases. Additionally, like Smith-Waterman, Graph Smith-Waterman is an exact alignment algorithm and thus is able to identify optimal local alignments that include insertions and deletions. However, when applied to a linear reference sequence or a DAG having low complexity, Graph Smith-Waterman can be slow compared to pattern matching algorithms.
FIG. 4 illustrates finding 129 alignments 401 between the sequence 103 (or a consensus sequence representing an assembly or one or more sequence reads 103) and the reference DAG 331. Alignments can be performed directly against the DAG using the Graph Smith-Waterman algorithm, for example; however, as will be discussed below, alignments may also be performed against the DAG using linear alignment or pattern matching algorithms by first linearizing the DAG.
FIG. 4 also illustrates an optional variant calling 807 step to identify a subject genotype 831. Using alignment operations of the invention, reads can be rapidly mapped to the reference DAG 331 despite their large numbers or short lengths. Numerous benefits obtain by using a graph as a reference. For example, aligning against a graph is more accurate than aligning against a linear reference and then attempting to adjust one's results in light of other extrinsic information. This is primarily because the latter approach enforces an unnatural asymmetry between the sequence used in the initial alignment and other information. Aligning against an object that potentially represents all the relevant physical possibilities is much more computationally efficient than attempting to align against a linear sequence for each physical possibility (the number of such possibilities will generally be exponential in the number of junctions).
Many implementations provide methods that rely on a directed acyclic data structure with the data in an annotated transcriptome. In such a data structure, features (e.g., exons and introns) from the annotated transcriptome are represented as nodes, which are connected by edges. An isoform from the real-world transcriptome is thus represented by a corresponding path through the DAG-based data structure.
Aspects of the described methods relate to the creation or use of a DAG that includes features such as introns and exons from one or more known references. A DAG is understood in the art to refer to data that can be presented as a graph as well as to a graph that presents that data. The DAG can be stored as data that can be read by a computer system for bioinformatic processing or for presentation as a graph. A DAG can be saved in any suitable format including, for example, a list of nodes and edges, a matrix or a table representing a matrix, an array of arrays or similar variable structure representing a matrix, in a language built with syntax for graphs, in a general markup language purposed for a graph, or others. Further, while the DAG is a directed acyclic data structure, in certain embodiments, a reference graph may lack direction or not be acyclic. Various embodiments are considered to be within the scope of the disclosure.
DAG Representation and Storage
FIG. 5 shows the use of an adjacency list 502 for each vertex 505 and edge 509. A computer system 901 (shown in FIG. 10) creates the reference DAG 531 using an adjacency list 502 for each vertex and edge, wherein the adjacency list 502 for a vertex 505 or edge 509 lists the edges or vertices to which that vertex or edge is adjacent. Each entry in adjacency list 502 is a pointer to the adjacent vertex or edge.
Preferably, each pointer identifies a physical location in the memory subsystem at which the adjacent object is stored. In the preferred embodiments, the pointer or native pointer is manipulatable as a memory address in that it points to a physical location on the memory and permits access to the intended data by means of pointer dereference. That is, a pointer is a reference to a datum stored somewhere in memory; to obtain that datum is to dereference the pointer. The feature that separates pointers from other kinds of reference is that a pointer's value is interpreted as a memory address, at a low-level or hardware level. The speed and efficiency of the described graph genome engine allows a sequence to be queried against a large-scale genomic reference graph (e.g., the reference DAG 531) representing millions or billions of bases, using the computer system 901. Such a graph representation provides means for fast random access, modification, and data retrieval.
In some embodiments, fast random access is supported and graph object storage are implemented with index-free adjacency in that every element contains a direct pointer to its adjacent elements (e.g., as described in U.S. Pub. 2014/0280360 and U.S. Pub. 2014/0278590, each incorporated by reference), which obviates the need for index look-ups, allowing traversals (e.g., as done in the modified SW alignment operation described herein) to be very rapid. Index-free adjacency is another example of low-level, or hardware-level, memory referencing for data retrieval (as required in alignment and as particularly pays off in terms of speed gains in the modified, multi-dimensional Smith-Waterman alignment described below). Specifically, index-free adjacency can be implemented such that the pointers contained within elements are references to a physical location in memory.
Since a technological implementation that uses physical memory addressing such as native pointers can access and use data in such a lightweight fashion without the requirement of separate index tables or other intervening lookup steps, the capabilities of a given computer, e.g., any modern consumer-grade desktop computer, are extended to allow for full operation of a genomic-scale graph (i.e., the reference DAG 531 as a graph that represents all loci in a substantial portion of the reference genome). Thus storing graph elements (e.g., nodes and edges) using a library of objects with native pointers or other implementation that provides index-free adjacency actually improves the ability of the technology to provide storage, retrieval, and alignment for genomic information since it uses the physical memory of a computer in a particular way.
While no specific format is required for storage of a graph, FIG. 5 is presented to illustrate a useful format. With reference back to FIG. 1, it is noted that methods of the invention use the stored graph (e.g., the reference genome) with sequence reads that are obtained from a subject. In some embodiments, sequence reads or obtained as an electronic article, e.g., uploaded, emailed, or FTP transferred from a lab to the computer system 901. In certain embodiments, sequence reads are obtained by sequencing.
In some embodiments, a DAG is stored as a list of nodes and edges. One such way is to create a text file that includes all nodes, with an ID assigned to each node, and all edges, each with the node ID of starting and ending node. Thus, for example, were a DAG to be created for two sentences, “See Jane run,” and “Run, Jane run,”, a case-insensitive text file could be created. Any suitable format could be used. For example, the text file could include comma-separated values. Naming this DAG “Jane” for future reference, in this format, the DAG “Jane” may read as follows: 1 see, 2 run, 3 jane, 4 run, 1-3, 2-3, 3-4. One of skill in the art will appreciate that this structure is applicable to the introns and exons represented in FIGS. 7 and 8, and the accompanying discussion below.
In certain embodiments, a DAG is stored as a table representing a matrix (or an array of arrays or similar variable structure representing a matrix) in which the (i,j) entry in the N×N matrix denotes that node i and node j are connected (where N is a vector containing the nodes in genomic order). For the DAG to be acyclic simply requires that all non-zero entries be above the diagonal (assuming nodes are represented in genomic order). In a binary case, a 0 entry represents that no edge is exists from node i to node j, and a 1 entry represents an edge from i to j. One of skill in the art will appreciate that a matrix structure allows values other than 0 to 1 to be associated with an edge. For example, any entry may be a numerical value indicating a weight, or a number of times used, reflecting some natural quality of observed data in the world. A matrix can be written to a text file as a table or a linear series of rows (e.g., row 1 first, followed by a separator, etc.), thus providing a simple serialization structure.
Embodiments of the invention include storing a DAG in a language built with syntax for graphs. For example, The DOT Language provided with the graph visualization software packages known as Graphviz provides a data structure that can be used to store a DAG with auxiliary information and that can be converted into graphic file formats using a number of tools available from the Graphviz web site. Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has applications in networking, bioinformatics, software engineering, database and web design, machine learning, and in visual interfaces for other technical domains. The Graphviz layout programs take descriptions of graphs in a simple text language, and make diagrams in useful formats, such as images and SVG for web pages; PDF or Postscript for inclusion in other documents; or display in an interactive graph browser.
In related embodiments, a DAG is stored in a general markup language purposed for a graph, or others. Following the descriptions of a linear text file, or a comma-separated matrix, above, one of skill in the art will recognize that a language such as XML can be used (extended) to create labels (markup) defining nodes and their headers or IDs, edges, weights, etc. However a DAG is structured and stored, embodiments of the invention involve using nodes to represent features such as exons and introns. This provides a useful tool for analyzing sequence reads and discovering, identifying, and representing isoforms.
If nodes represent features or fragments of features, then isoforms can be represented by paths through those fragments. An exon's being “skipped” is represented by an edge connecting some previous exon to some later one. Presented herein are techniques for constructing a DAG to represent alternative splicing or isoforms of genes. Alternative splicing is discussed in Lee and Wang, 2005, Bioinformatics analysis of alternative splicing, Brief Bioinf 6(1):23-33; Heber, et al., 2002, Splicing graphs and EST assembly problems, Bioinformatics 18Suppl:s181-188; Leipzig, et al., 2004, The alternative splicing gallery (ASG): Bridging the gap between genome and transcriptome, Nucl Ac Res 23(13):3977-2983; and LeGault and Dewey, 2013, Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs, Bioinformatics 29(18):2300-2310, the contents of each of which are incorporated by reference. Additional discussion may be found in Florea, et al., 2005, Gene and alternative splicing annotation with AIR, Genome Research 15:54-66; Kim, et al., 2005, ECgene: Genome-based EST clustering and gene modeling for alternative splicing, Genome Research 15:566-576; and Xing, et al., 2006, An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs, Nucleic Acids Research, 34, 3150-3160, the contents of each of which are incorporated by reference.
Linearizing a DAG for a Linear Alignment Tool
As previously noted, a DAG is a multi-dimensional structure and not immediately amenable to alignment with a linear alignment tool. To use an optimal alignment algorithm or pattern matching algorithm, a reference DAG can be linearized into a format appropriate for the algorithm. In certain embodiments, this can be performed by enumerating each of the possible paths in the subsection of a reference DAG. FIG. 6 illustrates an exemplary DAG 601 representing a subsection of the reference DAG for a candidate mapping position (e.g., a local alignment position). To use a linear alignment algorithm with such a DAG, a depth-first search can be performed by starting at the first node and taking the left-most path. Once the end of the graph is reached, the search returns to the previous node with an untaken path and attempts to take the left-most path again. Each path is associated with a sequence represented by the nodes and edges in the path. The sequence read can then be compared against each of the paths. For example, a depth-first search would identify the following paths and associated sequences from the subsection of the DAG 661 shown in the embodiment of FIG. 6:
(SEQ ID NO: 1)

1. ATCGACGGCGTTTGCAT

(SEQ ID NO: 2)

2. ATCGACGGCGTTTAGCAT (insertion of A at pos. 13)

(SEQ ID NO: 3)

3. ATCGACGACGTTTGCAT (G−>A polymorphism at pos. 8)

(SEQ ID NO: 4)

4. ATCGACGACGTTTAGCAT (G−>A polymorphism at pos.

8 + insertion of A at pos. 13)

To perform a local alignment of the sequence read GTTTAG against this subsection of the reference DAG, these four sequences could be provided as individual reference sequences to a linear alignment or pattern matching algorithm. The linear alignment or pattern matching algorithm would subsequently identify a best-fit alignment for the sequence read with respect to the four sequences. In this way, systems and methods according to the disclosure may enumerate the possible paths for a reference DAG or subsection of a reference DAG, identify a sequence associated with each possible path, and provide each sequence to an alignment tool (such as those disclosed herein) for alignment with a sequence read.
In another embodiment, the subsection of the reference DAG could be linearized by walking the graph using a sliding window of n bases (wherein n is the length of the sequence read) and performing a comparison of the sequence read to the reference sequence in the window. (Of course, in certain embodiments, the size of the window can be increased to accommodate the possibility of small insertions and deletions in the sequence read.) The walk can similarly be performed using a depth-first search. For example, for the sequence read GTTTAG and subsection of the reference DAG 601 in the embodiment of FIG. 6, a local alignment could proceed as shown in Table 1.

TABLE 1

Depth-First Search Results using a sliding window of n bases and performing a
comparison of the sequence read to the reference sequence in the Window.

#	Path	Sequences Read	Comparison	Note

1	ATCGAC	GTTTAG	No Match	Start of Graph

2	TCGACG	GTTTAG	No Match	Includes “G” Polymorphism

3	CGACGG	GTTTAG	No Match	Includes “G” Polymorphism

4	GACGGC	GTTTAG	No Match	Includes “G” Polymorphism

5	ACGGCG	GTTTAG	No Match	Includes “G” Polymorphism

6	CGGCGT	GTTTAG	No Match	Includes “G” Polymorphism

7	GGCGTT	GTTTAG	No Match	Includes “G” Polymorphism

8	GCGTTT	GTTTAG	No Match	Includes “G” Polymorphism

9	CGTTTG	GTTTAG	No Match

10	GTTTGC	GTTTAG	No Match

11	TTTGCA	GTTTAG	No Match

12	TTGCAT	GTTTAG	No Match	End of Graph

13	CGTTTA	GTTTAG	No Match	Previous Unfollowed Path
				Includes “A” Insertion

14	GTTTAG	GTTTAG	Match	Includes “A” Insertion

15	TTTAGC	GTTTAG	No Match	Includes “A” Insertion

16	TTAGCA	GTTTAG	No Match	Includes “A” Insertion

17	TAGCAT	GTTTAG	No Match	Includes “A” Insertion
				End of Graph

18	CGACGA	GTTTAG	No Match	Previous Unfollowed Path
				Includes “A” Polymorphism

19	GACGAC	GTTTAG	No Match	Includes “A” Polymorphism

20	ACGACG	GTTTAG	No Match	Includes “A” Polymorphism

21	CGACGT	GTTTAG	No Match	Includes “A” Polymorphism

22	GACGTT	GTTTAG	No Match	Includes “A” Polymorphism

23	ACGTTT	GTTTAG	No Match	Includes “A” Polymorphism
				No remaining paths to follow

Once the walk has completed, the highest scoring matches can be identified to obtain the locally aligned position for the sequence read. As shown in Table 1, a match occurs at the 14^thcomparison. While a depth-first search is described for traversing the graph, various other searching algorithms or techniques may be used. For example, a breadth-first search could be substituted. Similarly, one could also employ an algorithm following the right-most path, or any previously unfollowed path. Various embodiments are considered to be within the scope of the disclosure.
As previously noted, pattern matching algorithms are particularly well suited for aligning sequence reads to DAGs. As known variations are already present in the graph, the pattern matching algorithm does not need to consider the presence of indels or other variations within a sequence read. Instead, the pattern matching algorithm simply considers each path available through the graph, such as in the manner described above. However, as noted above, while pattern matching algorithms are fast, the performance of such an algorithm (and other local alignment algorithms) can be affected based on the complexity of the DAG surrounding the local alignment position.
Considering DAG Complexity to Select a Local Alignment Tool
The complexity of a DAG at a given region indicates the amount of variation in the reference genome at that position. The complexity of a DAG can be determined in a variety of ways. For example, the complexity of a DAG can be the number of nodes, number of vertices, number of branches, number of possible paths, and the like. Complexity may be determined for a subsection or region of the DAG, or for the entire DAG.
For example, a DAG 701 shown in FIG. 7 includes information for a single nucleotide polymorphism (C or D) and an insertion (EG or EFG). There are a total of seven nodes and four paths through the DAG. In another example, a DAG 801 shown in FIG. 8 indicates a highly variable region of the genome that includes information regarding various structural variations, SNPs, and small insertions and deletions. In this DAG 801, there are 10 nodes and 21 possible paths.
If there are few paths (e.g., 10 or less, 9 or less, 8 or less, 7 or less, 6 or less, 5 or less, 4 or less, 3 or less 2 or less) through the DAG, then the DAG complexity is low. In these cases, a local linear sequence alignment or pattern matching algorithm may be faster and more efficient than a graph aware algorithm. In these cases, the additional number of paths that can be tested by the algorithm does not significantly impact the advantages in speed gained by using a pattern matching algorithm (e.g., by using a shift table to avoid unnecessary computations). Further, the required number of comparisons can be performed in less processing time than by generating a plurality of matrices for each node, as required by Graph Smith-Waterman. However, if the complexity is high (e.g., 10 or more paths, 10-15 paths, 15 or more paths, 15-20 paths, or 20 or more paths) then the number of paths that must be tested quickly increases. In these cases, Graph Smith-Waterman may be more efficient than a linear sequence alignment or pattern matching algorithm because it does not suffer from the combinatorial explosion problem associated with linearizing complex DAGs.
The relationship between the alignment processing time and local DAG complexity is shown in FIG. 9 as a graph 900. Graph 900 illustrates the relationship between processing time and DAG complexity for a pattern matching algorithm (such as Tarhio-Ukkonen) and a graph aware algorithm (such as Graph Smith-Waterman). As shown, at a level of complexity C, performance of Tarhio-Ukkonen and Graph Smith-Waterman is equal and either algorithm may be used without a deviation in performance. Above C, Tarhio-Ukkonen becomes more inefficient due to the large number of paths that must be individually tested, and Graph Smith-Waterman may be preferred. In contrast, below C, Tarhio-Ukkonen is more efficient and requires less processing time.
Even though performance of either algorithm may be comparable at a level of complexity C, other considerations may dictate usage of either algorithm depending on the circumstances. For example, in certain instances, switching local alignment algorithms at a level of complexity C′ may be preferable instead of C. As discussed elsewhere, Graph Smith-Waterman is an exact alignment algorithm and thus able to identify gapped alignments between a sequence read and a reference. Thus, even if an approximate pattern-matching algorithm would execute in less time, it may still be preferable to choose Graph Smith-Waterman because the resulting alignment can be superior. In these cases, the ideal level of complexity at which the method chooses either algorithm is based on the acceptable timeliness of performance. As shown in FIG. 9, above C, Graph Smith-Waterman is may be preferable. However, below C, one may also decide to use Graph Smith-Waterman. The consideration is the tradeoff between processing time and improved alignments. Given the scales typically associated with next generation sequencing (e.g., to cover the human genome at 30× coverage, approximately 600 million paired-end 100 bp sequence reads are required), a processing method may account for a level of complexity in which the processing time is reasonable, yet the alignments are able to yield useful information regarding unknown variants.
Price may also be a consideration. For example, cloud computing services such as Amazon AWS and Microsoft Azure offer cloud-based computational resources at a cost. When using these services to align next generation sequence reads against DAGs, one may decide to switch to Tarhio-Ukkonen only when its performance is less than Graph Smith-Waterman in order to reduce processing time and increase cost efficiency.
In certain embodiments, a sequence read could be realigned depending on the resulting quality of the local alignment. For example, if a sequence read has an unknown variation (such as an insertion or deletion), and the complexity of a DAG at a candidate mapping position is low and a pattern matching algorithm is used, the resulting alignment may have low quality. This is because pattern matching algorithms, while fast, may not gracefully handle unknown variations. In these cases, an optimal alignment algorithm (such as Smith Waterman) could be subsequently performed to accurately identify the variation and improve the local alignment. Similarly, in high entropy or high variability areas of the genome, an optimal alignment algorithm could be substituted for a pattern matching algorithm to improve the quality of the alignment. The quality of an alignment may be expressed according to the Phred scale, for example.
Additionally, in certain embodiments, more than two alignment algorithms may be selected based on the complexity. For example, one may consider using a pattern matching algorithm, an optimal alignment algorithm, and a graph-aware algorithm as complexity increases. In this way, systems and methods according to the invention may select from two, three, or more local alignment algorithms to significantly improve the performance of next-generation sequencing alignment using graph data structures.
Exemplary System
FIG. 10 illustrates a computer system 901 suitable for performing methods of the invention. The computer system 901 includes at least one computer 633. Optionally, the computer system 901 may further include one or more of a server computer 909 and a sequencer 955, which may be coupled to a sequencer computer 951. Each computer in the computer system 901 includes a processor 935, 936, or 937 coupled to a memory device and at least one input/output device. Thus the computer system 901 includes at least one processor 935 coupled to a memory subsystem 975 (e.g., a memory device or collection of memory devices). Using those mechanical components, the computer system 901 is operable to obtain a sequence generated by sequencing nucleic acid from a genome of a patient. The system uses the processor to transform the sequence read 103 information into the reference DAG 331.
Processor refers to any device or system of devices that performs processing operations. A processor will generally include a chip, such as a single core or multi-core chip, to provide a central processing unit (CPU). A processor may be provided by a chip from Intel or AMD. A processor may be any suitable processor such as the microprocessor sold under the trademark XEON E7 by Intel (Santa Clara, Calif.) or the microprocessor sold under the trademark OPTERON 6200 by AMD (Sunnyvale, Calif.).
The memory subsystem 975 contains one or any combination of memory devices. A memory device is a mechanical device that stores data or instructions in a machine-readable format. Memory may include one or more sets of instructions (e.g., software) which, when executed by one or more of the processors of the disclosed computers can accomplish some or all of the methods or functions described herein. Preferably, each computer includes a non-transitory memory device such as a solid state drive, flash drive, disk drive, hard drive, subscriber identity module (SIM) card, secure digital card (SD card), micro SD card, or solid-state drive (SSD), optical and magnetic media, others, or a combination thereof.
Using the described components, the computer system 901 is operable to produce a report and provide the report to a user via an input/output device. An input/output device is a mechanism or system for transferring data into or out of a computer. Exemplary input/output devices include a video display unit (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), a printer, an alphanumeric input device (e.g., a keyboard), a cursor control device (e.g., a mouse), a disk drive unit, a speaker, a touchscreen, an accelerometer, a microphone, a cellular radio frequency antenna, and a network interface device, which can be, for example, a network interface card (NIC), Wi-Fi card, or cellular modem.
Preferably the reference DAG is stored in the memory subsystem using adjacency techniques, which may include pointers to identify a physical location in the memory subsystem 675 where each vertex is stored. In a preferred embodiment, the graph is stored in the memory subsystem 675 using adjacency lists. In some embodiments, there is an adjacency list for each vertex. For discussion of implementations see ‘Chapter 4, Graphs’ at pages 515-693 of Sedgewick and Wayne, 2011, Algorithms, 4th Ed., Pearson Education, Inc., Upper Saddle River N.J., 955 pages, the contents of which are incorporated by reference and within which pages 524-527 illustrate adjacency lists.

INCORPORATION BY REFERENCE

References and citations to other documents, such as patents, patent applications, patent publications, journals, books, papers, web contents, have been made throughout this disclosure. All such documents are hereby incorporated herein by reference in their entirety for all purposes.

EQUIVALENTS

Various modifications of the invention and many further embodiments thereof, in addition to those shown and described herein, will become apparent to those skilled in the art from the full contents of this document, including references to the scientific and patent literature cited herein. The subject matter herein contains important information, exemplification and guidance that can be adapted to the practice of this invention in its various embodiments and equivalents thereof.
Other implementations are within the scope of the following claims.

Claims

What is claimed is:

1. A method, comprising:

obtaining a sequence read including genetic information;

identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes;

determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic;

performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic, wherein:

the advanced local alignment includes a first-local-alignment algorithm, and

the basic local alignment includes a second-local-alignment algorithm; and

based on the local alignments, identifying an optimal mapped position of the sequence read within the reference genome.

2. The method of claim 1, wherein the first-local-alignment algorithm is different from the second-local-alignment algorithm.

3. The method of claim 1, further comprising determining whether the local alignment is advanced or basic based on at least one of: a length of the graph, a variability of the graph, a total processing time, a remaining processing time, and a number of repeating elements.

4. The method of claim 1, further comprising determining whether the local alignment is advanced or basic based, at least in part, on a complexity of the graph.

5. The method of claim 4, further comprising determining whether the local alignment is advanced or basic based, at least in part, on a complexity of a subset of the graph surrounding each candidate mapping position.

6. The method of claim 5, wherein the local alignment is advanced if the complexity of a subset of the graph surrounding a candidate mapping position is 10 or more nodes.

7. The method of claim 5, wherein the local alignment is basic if the complexity of a subset of the graph surrounding a candidate mapping position is 5 or fewer nodes.

8. The method of claim 1, wherein the second-local-alignment algorithm comprises a pattern matching algorithm.

9. The method of claim 8, wherein the pattern matching algorithm is selected from the group consisting of: a Boyer-Moore algorithm, a Horspool algorithm, and a Tarhio-Ukkonen algorithm.

10. The method of claim 1, wherein performing a basic local alignment comprises:

linearizing a subset of the graph surrounding each candidate mapping position into a plurality of linear sequences, and

performing a basic local alignment of the sequence read against each of the plurality of linear sequences using the second-local-alignment algorithm.

11. The method of claim 10, wherein linearizing a subset of the graph surrounding each candidate mapping position into a plurality of linear sequences comprises enumerating the number of unique paths through the subset of the graph, and associating a linear sequence with each enumerated path.

12. The method of claim 10, wherein linearizing a subset of the graph surrounding each candidate mapping position into a plurality of linear sequences comprises performing a depth first search of the subset of the graph.

13. The method of claim 1, wherein the first-local-alignment algorithm comprises a graph aware algorithm.

14. The method of claim 13, wherein the graph aware algorithm is a modified Smith Waterman algorithm.

15. The method of claim 1, further comprising ranking each of the candidate mapping positions based on a quality of the local alignment.

16. The method of claim 15, further comprising re-aligning the sequence read using a third-local-alignment algorithm if the quality of the highest ranking local alignment is low.

17. A system for determining a subject's genetic information, the system comprising:

a computer system comprising a processor coupled to memory and operable to:

receive identities of a plurality of nucleotides at known locations on a reference genome;

receive sequence reads from a sample from a subject; and

map the sequence reads to the reference genome, thereby identifying a corresponding location on the reference genome, the mapping comprising:

determining, by means of a computer system, whether an alignment with the graph surrounding each of the identified plurality of candidate mapping positions is advanced or basic;

the advanced local alignment includes a first-local-alignment algorithm, and

the basic local alignment includes a second-local-alignment algorithm; and

based on the local alignments, identify the mapped position of the sequence read within the reference genome.

18. The system of claim 17, wherein the computer system is further operable to determine whether the local alignment is advanced or basic based, at least in part, on a complexity of a subset of the graph surrounding each candidate mapping position.

19. The system of claim 17, wherein the first-local-alignment algorithm comprises a graph aware alignment algorithm, and the second-local-alignment algorithm comprises a linear alignment algorithm.

20. The system of claim 17, wherein performing a basic local alignment comprises: