US20170199960A1 - Systems and methods for adaptive local alignment for graph genomes - Google Patents

Systems and methods for adaptive local alignment for graph genomes Download PDF

Info

Publication number
US20170199960A1
US20170199960A1 US14/990,323 US201614990323A US2017199960A1 US 20170199960 A1 US20170199960 A1 US 20170199960A1 US 201614990323 A US201614990323 A US 201614990323A US 2017199960 A1 US2017199960 A1 US 2017199960A1
Authority
US
United States
Prior art keywords
alignment
local
graph
algorithm
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/990,323
Inventor
Kaushik Ghose
Wan-Ping Lee
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seven Bridges Genomics Inc
Original Assignee
Seven Bridges Genomics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US14/990,323 priority Critical patent/US20170199960A1/en
Application filed by Seven Bridges Genomics Inc filed Critical Seven Bridges Genomics Inc
Assigned to VENTURE LENDING & LEASING VII, INC. reassignment VENTURE LENDING & LEASING VII, INC. SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SEVEN BRIDGES GENOMICS INC.
Assigned to SEVEN BRIDGES GENOMICS INC. reassignment SEVEN BRIDGES GENOMICS INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: GHOSE, Kaushik, LEE, WAN-PING
Priority to PCT/US2017/012015 priority patent/WO2017120128A1/en
Publication of US20170199960A1 publication Critical patent/US20170199960A1/en
Assigned to SEVEN BRIDGES GENOMICS INC., SEVEN BRIDGES GENOMICS D.O.O., SEVEN BRIDGES GENOMICS UK LTD. reassignment SEVEN BRIDGES GENOMICS INC. TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS Assignors: VENTURE LENDING & LEASING VII, INC.
Assigned to BROWN RUDNICK reassignment BROWN RUDNICK NOTICE OF ATTORNEY'S LIEN Assignors: SEVEN BRIDGES GENOMICS INC.
Assigned to MJOLK HOLDING BV reassignment MJOLK HOLDING BV SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SEVEN BRIDGES GENOMICS INC.
Assigned to SEVEN BRIDGES GENOMICS INC. reassignment SEVEN BRIDGES GENOMICS INC. RELEASE BY SECURED PARTY (SEE DOCUMENT FOR DETAILS). Assignors: MJOLK HOLDING BV
Assigned to SEVEN BRIDGES GENOMICS INC. reassignment SEVEN BRIDGES GENOMICS INC. TERMINATION AND RELEASE OF NOTICE OF ATTORNEY'S LIEN Assignors: BROWN RUDNICK LLP
Priority to US16/663,243 priority patent/US11810648B2/en
Priority to US18/464,597 priority patent/US20240096450A1/en
Abandoned legal-status Critical Current

Links

Images

Classifications

    • G06F19/22
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search

Definitions

  • the invention relates to systems and methods for adaptive local alignment for graph genomes.
  • a person's genetic information has the potential to reveal much about their health and life.
  • a risk of cancer or a genetic disease may be revealed by the sequences of the person's genes, as well as the possibility that his or her children could inherit a genetic disorder.
  • Genetic information can also be used to identify an unknown organism, such as potentially infectious agents discovered in samples from public food or water supplies.
  • Next-generation sequencing (NGS) technologies are available that can sequence entire genomes quickly.
  • the described methods and systems provide a method for automatically analyzing sequence data using a graph, such as a reference directed acyclic graph (DAG), and accounting for the various difficulties (e.g., complexities) of sequence alignment and genomic variation without requiring intermediate input or guidance from a user.
  • a graph such as a reference directed acyclic graph (DAG)
  • DAG reference directed acyclic graph
  • the described methods and systems simplify and optimize the process for sequence alignment between a sample sequence and a reference genome because the systems and methods function autonomously from the user (after the sequence information is received).
  • a processor can, for example, identify and align the sequence information in a manner that meets a user's requirements for accuracy and expediency.
  • the sequence information can be directly transmitted to the systems described herein, which helps to provide a vastly simplified sequence alignment experience for the user.
  • DAGs directed acyclic graphs
  • SNPs single nucleotide polymorphisms
  • Indels small insertions and deletions
  • SVs structural variants
  • SNPs single nucleotide polymorphisms
  • indels small insertions and deletions
  • SVs structural variants
  • SNPs single nucleotide polymorphisms
  • indels small insertions and deletions
  • SVs structural variants
  • Each of these variations can be represented by following different branches through the DAG.
  • using DAGs can help to increase the probability of identifying small variations near large known structural variants because the presence of the known structural variants in the graph improves the quality of the alignment.
  • known variations can be reliably accounted for and identified by aligning reads containing the known variations to a sequence path that includes those variations.
  • the methods and systems described throughout this document can automatically and efficiently analyze sequence data using the reference DAG even when doing so would conventionally be undesirable (e.g., computationally expensive, time consuming, imprecise, and/or cost prohibitive).
  • the systems and methods can analyze the difficulty of aligning the sequence data with the reference DAG based on a predicted alignment difficulty (e.g., due to the DAG complexity, sequence length, total processing time, remaining processing time, or any combination thereof) between the reference DAG and the sequence data. By choosing and employing a selected alignment algorithm based on this predicted alignment difficulty, the methods and systems can efficiently analyze sequence data.
  • a method includes identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph including nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic.
  • the advanced local alignment includes a first-local-alignment algorithm
  • the basic local alignment includes a second-local-alignment algorithm.
  • the method further includes identifying the mapped position of the sequence read within the reference genome.
  • the system for determining a subject's genetic information includes a computer system comprising a processor coupled to memory and operable to: receive identities of a plurality of nucleotides at known locations on a reference genome; receive sequence reads from a sample from a subject; and map the sequence reads to the reference genome, thereby identifying a corresponding location on the reference genome.
  • the mapping includes identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic.
  • the advanced local alignment includes a first-local-alignment algorithm
  • the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the system identifies the mapped position of the sequence read within the reference genome.
  • Implementations can include one or more of the following features.
  • the method further includes performing, by means of the computer system, a global alignment for the sequence read based on whether the global alignment is advanced or basic.
  • the advanced global alignment includes a first-global-alignment algorithm
  • the basic global alignment includes alignment includes a second-global-alignment algorithm.
  • the global alignment provides information indicative of the plurality of candidate mapping positions.
  • the first-local-alignment algorithm is different from the second-local-alignment algorithm.
  • the first-global-alignment algorithm is different from the second-global-alignment algorithm.
  • At least one of the first-local-alignment algorithm and the second-local-alignment algorithm is different from at least one of the first-global-alignment algorithm and the second-global-alignment algorithm.
  • the method further includes determining whether the local alignment is advanced or basic based at least in part on at least one of: the complexity of the graph, a length of the graph, a total processing time, a remaining processing time, and a number of repeating elements.
  • a complex graph includes 10 or more nodes.
  • a simple graph includes 5 or fewer nodes.
  • the first-local-alignment algorithm includes a pattern matching algorithm.
  • the pattern matching algorithm includes at least one of the following: a Boyer-Moore algorithm, a Horspool algorithm, and a Tarhio-Ukkonen algorithm.
  • the second-local-alignment algorithm includes a linear alignment algorithm.
  • each node and edge stores a list of one or more adjacent objects, and wherein identifying the plurality of candidate mapping positions comprises finding related alignments between the sequence read and the reference genome.
  • candidate mapping positions relate to the genetic information if a predetermined alignment quality is met between the candidate mapping position and the sequence read.
  • the method further includes ranking each of the candidate mapping positions based on the quality of the local alignment.
  • each of the plurality of candidate mapping positions defines a selected path through the graph.
  • the processor is further operable to: perform a global alignment, by means of the computer system, for the sequence read based on whether the global alignment is advanced or basic.
  • the advanced global alignment includes a first-global-alignment algorithm
  • the basic global alignment includes a second-global-alignment algorithm.
  • the global alignment provides information indicative of the plurality of candidate mapping positions.
  • each node and edge stores a list of one or more adjacent objects, and wherein identifying the plurality of candidate mapping positions comprises finding related alignments between the sequence read and the candidate mapping position.
  • the complexity of the graph varies with a remaining processing time estimate to reduce a total processing time.
  • the first-local-alignment algorithm is different from the second-local-alignment algorithm.
  • the first-global-alignment algorithm is different from the second-global-alignment algorithm.
  • At least one of the first-local-alignment algorithm and the second-local-alignment algorithm is different from at least one of the first-global-alignment algorithm and the second-global-alignment algorithm.
  • the mapping further includes determining whether the local alignment is advanced or basic based at least in part on at least one of: the complexity of the graph, a length of the graph, a total processing time, a remaining processing time, and a number of repeating elements.
  • a complex graph includes 10 or more nodes.
  • a simple graph includes 5 or fewer nodes.
  • the first-local-alignment algorithm includes a pattern matching algorithm.
  • the pattern matching algorithm includes at least one of the following: a Boyer-Moore algorithm and a Tarhio-Ukkonen algorithm.
  • the second-local-alignment algorithm includes a linear alignment algorithm.
  • the candidate mapping positions relate to the genetic information if a predetermined alignment quality is met between the candidate mapping position and the sequence read.
  • the mapping further includes ranking each of the candidate mapping positions based on the quality of the local alignment.
  • each of the plurality of candidate mapping positions defines a selected path through the graph.
  • Implementations can include one or more of the following advantages.
  • a linear alignment method can be used to save resources (e.g., time and computational costs).
  • resources e.g., time and computational costs.
  • other sequence alignment methods such as the modified Graph Smith-Waterman algorithm may be an alignment tool well-suited to the task despite the high initial computational costs.
  • the predicted alignment difficulty can vary during the alignment process.
  • the alignment techniques described herein can continually assess the alignment difficulty and dynamically adapt the alignment tools in response to any changes. For example, two or more alignment tools can be used during the same alignment process.
  • FIG. 1 diagrams a method for aligning sequence reads to a reference genome.
  • FIG. 2 diagrams a method for performing a global alignment of a sequence read to a reference genome.
  • FIG. 3 shows the matrices that represent the comparison.
  • FIG. 4 illustrates finding alignments between a sequence and a reference DAG.
  • FIG. 5 shows the use of an adjacency list for each vertex and edge in a DAG.
  • FIG. 6 shows an exemplary DAG and sequence read.
  • FIG. 7 shows an exemplary DAG representing a single nucleotide polymorphism (C or D) and an insertion (EG or EFG).
  • FIG. 8 shows an exemplary DAG representing a highly variable region of the genome that includes information regarding various structural variations, SNPs, and small insertions and deletions.
  • FIG. 9 illustrates the relationship between processing time and DAG complexity for a pattern matching algorithm (such as Tarhio-Ukkonen) and a graph aware alignment algorithm (such as a Graph Smith-Waterman alignment method).
  • a pattern matching algorithm such as Tarhio-Ukkonen
  • a graph aware alignment algorithm such as a Graph Smith-Waterman alignment method
  • FIG. 10 illustrates a system for implementing the methods diagramed in FIGS. 1 and 2 .
  • the methods and systems described herein relate to determining a subject's genetic information.
  • Information representing at least a portion of the subject's DNA e.g., a sequence read
  • a graph such as a reference genomic directed acyclic graph (DAG)
  • DAG genomic directed acyclic graph
  • the genomic information can then be compared against the candidate paths with a higher degree of accuracy as compared to the original global alignment.
  • This two-step alignment analysis allows for the selection of an appropriate alignment algorithm based on the alignment difficulty and an acceptable alignment tolerance, thereby reaping considerable computational savings resources because resources are not needlessly expended.
  • the present disclosure recognizes, in certain embodiments, that the problem of combinatorial explosion associated with aligning sequence data to genomic DAGs is solved by an alignment method that analyzes the complexity of a DAG and chooses an alignment algorithm based on the complexity. For example, sequence reads can be quickly aligned to portions of a DAG that have low complexity using a pattern matching algorithm or a linear alignment algorithm. However, for complex portions of the DAG, the number of paths that must be analyzed by a pattern matching or linear alignment algorithm grows exponentially, leading to a corresponding exponential decrease in performance of the pattern matching or linear alignment tool. In these cases, a graph-aware algorithm may be substituted. The graph-aware algorithm may have a high initial overhead cost, but its decrease in performance scales linearly as the complexity of the graph increases, making it a preferred choice for complex regions of a DAG.
  • the following describes methods for indexing collections of data strings represented in a form of DAGs and performing efficient string search including exact and approximate matching techniques.
  • these methods can be used to quickly and efficiently find genomic data sequences (e.g., reads produced by DNA sequencing machines) in genome graphs representing genome data variations.
  • FIG. 1 diagrams a method 101 for analyzing genomic sequence information.
  • sequence reads from a sample from a subject are received.
  • Sequence reads can be received from a nucleic acid sequencing instrument, such as an Illumina® HiSeq2500, a Roche 454 GS FLX+, an Ion Torrent PGMTM, and the like.
  • a processor coupled to the tangible memory device is used to globally align 104 the sequence read against the reference genome to identify 106 one or more candidate mapping positions. These candidate mapping positions represent acceptable alignments between the received sequence reads and the paths through the reference DAG (e.g., a location within the reference genome that could be related to the sequence read).
  • a report can be provided that identifies a one or more of the candidate mapping positions.
  • Next-generation sequencing and alignment can be computationally expensive. For example, over 900 million paired-end 100 bp sequence reads are needed to obtain 30 ⁇ coverage of the human genome.
  • certain algorithms such as various pattern matching and linear alignment algorithms, scan each sequence read over a reference sequence to identify an exact, optimal, or best-fit match. But this approach is typically computationally infeasible given the scales associated with next-generation sequencing.
  • a heuristic approach may be used to reduce the total number of computations required. For example, one approach is to separate an alignment into “coarse” and “fine” stages.
  • each candidate region is subsequently analyzed using a more computationally expensive, yet exact algorithm.
  • This is referred to as a local alignment or local search because the alignment is restricted to the local space identified during the coarse stage. In certain examples, this may be referred to as a “seed and extend” strategy; the global alignment yields a number of seed positions, which are then extended during the local alignment stage. The candidate positions may then be scored and ranked, yielding the best aligned position for each sequence read.
  • one approach for performing a global alignment 104 is to place many small sections, or k-mers, of the reference DAG into a computer hash function.
  • the hash function calculates an index as a function of the k-mer.
  • the index identifies an entry in a hash table, and the positions of the k-mer within the reference DAG are stored in that entry, i.e., in the entry indexed by the hash of the k-mer.
  • a sequence read is analyzed by hashing some or all of its k-mers and the corresponding entries of the DAG hash table are accessed to read positions within the DAG where those sequence read k-mers can be found.
  • Sections of the DAG where a threshold number of k-mer positions are found are identified as candidate regions within the represented genomes for alignment or mapping of the sequence read.
  • a sequence read can be mapped to a large number of “good fit” positions within a reference genome very rapidly.
  • a sequence read can be quickly mapped to a reference using a global alignment algorithm because hash values can be retrieved rapidly (i.e., without iterating over entries in an array). Since a DAG can store a great amount of genetic variation, a sequence read can be mapped to an appropriate portion of one or more certain genomes from among numerous candidate genomes very rapidly. Since sequence reads can be mapped to biologically relevant reference genomes very rapidly, a patient's genetic information can be discerned, or genetic sequences can be identified quickly even against a background of all genetic variety represented in the system. For example, aligning a patient's sequence reads to a DAG representing variations in cancer genomes (such as those from The Cancer Genome Atlas) can be used to quickly identify particular mutations associated with cancer present in the patient.
  • a subsection of the graph at each candidate mapping position is identified.
  • the subsection of the graph includes at least the number of base pairs in the sequence read for searching (e.g., 50, 100, 200 bp).
  • longer subsections can be preferable for local search.
  • Using longer subsections is particularly advantageous when using paired-end or mate-pair sequence reads, which are pairs of sequence reads describing the 5′ and 3′ ends of a single DNA molecule, often with an inferred insert size distance of 300-600 (paired-end) or 2,000-5,000 (mate pair) base pairs between the 5′ and 3′ sequence reads.
  • the portions or subsections of the graph for each of the candidate mapping positions can be a variety of sizes ranging from 50 to even 10,000 base pairs.
  • the processor can then perform 108 a local alignment for each of the candidate mapping positions.
  • a processor can perform 108 a local alignment at each of the candidate mapping positions to narrow the candidate mapping positions to identify one or more mapped positions (e.g., “better fit” or “best fit” positions).
  • a processor can start 110 the local alignment process and then determine 112 whether an alignment with the reference DAG surrounding the candidate mapping positions (e.g., the subset of the reference DAG surrounding a candidate mapping position) is advanced or basic. In some examples, an alignment is advanced if the complexity of the subset of the reference DAG exceeds 5 paths.
  • the alignment is considered basic if the complexity of the subset of the reference DAG is 5 paths or less. If the alignment is advanced 114 , then the processor can use a first alignment tool implementing a first method 116 . If the alignment is not advanced (i.e., basic), then the processor can use a second alignment tool implementing a second method 118 . The processor can repeat 120 this process until all candidate mapping positions are locally aligned. The processor can then rank 122 the candidate mapping positions to identify the most likely alignment positions corresponding to the received sequence read before ending 124 the local alignment process. Based on these rankings, the processor can identify 126 the mapped position within an acceptable margin of error.
  • a global alignment may be simply performed against a linear reference sequence. This can be advantageous in certain embodiments where using a linear reference sequence for the global alignment can help improve performance.
  • the local alignment 108 may still consider a reference DAG structure.
  • Various embodiments are considered to be within the scope of the disclosure.
  • a global alignment may also consider whether an alignment with the DAG is advanced or basic and select an appropriate algorithm accordingly.
  • FIG. 2 illustrates an exemplary global alignment process 201 for identifying candidate mapping positions for a sequence read.
  • the global alignment process 201 can evaluate the reference DAG to determine if the global alignment is advanced or basic. For example, this may be performed in situations wherein the reference DAG is small enough such that pattern matching and linear alignment algorithms may be used for the global alignment stage.
  • a processor can start 202 the local alignment process and then determine 204 whether an alignment with the reference DAG and the sequence read is advanced or basic. In some examples, an alignment is advanced if the complexity of the DAG exceeds 6 paths, as discussed below.
  • the alignment is basic if the complexity of the DAG is 6 paths or less. If the alignment is advanced 206 , then the processor can use a first alignment tool implementing a first method 208 . If the alignment is not advanced (i.e., basic), then the processor can use a second alignment tool implementing a second method 210 . The first method 208 and the second method 210 may be the same, different, or partially overlap. In some examples, the first method 208 and/or the second method 210 can correspond to at least one of the first method 116 or the second method 118 . Using these alignment tools, the processor can identify 212 the candidate mapping positions. The processor can then repeat this process until the global alignment is complete 214 . The processor can then end 216 the global alignment process 201 .
  • the second alignment tool may comprise linear alignment methods, such as optimal alignment or pattern matching algorithms.
  • the second alignment tool may comprise an optimal alignment algorithm, such as a Smith-Waterman algorithm, a Needleman-Wunsch algorithm, and the like.
  • an optimal alignment algorithm such as a Smith-Waterman algorithm, a Needleman-Wunsch algorithm, and the like.
  • pattern matching or string matching algorithms such as a Boyer-Moore algorithm, a Horspool algorithm, and a Tarhio-Ukkonen algorithm may be used. These algorithms are fast, but typically do not handle variations (such as indels and SNPs) as well as an optimal alignment algorithm.
  • pattern matching and string matching algorithms are particularly well suited for aligning sequence reads against graph genomes, such as DAGs.
  • the pattern matching algorithm does not need to consider the presence of indels or other variations within a sequence read; it need only map the sequence read to the most likely branch within the DAG having that variation.
  • the first alignment tool may comprise a graph-aware or multi-dimensional algorithm, such as Graph Smith-Waterman, as explained in further detail below.
  • Pairwise alignment generally involves placing one sequence along part of target, introducing gaps according to an algorithm, scoring how well the two sequences match, and preferably repeating for various positions along the reference. The best-scoring match is deemed to be the alignment and represents an inference of homology between alignment portions of the sequences.
  • scoring an alignment of a pair of nucleic acid sequences involves setting values for the probabilities of substitutions and indels. When individual bases are aligned, a match or mismatch contributes to the alignment score by a substitution score, which could be, for example, 1 for a match and ⁇ 0.33 for a mismatch. An indel deducts from an alignment score by a gap penalty, which could be, for example, ⁇ 1.
  • Gap penalties and substitution probabilities can be based on empirical knowledge or a priori assumptions about how sequences evolve. Their values affect the resulting alignment. Particularly, the relationship between the gap penalties and substitution probabilities influences whether substitutions or indels will be favored in the resulting alignment.
  • an alignment represents an inferred relationship between two sequences, x and y.
  • an alignment A of sequences x and y maps x and y respectively to another two strings x′ and y′ that may contain spaces such that: (i)
  • a gap is a maximal substring of contiguous spaces in either x′ or y′.
  • a mismatched pair generally has a negative score b and a gap of length r also has a negative score g+rs where g, s ⁇ 0.
  • a scoring scheme e.g. used by BLAST
  • the score of the alignment A is the sum of the scores for all matched pairs, mismatched pairs and gaps.
  • the alignment score of x and y can be defined as the maximum score among all possible alignments of x and y.
  • Any pair may have a score a defined by a 4 ⁇ 4 matrix B of substitution probabilities.
  • a pairwise alignment generally, involves—for sequence Q (query) having m characters and a reference genome T (target) of n characters—finding and evaluating possible local alignments between Q and T. For any 1 ⁇ i ⁇ n and 1 ⁇ j ⁇ m, the largest possible alignment score of T[h . . . i] and Q[k . . . j], where h ⁇ i and k ⁇ j, is computed (i.e. the best alignment score of any substring of T ending at position i and any substring of Q ending at position j). This can include examining all substrings with cm characters, where c is a constant depending on a similarity model, and aligning each substring separately with Q.
  • the original Smith-Waterman (SW) algorithm is a linear alignment algorithm that aligns linear sequences by rewarding overlap between bases in the sequences, and penalizing gaps between the sequences.
  • Smith-Waterman also differs from Needleman-Wunsch, in that SW does not require the shorter sequence to span the string of letters describing the longer sequence. That is, SW does not assume that one sequence is a read of the entirety of the other sequence.
  • SW is not obligated to find an alignment that stretches across the entire length of the strings, a local alignment can begin and end anywhere within the two sequences.
  • H _ ij max ⁇ H _( i ⁇ 1, j ⁇ 1)+ s ( a _ i,b _ j ), H _( i ⁇ 1, j ) ⁇ W _in, H _( i,j ⁇ 1) ⁇ W _del,0 ⁇ (1)
  • the resulting matrix has many elements that are zero. This representation makes it easier to backtrace from high-to-low, right-to-left in the matrix, thus identifying the alignment.
  • the SW algorithm performs a backtrack to determine the alignment. Starting with the maximum value in the matrix, the algorithm will backtrack based on which of the three values (Hi ⁇ 1,j ⁇ 1, Hi ⁇ 1,j, or Hi,j ⁇ 1) was used to compute the final maximum value for each cell. The backtracking stops when a zero is reached.
  • the optimal-scoring alignment may contain greater than the minimum possible number of insertions and deletions, while containing far fewer than the maximum possible number of substitutions.
  • SW or SW-Gotoh may be implemented using dynamic programming to perform local sequence alignment of the two strings, S and A, of sizes m and n, respectively.
  • This dynamic programming employs tables or matrices to preserve match scores and avoid re-computation for successive cells.
  • the optimum alignment can be represented as B[j,k] in equation (2) below:
  • i[j,k ] max( p[j ⁇ 1, k ]+OPENING_PEN, i[j ⁇ 1, k],d[j ⁇ 1, k ]+OPENING_PEN)+INSERTION_PEN (4)
  • d[j,k ] max( p[j,k ⁇ 1]+OPENING_PEN, i[j,k ⁇ 1]+OPENING_PEN, d[j,k ⁇ 1])+DELETION_PEN (5)
  • the scoring parameters are somewhat arbitrary, and can be adjusted to achieve the behavior of the computations.
  • One example of the scoring parameter settings (Huang, Chapter 3: Bio-Sequence Comparison and Alignment, ser. Curr Top Comp Mol Biol. Cambridge, Mass.: The MIT Press, 2002) for DNA would be:
  • MISMATCH_PEN ⁇ 20
  • OPENING_PEN ⁇ 10
  • the relationship between the gap penalties (INSERTION_PEN, OPENING_PEN) above help limit the number of gap openings, i.e., favor grouping gaps together, by setting the gap insertion penalty higher than the gap opening cost.
  • MISMATCH_PEN, MATCH_BONUS, INSERTION_PEN, OPENING_PEN and DELETION_PEN are possible.
  • Smith-Waterman and Needleman-Wunsch are exact alignment algorithms and identify optimal alignments, including those with short insertions and deletions. However, it can be as computationally expensive as na ⁇ ve pattern matching algorithms, and has a processing time typically on the order of O(mn). As a result, the use of the Smith-Waterman algorithm is primarily reserved for local alignment. However, when applied to a multi-dimensional data structure such as a reference DAG, these algorithms can take an excessive amount of time to complete. In these cases, more efficient pattern matching algorithms may be substituted.
  • Pattern matching algorithms use mathematical techniques to reduce the total number of pairwise comparisons in order to improve efficiency.
  • a short read P having n nucleotides is aligned with the left-most portion of a reference sequence T, creating a window of length n. After a match or mismatch determination, the window is shifted to the right for additional comparisons.
  • Pattern matching algorithms increase local alignment efficiency by shifting the window farther to the right that a single character, i.e., by skipping a certain number of comparisons which would result in a mismatch.
  • some pattern matching algorithms process a sequence read, or pattern, in order to create a “shift table” that indicates how many comparisons can safely be skipped depending on the mapping of the short read at the current position.
  • a pre-processing phase typically processes the pattern itself to generate information that can aid the search phase by providing a table of shift values after each comparison. Improvement in efficiency of a search can be achieved by altering the order in which the characters are compared at each attempt, and by choosing a shift value that permits the skipping of a predefined number of characters in the text after each attempt.
  • the search phase then uses this information to ignore certain character comparisons, reducing the total number of comparisons and thus the overall execution time.
  • Boyer-Moore preprocesses a pattern to create two tables: a Boyer-Moore bad character (bmBc) table and a Boyer-Moore good suffix (bmGs) table.
  • bmBc Boyer-Moore bad character
  • bmGs Boyer-Moore good suffix
  • the bad character table stores a shift value based on the occurrence of the character in the pattern (i.e., short read).
  • the good-suffix table stores the matching shift value for each character in the read. The maximum of the shift value in either table is considered after each comparison during a search, which is then used to advance the window.
  • Boyer-Moore does not allow any mismatches between the read and the reference sequence, and thus is a perfect-match search algorithm.
  • Boyer-Moore serves as the basis for several pattern matching algorithms that can be used for sequence alignment.
  • One variation of Boyer-Moore is the Horspool algorithm. See Horspool, R. N., “Practical Fast Searching in Strings”, Software—Practice & Experience, 10, 501-506. Horspool simplifies Boyer-Moore by recognizing that the bad-character shift of the right-most character of the window can be sufficient to compute the value of the shift. These shift values are then determined in the pre-processing stage for all of the characters in the alphabet, i.e., the possible nucleotides. Horspool is particularly efficient in practical situations when the length of the pattern is small, such as for sequence reads generated by next-generation sequencing. Like Boyer-Moore, Horspool is a perfect-match search algorithm.
  • Tarhio-Ukkonen Another variation of Boyer-Moore that allows for a certain number of mismatches is the Tarhio-Ukkonen algorithm. See Tarhio, J., Ukkonen, E., “Approximate Boyer-Moore String Matching”, SIAM J. Comput., 22, 243-260. Because it allows for mismatches, Tarhio-Ukkonen can be used to provide approximate alignments for next generation sequencing reads to a reference sequence. During a pre-processing phase, Tarhio-Ukkonen calculates shift values for every character that take into account a desired allowable number of mismatches, k. Tarhio-Ukkonen is able to identify matches in expected time
  • pattern matching algorithms When used for next generation sequencing alignment, pattern matching algorithms can be used as a fast local alignment tool.
  • pattern matching algorithms typically have performance that exceeds that of traditional local alignment algorithms for sequence data, such as Needleman-Wunsch and Smith-Waterman.
  • combinatorial complexity can quickly increase the number of comparisons to a point in which improvements in efficiency are lost.
  • the expected time for the Boyer-Moore family is multiplied by the number of linearized paths.
  • One solution to this problem is to use a multi-dimensional or graph-aware algorithm, such as if the alignment is advanced.
  • a graph-aware alignment algorithm is a modified Smith-Waterman algorithm that considers the highest scoring paths through a DAG (“Graph Smith Waterman”). Like Smith-Waterman, Graph Smith-Waterman is an exact alignment algorithm and thus is able to identify optimal local alignments that include insertions and deletions.
  • Graph Smith-Waterman can be slow relative to the alignment speed of a pattern matching algorithm.
  • Certain alignment algorithms have been developed that are able to provide accurate and relatively fast local alignments when used with DAGs or other graph data structures. For example, a Graph Smith-Waterman algorithm as described below calculates a Smith-Waterman score matrix between the sequence read and each node in a DAG, such that the maximum scores are propagated across the matrices for each node. In this way, the linear portions of the DAG with the highest scores in relation to the sequence read are serialized and chained together. One need only look back through the chained Smith-Waterman matrices to identify both the optimal alignment and path of the DAG.
  • the methods and systems of the invention use a modified Smith-Waterman operation that involves a multi-dimensional look-back through the reference DAG (such as the reference DAG 331 of FIG. 3 ).
  • Multi-dimensional operations of the invention provide for a “look-back” type analysis of sequence information (as in Smith-Waterman), wherein the look back is conducted through a multi-dimensional space that includes multiple pathways and multiple nodes.
  • the multi-dimensional algorithm can be used to align sequence reads against the graph-type reference. That alignment algorithm identifies the maximum value for Ci,j by identifying the maximum score with respect to each sequence contained at a position on the graph. In fact, by looking “backwards” at the preceding positions, it is possible to identify the optimum alignment across a plurality of possible paths.
  • the modified Smith-Waterman operation described here aka the multi-dimensional alignment, provides exceptional speed when performed in a genomic graph system that employs physical memory addressing (e.g., through the use of native pointers or index free adjacency as discussed above).
  • the combination of multi-dimensional alignment to the graph of reference DAG 331 with the use of spatial memory addresses (e.g., native pointers or index-free adjacency) improves what the computer system is capable of, facilitating whole genomic scale analysis to be performed using the methods described herein.
  • the operation includes aligning a sequence, or string, to a graph.
  • S be the string being aligned
  • D be the directed graph to which S is being aligned.
  • the elements of the string, S are bracketed with indices beginning at 1 .
  • each letter of the sequence of a node will be represented as a separate element, d.
  • node or edge objects contain the sequences and the sequences are stored as the longest-possible string in each object.
  • a predecessor of d is defined as:
  • the algorithm seeks the value of M[j,d], the score of the optimal alignment of the first j elements of S with the portion of the graph preceding (and including) d. This step is similar to finding Hi,j in equation 1 above. Specifically, determining M[j,d] involves finding the maximum of a, i, e, and 0, as defined below:
  • e is the highest of the alignments of the first j characters of S with the portions of the graph up to, but not including, d, plus an additional DELETE_PEN. Accordingly, if d is not the first letter of the sequence of the object, then there is only one predecessor, p, and the alignment score of the first j characters of S with the graph (up-to-and-including p) is equivalent to M[j,p]+DELETE_PEN. In the instance where d is the first letter of the sequence of its object, there can be multiple possible predecessors, and because the DELETE_PEN is constant, maximizing [M[j, p*]+DELETE_PEN] is the same as choosing the predecessor with the highest alignment score with the first j characters of S.
  • i is the alignment of the first j ⁇ 1 characters of the string S with the graph up-to-and-including d, plus an INSERT_PEN, which is similar to the definition of the insertion argument in SW (see equation 1).
  • a is the highest of the alignments of the first j characters of S with the portions of the graph up to, but not including d, plus either a MATCH_SCORE (if the jth character of S is the same as the character d) or a MISMATCH_PEN (if the jth character of S is not the same as the character d).
  • MATCH_SCORE if the jth character of S is the same as the character d
  • MISMATCH_PEN if the jth character of S is not the same as the character d.
  • a is the alignment score of the first j ⁇ 1 characters of S with the graph (up-to-and-including p), i.e., M[j ⁇ 1,p], with either a MISMATCH_PEN or MATCH_SCORE added, depending upon whether d and the jth character of S match.
  • d is the first letter of the sequence of its object, there can be multiple possible predecessors.
  • maximizing ⁇ M[j,p*]+MISMATCH_PEN or MATCH_SCORE ⁇ is the same as choosing the predecessor with the highest alignment score with the first j ⁇ 1 characters of S (i.e., the highest of the candidate M[j ⁇ 1,p*] arguments) and adding either a MISMATCH_PEN or a MATCH_SCORE depending on whether d and the jth character of S match.
  • the penalties e.g., DELETE_PEN, INSERT_PEN, MATCH_SCORE and MISMATCH_PEN, can be adjusted to encourage alignment with fewer gaps, etc.
  • the operation finds the optimal (e.g., maximum) value for a sequence read 103 to the reference DAG 331 by calculating not only the insertion, deletion, and match scores for that element, but looking backward (against the direction of the graph) to any prior nodes on the graph to find a maximum score.
  • the optimal e.g., maximum
  • FIG. 3 shows matrices 301 that represent the comparison.
  • the modified Smith-Waterman operation of the invention identifies the highest score and performs a backtrack to identify the proper alignment of the sequence. See, e.g., U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613, both incorporated by reference.
  • Systems and methods of the invention can be used to provide a report that identifies a modified base at the position within the genome of the subject.
  • identifying an alignment for the genomic information from the sample may include aligning 129 (e.g., using the global alignment 106 or the local alignment 108 ) one or more of the sequence reads 103 to the reference DAG 331 , as shown in FIG. 4 .
  • the branches to which the sequence reads 103 have been aligned are identified, along with the corresponding regions within the reference genome.
  • Graph Smith-Waterman is able to achieve similar computational speed as Smith-Waterman, yet does not suffer from the combinatorial explosion associated with linearizing complex regions of DAGs. In contrast, the performance of Graph Smith-Waterman scales linearly as the complexity of the DAG increases. Additionally, like Smith-Waterman, Graph Smith-Waterman is an exact alignment algorithm and thus is able to identify optimal local alignments that include insertions and deletions. However, when applied to a linear reference sequence or a DAG having low complexity, Graph Smith-Waterman can be slow compared to pattern matching algorithms.
  • FIG. 4 illustrates finding 129 alignments 401 between the sequence 103 (or a consensus sequence representing an assembly or one or more sequence reads 103 ) and the reference DAG 331 .
  • Alignments can be performed directly against the DAG using the Graph Smith-Waterman algorithm, for example; however, as will be discussed below, alignments may also be performed against the DAG using linear alignment or pattern matching algorithms by first linearizing the DAG.
  • FIG. 4 also illustrates an optional variant calling 807 step to identify a subject genotype 831 .
  • reads can be rapidly mapped to the reference DAG 331 despite their large numbers or short lengths.
  • Numerous benefits obtain by using a graph as a reference. For example, aligning against a graph is more accurate than aligning against a linear reference and then attempting to adjust one's results in light of other extrinsic information. This is primarily because the latter approach enforces an unnatural asymmetry between the sequence used in the initial alignment and other information. Aligning against an object that potentially represents all the relevant physical possibilities is much more computationally efficient than attempting to align against a linear sequence for each physical possibility (the number of such possibilities will generally be exponential in the number of junctions).
  • Many implementations provide methods that rely on a directed acyclic data structure with the data in an annotated transcriptome.
  • features e.g., exons and introns
  • An isoform from the real-world transcriptome is thus represented by a corresponding path through the DAG-based data structure.
  • a DAG is understood in the art to refer to data that can be presented as a graph as well as to a graph that presents that data.
  • the DAG can be stored as data that can be read by a computer system for bioinformatic processing or for presentation as a graph.
  • a DAG can be saved in any suitable format including, for example, a list of nodes and edges, a matrix or a table representing a matrix, an array of arrays or similar variable structure representing a matrix, in a language built with syntax for graphs, in a general markup language purposed for a graph, or others.
  • the DAG is a directed acyclic data structure
  • a reference graph may lack direction or not be acyclic.
  • Various embodiments are considered to be within the scope of the disclosure.
  • FIG. 5 shows the use of an adjacency list 502 for each vertex 505 and edge 509 .
  • a computer system 901 (shown in FIG. 10 ) creates the reference DAG 531 using an adjacency list 502 for each vertex and edge, wherein the adjacency list 502 for a vertex 505 or edge 509 lists the edges or vertices to which that vertex or edge is adjacent. Each entry in adjacency list 502 is a pointer to the adjacent vertex or edge.
  • each pointer identifies a physical location in the memory subsystem at which the adjacent object is stored.
  • the pointer or native pointer is manipulatable as a memory address in that it points to a physical location on the memory and permits access to the intended data by means of pointer dereference. That is, a pointer is a reference to a datum stored somewhere in memory; to obtain that datum is to dereference the pointer.
  • the feature that separates pointers from other kinds of reference is that a pointer's value is interpreted as a memory address, at a low-level or hardware level.
  • the speed and efficiency of the described graph genome engine allows a sequence to be queried against a large-scale genomic reference graph (e.g., the reference DAG 531 ) representing millions or billions of bases, using the computer system 901 .
  • a large-scale genomic reference graph e.g., the reference DAG 531
  • Such a graph representation provides means for fast random access, modification, and data retrieval.
  • index-free adjacency in that every element contains a direct pointer to its adjacent elements (e.g., as described in U.S. Pub. 2014/0280360 and U.S. Pub. 2014/0278590, each incorporated by reference), which obviates the need for index look-ups, allowing traversals (e.g., as done in the modified SW alignment operation described herein) to be very rapid.
  • Index-free adjacency is another example of low-level, or hardware-level, memory referencing for data retrieval (as required in alignment and as particularly pays off in terms of speed gains in the modified, multi-dimensional Smith-Waterman alignment described below).
  • index-free adjacency can be implemented such that the pointers contained within elements are references to a physical location in memory.
  • FIG. 5 is presented to illustrate a useful format.
  • methods of the invention use the stored graph (e.g., the reference genome) with sequence reads that are obtained from a subject.
  • sequence reads or obtained as an electronic article e.g., uploaded, emailed, or FTP transferred from a lab to the computer system 901 .
  • sequence reads are obtained by sequencing.
  • a DAG is stored as a list of nodes and edges.
  • One such way is to create a text file that includes all nodes, with an ID assigned to each node, and all edges, each with the node ID of starting and ending node.
  • a case-insensitive text file could be created.
  • Any suitable format could be used.
  • the text file could include comma-separated values. Naming this DAG “Jane” for future reference, in this format, the DAG “Jane” may read as follows: 1 see, 2 run, 3 jane, 4 run, 1-3, 2-3, 3-4.
  • this structure is applicable to the introns and exons represented in FIGS. 7 and 8 , and the accompanying discussion below.
  • a DAG is stored as a table representing a matrix (or an array of arrays or similar variable structure representing a matrix) in which the (i,j) entry in the N ⁇ N matrix denotes that node i and node j are connected (where N is a vector containing the nodes in genomic order).
  • N is a vector containing the nodes in genomic order.
  • a 0 entry represents that no edge is exists from node i to node j
  • a 1 entry represents an edge from i to j.
  • a matrix structure allows values other than 0 to 1 to be associated with an edge.
  • any entry may be a numerical value indicating a weight, or a number of times used, reflecting some natural quality of observed data in the world.
  • a matrix can be written to a text file as a table or a linear series of rows (e.g., row 1 first, followed by a separator, etc.), thus providing a simple serialization structure.
  • Embodiments of the invention include storing a DAG in a language built with syntax for graphs.
  • the DOT Language provided with the graph visualization software packages known as Graphviz provides a data structure that can be used to store a DAG with auxiliary information and that can be converted into graphic file formats using a number of tools available from the Graphviz web site.
  • Graphviz is open source graph visualization software.
  • Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has applications in networking, bioinformatics, software engineering, database and web design, machine learning, and in visual interfaces for other technical domains.
  • the Graphviz layout programs take descriptions of graphs in a simple text language, and make diagrams in useful formats, such as images and SVG for web pages; PDF or Postscript for inclusion in other documents; or display in an interactive graph browser.
  • a DAG is stored in a general markup language purposed for a graph, or others.
  • a language such as XML can be used (extended) to create labels (markup) defining nodes and their headers or IDs, edges, weights, etc.
  • a DAG is structured and stored, embodiments of the invention involve using nodes to represent features such as exons and introns. This provides a useful tool for analyzing sequence reads and discovering, identifying, and representing isoforms.
  • isoforms can be represented by paths through those fragments.
  • An exon's being “skipped” is represented by an edge connecting some previous exon to some later one.
  • a DAG is a multi-dimensional structure and not immediately amenable to alignment with a linear alignment tool.
  • a reference DAG can be linearized into a format appropriate for the algorithm. In certain embodiments, this can be performed by enumerating each of the possible paths in the subsection of a reference DAG.
  • FIG. 6 illustrates an exemplary DAG 601 representing a subsection of the reference DAG for a candidate mapping position (e.g., a local alignment position).
  • a depth-first search can be performed by starting at the first node and taking the left-most path.
  • the search returns to the previous node with an untaken path and attempts to take the left-most path again.
  • Each path is associated with a sequence represented by the nodes and edges in the path.
  • the sequence read can then be compared against each of the paths. For example, a depth-first search would identify the following paths and associated sequences from the subsection of the DAG 661 shown in the embodiment of FIG. 6 :
  • systems and methods according to the disclosure may enumerate the possible paths for a reference DAG or subsection of a reference DAG, identify a sequence associated with each possible path, and provide each sequence to an alignment tool (such as those disclosed herein) for alignment with a sequence read.
  • the subsection of the reference DAG could be linearized by walking the graph using a sliding window of n bases (wherein n is the length of the sequence read) and performing a comparison of the sequence read to the reference sequence in the window.
  • n is the length of the sequence read
  • the walk can similarly be performed using a depth-first search. For example, for the sequence read GTTTAG and subsection of the reference DAG 601 in the embodiment of FIG. 6 , a local alignment could proceed as shown in Table 1.
  • the highest scoring matches can be identified to obtain the locally aligned position for the sequence read. As shown in Table 1, a match occurs at the 14 th comparison. While a depth-first search is described for traversing the graph, various other searching algorithms or techniques may be used. For example, a breadth-first search could be substituted. Similarly, one could also employ an algorithm following the right-most path, or any previously unfollowed path. Various embodiments are considered to be within the scope of the disclosure.
  • pattern matching algorithms are particularly well suited for aligning sequence reads to DAGs. As known variations are already present in the graph, the pattern matching algorithm does not need to consider the presence of indels or other variations within a sequence read. Instead, the pattern matching algorithm simply considers each path available through the graph, such as in the manner described above. However, as noted above, while pattern matching algorithms are fast, the performance of such an algorithm (and other local alignment algorithms) can be affected based on the complexity of the DAG surrounding the local alignment position.
  • the complexity of a DAG at a given region indicates the amount of variation in the reference genome at that position.
  • the complexity of a DAG can be determined in a variety of ways.
  • the complexity of a DAG can be the number of nodes, number of vertices, number of branches, number of possible paths, and the like. Complexity may be determined for a subsection or region of the DAG, or for the entire DAG.
  • a DAG 701 shown in FIG. 7 includes information for a single nucleotide polymorphism (C or D) and an insertion (EG or EFG). There are a total of seven nodes and four paths through the DAG.
  • a DAG 801 shown in FIG. 8 indicates a highly variable region of the genome that includes information regarding various structural variations, SNPs, and small insertions and deletions. In this DAG 801 , there are 10 nodes and 21 possible paths.
  • the DAG complexity is low.
  • a local linear sequence alignment or pattern matching algorithm may be faster and more efficient than a graph aware algorithm.
  • the additional number of paths that can be tested by the algorithm does not significantly impact the advantages in speed gained by using a pattern matching algorithm (e.g., by using a shift table to avoid unnecessary computations).
  • the required number of comparisons can be performed in less processing time than by generating a plurality of matrices for each node, as required by Graph Smith-Waterman.
  • Graph Smith-Waterman may be more efficient than a linear sequence alignment or pattern matching algorithm because it does not suffer from the combinatorial explosion problem associated with linearizing complex DAGs.
  • FIG. 9 The relationship between the alignment processing time and local DAG complexity is shown in FIG. 9 as a graph 900 .
  • Graph 900 illustrates the relationship between processing time and DAG complexity for a pattern matching algorithm (such as Tarhio-Ukkonen) and a graph aware algorithm (such as Graph Smith-Waterman).
  • a pattern matching algorithm such as Tarhio-Ukkonen
  • Graph Smith-Waterman a graph aware algorithm
  • performance of Tarhio-Ukkonen and Graph Smith-Waterman is equal and either algorithm may be used without a deviation in performance.
  • Tarhio-Ukkonen becomes more inefficient due to the large number of paths that must be individually tested, and Graph Smith-Waterman may be preferred.
  • Tarhio-Ukkonen is more efficient and requires less processing time.
  • Price may also be a consideration.
  • cloud computing services such as Amazon AWS and Microsoft Azure offer cloud-based computational resources at a cost.
  • Tarhio-Ukkonen only when its performance is less than Graph Smith-Waterman in order to reduce processing time and increase cost efficiency.
  • a sequence read could be realigned depending on the resulting quality of the local alignment. For example, if a sequence read has an unknown variation (such as an insertion or deletion), and the complexity of a DAG at a candidate mapping position is low and a pattern matching algorithm is used, the resulting alignment may have low quality. This is because pattern matching algorithms, while fast, may not gracefully handle unknown variations. In these cases, an optimal alignment algorithm (such as Smith Waterman) could be subsequently performed to accurately identify the variation and improve the local alignment. Similarly, in high entropy or high variability areas of the genome, an optimal alignment algorithm could be substituted for a pattern matching algorithm to improve the quality of the alignment. The quality of an alignment may be expressed according to the Phred scale, for example.
  • more than two alignment algorithms may be selected based on the complexity. For example, one may consider using a pattern matching algorithm, an optimal alignment algorithm, and a graph-aware algorithm as complexity increases. In this way, systems and methods according to the invention may select from two, three, or more local alignment algorithms to significantly improve the performance of next-generation sequencing alignment using graph data structures.
  • FIG. 10 illustrates a computer system 901 suitable for performing methods of the invention.
  • the computer system 901 includes at least one computer 633 .
  • the computer system 901 may further include one or more of a server computer 909 and a sequencer 955 , which may be coupled to a sequencer computer 951 .
  • Each computer in the computer system 901 includes a processor 935 , 936 , or 937 coupled to a memory device and at least one input/output device.
  • the computer system 901 includes at least one processor 935 coupled to a memory subsystem 975 (e.g., a memory device or collection of memory devices).
  • the computer system 901 is operable to obtain a sequence generated by sequencing nucleic acid from a genome of a patient.
  • the system uses the processor to transform the sequence read 103 information into the reference DAG 331 .
  • processor refers to any device or system of devices that performs processing operations.
  • a processor will generally include a chip, such as a single core or multi-core chip, to provide a central processing unit (CPU).
  • CPU central processing unit
  • a processor may be provided by a chip from Intel or AMD.
  • a processor may be any suitable processor such as the microprocessor sold under the trademark XEON E7 by Intel (Santa Clara, Calif.) or the microprocessor sold under the trademark OPTERON 6200 by AMD (Sunnyvale, Calif.).
  • the memory subsystem 975 contains one or any combination of memory devices.
  • a memory device is a mechanical device that stores data or instructions in a machine-readable format.
  • Memory may include one or more sets of instructions (e.g., software) which, when executed by one or more of the processors of the disclosed computers can accomplish some or all of the methods or functions described herein.
  • each computer includes a non-transitory memory device such as a solid state drive, flash drive, disk drive, hard drive, subscriber identity module (SIM) card, secure digital card (SD card), micro SD card, or solid-state drive (SSD), optical and magnetic media, others, or a combination thereof.
  • SIM subscriber identity module
  • SD card secure digital card
  • SSD solid-state drive
  • the computer system 901 is operable to produce a report and provide the report to a user via an input/output device.
  • An input/output device is a mechanism or system for transferring data into or out of a computer.
  • Exemplary input/output devices include a video display unit (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), a printer, an alphanumeric input device (e.g., a keyboard), a cursor control device (e.g., a mouse), a disk drive unit, a speaker, a touchscreen, an accelerometer, a microphone, a cellular radio frequency antenna, and a network interface device, which can be, for example, a network interface card (NIC), Wi-Fi card, or cellular modem.
  • NIC network interface card
  • Wi-Fi card Wireless Fidelity
  • the reference DAG is stored in the memory subsystem using adjacency techniques, which may include pointers to identify a physical location in the memory subsystem 675 where each vertex is stored.
  • the graph is stored in the memory subsystem 675 using adjacency lists.

Abstract

Systems and methods for analyzing genomic information can include obtaining a sequence read including genetic information; identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment can include a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the mapped position of the sequence read can be identified within the genome.

Description

    FIELD OF THE INVENTION
  • The invention relates to systems and methods for adaptive local alignment for graph genomes.
  • BACKGROUND
  • A person's genetic information has the potential to reveal much about their health and life. A risk of cancer or a genetic disease may be revealed by the sequences of the person's genes, as well as the possibility that his or her children could inherit a genetic disorder. Genetic information can also be used to identify an unknown organism, such as potentially infectious agents discovered in samples from public food or water supplies. Next-generation sequencing (NGS) technologies are available that can sequence entire genomes quickly. Some approaches to analyzing sequence reads involve mapping the sequence reads to a reference genome. Mapping reads to a reference can be done by aligning each read to a relevant portion of the reference.
  • SUMMARY
  • The described methods and systems provide a method for automatically analyzing sequence data using a graph, such as a reference directed acyclic graph (DAG), and accounting for the various difficulties (e.g., complexities) of sequence alignment and genomic variation without requiring intermediate input or guidance from a user. Essentially, the described methods and systems simplify and optimize the process for sequence alignment between a sample sequence and a reference genome because the systems and methods function autonomously from the user (after the sequence information is received). A processor can, for example, identify and align the sequence information in a manner that meets a user's requirements for accuracy and expediency. In some cases, the sequence information can be directly transmitted to the systems described herein, which helps to provide a vastly simplified sequence alignment experience for the user.
  • Representing genomes as graphs, such as directed acyclic graphs (DAGs), is a powerful tool for coping with the complexities of sequence alignment and genomic variation. In contrast to a single linear reference sequence, a DAG can incorporate myriad types of information about genomic variations, such as single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and larger structural variants (SVs). Each of these variations can be represented by following different branches through the DAG. Additionally, using DAGs can help to increase the probability of identifying small variations near large known structural variants because the presence of the known structural variants in the graph improves the quality of the alignment. Thus, known variations can be reliably accounted for and identified by aligning reads containing the known variations to a sequence path that includes those variations.
  • However, the complex nature of DAGs can result in a combinatorial explosion when used with conventional linear alignment tools thereby rendering alignment to a DAG computationally expensive and cost prohibitive. The methods and systems described throughout this document can automatically and efficiently analyze sequence data using the reference DAG even when doing so would conventionally be undesirable (e.g., computationally expensive, time consuming, imprecise, and/or cost prohibitive). The systems and methods can analyze the difficulty of aligning the sequence data with the reference DAG based on a predicted alignment difficulty (e.g., due to the DAG complexity, sequence length, total processing time, remaining processing time, or any combination thereof) between the reference DAG and the sequence data. By choosing and employing a selected alignment algorithm based on this predicted alignment difficulty, the methods and systems can efficiently analyze sequence data.
  • In one aspect of the invention, a method includes identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph including nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment includes a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the method further includes identifying the mapped position of the sequence read within the reference genome.
  • In another aspect of the invention, the system for determining a subject's genetic information, the system includes a computer system comprising a processor coupled to memory and operable to: receive identities of a plurality of nucleotides at known locations on a reference genome; receive sequence reads from a sample from a subject; and map the sequence reads to the reference genome, thereby identifying a corresponding location on the reference genome. The mapping includes identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes; determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic; and performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic. The advanced local alignment includes a first-local-alignment algorithm, and the basic local alignment includes a second-local-alignment algorithm. Based on the local alignments, the system identifies the mapped position of the sequence read within the reference genome.
  • Implementations can include one or more of the following features.
  • In some implementations, the method further includes performing, by means of the computer system, a global alignment for the sequence read based on whether the global alignment is advanced or basic. The advanced global alignment includes a first-global-alignment algorithm, and the basic global alignment includes alignment includes a second-global-alignment algorithm. The global alignment provides information indicative of the plurality of candidate mapping positions.
  • In certain implementations, the first-local-alignment algorithm is different from the second-local-alignment algorithm.
  • In some implementations, the first-global-alignment algorithm is different from the second-global-alignment algorithm.
  • In certain implementations, at least one of the first-local-alignment algorithm and the second-local-alignment algorithm is different from at least one of the first-global-alignment algorithm and the second-global-alignment algorithm.
  • In some implementations, the method further includes determining whether the local alignment is advanced or basic based at least in part on at least one of: the complexity of the graph, a length of the graph, a total processing time, a remaining processing time, and a number of repeating elements.
  • In certain implementations, a complex graph includes 10 or more nodes.
  • In some implementations, a simple graph includes 5 or fewer nodes.
  • In certain implementations, the first-local-alignment algorithm includes a pattern matching algorithm.
  • In some implementations, the pattern matching algorithm includes at least one of the following: a Boyer-Moore algorithm, a Horspool algorithm, and a Tarhio-Ukkonen algorithm.
  • In certain implementations, the second-local-alignment algorithm includes a linear alignment algorithm.
  • In some implementations, each node and edge stores a list of one or more adjacent objects, and wherein identifying the plurality of candidate mapping positions comprises finding related alignments between the sequence read and the reference genome.
  • In certain implementations, candidate mapping positions relate to the genetic information if a predetermined alignment quality is met between the candidate mapping position and the sequence read.
  • In some implementations, the method further includes ranking each of the candidate mapping positions based on the quality of the local alignment.
  • In certain implementations, each of the plurality of candidate mapping positions defines a selected path through the graph.
  • In some implementations, the processor is further operable to: perform a global alignment, by means of the computer system, for the sequence read based on whether the global alignment is advanced or basic. The advanced global alignment includes a first-global-alignment algorithm, and the basic global alignment includes a second-global-alignment algorithm. The global alignment provides information indicative of the plurality of candidate mapping positions.
  • In certain implementations, each node and edge stores a list of one or more adjacent objects, and wherein identifying the plurality of candidate mapping positions comprises finding related alignments between the sequence read and the candidate mapping position.
  • In some implementations, the complexity of the graph varies with a remaining processing time estimate to reduce a total processing time.
  • In certain implementations, the first-local-alignment algorithm is different from the second-local-alignment algorithm.
  • In some implementations, the first-global-alignment algorithm is different from the second-global-alignment algorithm.
  • In certain implementations, at least one of the first-local-alignment algorithm and the second-local-alignment algorithm is different from at least one of the first-global-alignment algorithm and the second-global-alignment algorithm.
  • In some implementations, the mapping further includes determining whether the local alignment is advanced or basic based at least in part on at least one of: the complexity of the graph, a length of the graph, a total processing time, a remaining processing time, and a number of repeating elements.
  • In certain implementations, a complex graph includes 10 or more nodes.
  • In some implementations, a simple graph includes 5 or fewer nodes.
  • In certain implementations, the first-local-alignment algorithm includes a pattern matching algorithm.
  • In some implementations, the pattern matching algorithm includes at least one of the following: a Boyer-Moore algorithm and a Tarhio-Ukkonen algorithm.
  • In certain implementations, the second-local-alignment algorithm includes a linear alignment algorithm.
  • In some implementations, the candidate mapping positions relate to the genetic information if a predetermined alignment quality is met between the candidate mapping position and the sequence read.
  • In certain implementations, the mapping further includes ranking each of the candidate mapping positions based on the quality of the local alignment.
  • In some implementations, each of the plurality of candidate mapping positions defines a selected path through the graph.
  • Implementations can include one or more of the following advantages.
  • Many of the methods described herein can account for the complexity of a reference graph by adjusting the alignment methods based on the complexity. Regions of a reference graph with a large number of nodes will typically result in an exponential number of calculations, which leads to excessive cost and processing time during the alignment process. Other sequence alignment methods, such as the Graph Smith-Waterman algorithm, have been adapted to accommodate this complexity and to work with DAGs (e.g., as disclosed in U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613, both incorporated by reference); however, these methods have a high initial computational cost and therefore are not amenable for use with large data sets. By adjusting the alignment methods based on the complexity of the reference DAG, an optimal alignment method can be used. For example, if genomic information is being aligned with a relatively simple reference graph, a linear alignment method can be used to save resources (e.g., time and computational costs). In other cases, if the reference graph is relatively complex, other sequence alignment methods, such as the modified Graph Smith-Waterman algorithm may be an alignment tool well-suited to the task despite the high initial computational costs.
  • In some implementations, the predicted alignment difficulty can vary during the alignment process. The alignment techniques described herein can continually assess the alignment difficulty and dynamically adapt the alignment tools in response to any changes. For example, two or more alignment tools can be used during the same alignment process.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 diagrams a method for aligning sequence reads to a reference genome.
  • FIG. 2 diagrams a method for performing a global alignment of a sequence read to a reference genome.
  • FIG. 3 shows the matrices that represent the comparison.
  • FIG. 4 illustrates finding alignments between a sequence and a reference DAG.
  • FIG. 5 shows the use of an adjacency list for each vertex and edge in a DAG.
  • FIG. 6 shows an exemplary DAG and sequence read.
  • FIG. 7 shows an exemplary DAG representing a single nucleotide polymorphism (C or D) and an insertion (EG or EFG).
  • FIG. 8 shows an exemplary DAG representing a highly variable region of the genome that includes information regarding various structural variations, SNPs, and small insertions and deletions.
  • FIG. 9 illustrates the relationship between processing time and DAG complexity for a pattern matching algorithm (such as Tarhio-Ukkonen) and a graph aware alignment algorithm (such as a Graph Smith-Waterman alignment method).
  • FIG. 10 illustrates a system for implementing the methods diagramed in FIGS. 1 and 2.
  • DETAILED DESCRIPTION
  • In general, the methods and systems described herein relate to determining a subject's genetic information. Information representing at least a portion of the subject's DNA (e.g., a sequence read) is globally aligned with a graph, such as a reference genomic directed acyclic graph (DAG), to identify candidate paths or mapping positions through the genomic DAG. The genomic information can then be compared against the candidate paths with a higher degree of accuracy as compared to the original global alignment. This two-step alignment analysis allows for the selection of an appropriate alignment algorithm based on the alignment difficulty and an acceptable alignment tolerance, thereby reaping considerable computational savings resources because resources are not needlessly expended.
  • Additionally, the present disclosure recognizes, in certain embodiments, that the problem of combinatorial explosion associated with aligning sequence data to genomic DAGs is solved by an alignment method that analyzes the complexity of a DAG and chooses an alignment algorithm based on the complexity. For example, sequence reads can be quickly aligned to portions of a DAG that have low complexity using a pattern matching algorithm or a linear alignment algorithm. However, for complex portions of the DAG, the number of paths that must be analyzed by a pattern matching or linear alignment algorithm grows exponentially, leading to a corresponding exponential decrease in performance of the pattern matching or linear alignment tool. In these cases, a graph-aware algorithm may be substituted. The graph-aware algorithm may have a high initial overhead cost, but its decrease in performance scales linearly as the complexity of the graph increases, making it a preferred choice for complex regions of a DAG.
  • The following describes methods for indexing collections of data strings represented in a form of DAGs and performing efficient string search including exact and approximate matching techniques. Among other applications, these methods can be used to quickly and efficiently find genomic data sequences (e.g., reads produced by DNA sequencing machines) in genome graphs representing genome data variations.
  • FIG. 1 diagrams a method 101 for analyzing genomic sequence information. At step 102 sequence reads from a sample from a subject are received. Sequence reads can be received from a nucleic acid sequencing instrument, such as an Illumina® HiSeq2500, a Roche 454 GS FLX+, an Ion Torrent PGM™, and the like. A processor coupled to the tangible memory device is used to globally align 104 the sequence read against the reference genome to identify 106 one or more candidate mapping positions. These candidate mapping positions represent acceptable alignments between the received sequence reads and the paths through the reference DAG (e.g., a location within the reference genome that could be related to the sequence read). In some cases, a report can be provided that identifies a one or more of the candidate mapping positions.
  • Next-generation sequencing and alignment can be computationally expensive. For example, over 900 million paired-end 100 bp sequence reads are needed to obtain 30× coverage of the human genome. As will be described in further detail below, certain algorithms, such as various pattern matching and linear alignment algorithms, scan each sequence read over a reference sequence to identify an exact, optimal, or best-fit match. But this approach is typically computationally infeasible given the scales associated with next-generation sequencing. To solve this problem, a heuristic approach may be used to reduce the total number of computations required. For example, one approach is to separate an alignment into “coarse” and “fine” stages. In the coarse stage, a fast, yet often incomplete or inexact alignment is performed to identify a number of candidate regions to which a particular sequence read aligns. This is referred to as a global alignment or a global search because the read has aligned considering the entirety of the reference sequence. In the second, fine stage, each candidate region is subsequently analyzed using a more computationally expensive, yet exact algorithm. This is referred to as a local alignment or local search because the alignment is restricted to the local space identified during the coarse stage. In certain examples, this may be referred to as a “seed and extend” strategy; the global alignment yields a number of seed positions, which are then extended during the local alignment stage. The candidate positions may then be scored and ranked, yielding the best aligned position for each sequence read.
  • For example, one approach for performing a global alignment 104 is to place many small sections, or k-mers, of the reference DAG into a computer hash function. The hash function calculates an index as a function of the k-mer. The index identifies an entry in a hash table, and the positions of the k-mer within the reference DAG are stored in that entry, i.e., in the entry indexed by the hash of the k-mer. A sequence read is analyzed by hashing some or all of its k-mers and the corresponding entries of the DAG hash table are accessed to read positions within the DAG where those sequence read k-mers can be found. Sections of the DAG where a threshold number of k-mer positions are found are identified as candidate regions within the represented genomes for alignment or mapping of the sequence read. By the described implementation, a sequence read can be mapped to a large number of “good fit” positions within a reference genome very rapidly.
  • A sequence read can be quickly mapped to a reference using a global alignment algorithm because hash values can be retrieved rapidly (i.e., without iterating over entries in an array). Since a DAG can store a great amount of genetic variation, a sequence read can be mapped to an appropriate portion of one or more certain genomes from among numerous candidate genomes very rapidly. Since sequence reads can be mapped to biologically relevant reference genomes very rapidly, a patient's genetic information can be discerned, or genetic sequences can be identified quickly even against a background of all genetic variety represented in the system. For example, aligning a patient's sequence reads to a DAG representing variations in cancer genomes (such as those from The Cancer Genome Atlas) can be used to quickly identify particular mutations associated with cancer present in the patient.
  • Once a plurality of candidate mapping positions has been identified 106 by a global searching algorithm 104, a subsection of the graph at each candidate mapping position is identified. In some examples, the subsection of the graph includes at least the number of base pairs in the sequence read for searching (e.g., 50, 100, 200 bp). However, longer subsections can be preferable for local search. Using longer subsections is particularly advantageous when using paired-end or mate-pair sequence reads, which are pairs of sequence reads describing the 5′ and 3′ ends of a single DNA molecule, often with an inferred insert size distance of 300-600 (paired-end) or 2,000-5,000 (mate pair) base pairs between the 5′ and 3′ sequence reads. Accordingly, the portions or subsections of the graph for each of the candidate mapping positions can be a variety of sizes ranging from 50 to even 10,000 base pairs.
  • The processor can then perform 108 a local alignment for each of the candidate mapping positions. With the candidate mapping positions (e.g., “good fit” positions) identified, a processor can perform 108 a local alignment at each of the candidate mapping positions to narrow the candidate mapping positions to identify one or more mapped positions (e.g., “better fit” or “best fit” positions). A processor can start 110 the local alignment process and then determine 112 whether an alignment with the reference DAG surrounding the candidate mapping positions (e.g., the subset of the reference DAG surrounding a candidate mapping position) is advanced or basic. In some examples, an alignment is advanced if the complexity of the subset of the reference DAG exceeds 5 paths. In other examples, the alignment is considered basic if the complexity of the subset of the reference DAG is 5 paths or less. If the alignment is advanced 114, then the processor can use a first alignment tool implementing a first method 116. If the alignment is not advanced (i.e., basic), then the processor can use a second alignment tool implementing a second method 118. The processor can repeat 120 this process until all candidate mapping positions are locally aligned. The processor can then rank 122 the candidate mapping positions to identify the most likely alignment positions corresponding to the received sequence read before ending 124 the local alignment process. Based on these rankings, the processor can identify 126 the mapped position within an acceptable margin of error.
  • While the method 101 comprises performing 104 a global alignment against a reference DAG, in certain embodiments, a global alignment may be simply performed against a linear reference sequence. This can be advantageous in certain embodiments where using a linear reference sequence for the global alignment can help improve performance. However, the local alignment 108 may still consider a reference DAG structure. Various embodiments are considered to be within the scope of the disclosure.
  • In certain embodiments, a global alignment may also consider whether an alignment with the DAG is advanced or basic and select an appropriate algorithm accordingly. FIG. 2 illustrates an exemplary global alignment process 201 for identifying candidate mapping positions for a sequence read. As with the local alignment process 108, the global alignment process 201 can evaluate the reference DAG to determine if the global alignment is advanced or basic. For example, this may be performed in situations wherein the reference DAG is small enough such that pattern matching and linear alignment algorithms may be used for the global alignment stage. For example, a processor can start 202 the local alignment process and then determine 204 whether an alignment with the reference DAG and the sequence read is advanced or basic. In some examples, an alignment is advanced if the complexity of the DAG exceeds 6 paths, as discussed below. In other examples, the alignment is basic if the complexity of the DAG is 6 paths or less. If the alignment is advanced 206, then the processor can use a first alignment tool implementing a first method 208. If the alignment is not advanced (i.e., basic), then the processor can use a second alignment tool implementing a second method 210. The first method 208 and the second method 210 may be the same, different, or partially overlap. In some examples, the first method 208 and/or the second method 210 can correspond to at least one of the first method 116 or the second method 118. Using these alignment tools, the processor can identify 212 the candidate mapping positions. The processor can then repeat this process until the global alignment is complete 214. The processor can then end 216 the global alignment process 201.
  • If the alignment is basic 118, the second alignment tool may comprise linear alignment methods, such as optimal alignment or pattern matching algorithms. For example, the second alignment tool may comprise an optimal alignment algorithm, such as a Smith-Waterman algorithm, a Needleman-Wunsch algorithm, and the like. These algorithms use a dynamic programming approach in order to find an optimal local alignment between two sequences, but are computationally expensive. To improve performance, pattern matching or string matching algorithms, such as a Boyer-Moore algorithm, a Horspool algorithm, and a Tarhio-Ukkonen algorithm may be used. These algorithms are fast, but typically do not handle variations (such as indels and SNPs) as well as an optimal alignment algorithm. However, pattern matching and string matching algorithms are particularly well suited for aligning sequence reads against graph genomes, such as DAGs. As known variations are already present in the graph, the pattern matching algorithm does not need to consider the presence of indels or other variations within a sequence read; it need only map the sequence read to the most likely branch within the DAG having that variation.
  • While pattern matching and linear alignment algorithms can function as a fast local alignment tool, each of these methods is easily susceptible to efficiency loss when faced with a high number of comparisons when, for example, a linearized graph genome represents a significant number of combinations. Accordingly, if the alignment is advanced 114, the first alignment tool may comprise a graph-aware or multi-dimensional algorithm, such as Graph Smith-Waterman, as explained in further detail below.
  • Optimal Alignment Algorithms
  • Pairwise alignment generally involves placing one sequence along part of target, introducing gaps according to an algorithm, scoring how well the two sequences match, and preferably repeating for various positions along the reference. The best-scoring match is deemed to be the alignment and represents an inference of homology between alignment portions of the sequences. In some embodiments, scoring an alignment of a pair of nucleic acid sequences involves setting values for the probabilities of substitutions and indels. When individual bases are aligned, a match or mismatch contributes to the alignment score by a substitution score, which could be, for example, 1 for a match and −0.33 for a mismatch. An indel deducts from an alignment score by a gap penalty, which could be, for example, −1. Gap penalties and substitution probabilities can be based on empirical knowledge or a priori assumptions about how sequences evolve. Their values affect the resulting alignment. Particularly, the relationship between the gap penalties and substitution probabilities influences whether substitutions or indels will be favored in the resulting alignment.
  • Stated formally, an alignment represents an inferred relationship between two sequences, x and y. For example, in some embodiments, an alignment A of sequences x and y maps x and y respectively to another two strings x′ and y′ that may contain spaces such that: (i) |x′|=|y′|; (ii) removing spaces from x′ and y′ should get back x and y, respectively; and (iii) for any i, x′[i] and y′[i] cannot be both spaces.
  • A gap is a maximal substring of contiguous spaces in either x′ or y′. An alignment A can include the following three kinds of regions: (i) matched pair (e.g., x′[i]=y′[i]; (ii) mismatched pair, (e.g., x′[i]—y′[i] and both are not spaces); or (iii) gap (e.g., either x′[i . . . j] or y′[i . . . j] is a gap). In certain embodiments, only a matched pair has a high positive score a. In some embodiments, a mismatched pair generally has a negative score b and a gap of length r also has a negative score g+rs where g, s<0. For DNA, one common scoring scheme (e.g. used by BLAST) makes score a=1, score b=−3, g=−5 and s=−2. The score of the alignment A is the sum of the scores for all matched pairs, mismatched pairs and gaps. The alignment score of x and y can be defined as the maximum score among all possible alignments of x and y.
  • Any pair may have a score a defined by a 4×4 matrix B of substitution probabilities. For example, B(i,i)=1 and 0<B(i,j)<1 [for i≠j], is one possible scoring system. For instance, where a transition is thought to be more biologically probable than a transversion, matrix B could include B(C,T)=0.7 and B(A,T)=0.3, or other values desired or determined by methods known in the art.
  • A pairwise alignment, generally, involves—for sequence Q (query) having m characters and a reference genome T (target) of n characters—finding and evaluating possible local alignments between Q and T. For any 1≦i≦n and 1≦j≦m, the largest possible alignment score of T[h . . . i] and Q[k . . . j], where h≦i and k≦j, is computed (i.e. the best alignment score of any substring of T ending at position i and any substring of Q ending at position j). This can include examining all substrings with cm characters, where c is a constant depending on a similarity model, and aligning each substring separately with Q. Each alignment is scored, and the alignment with the preferred score is accepted as the alignment. One of skill in the art will appreciate that there are exact and approximate algorithms for sequence alignment. Exact algorithms will find the highest scoring alignment, but can be computationally expensive. Two well-known exact algorithms are Needleman-Wunsch (J Mol Biol, 48(3):443-453, 1970) and Smith-Waterman (J Mol Biol, 147(1):195-197, 1981; Adv. in Math. 20(3), 367-387, 1976). A further improvement to Smith-Waterman by Gotoh (J Mol Biol, 162(3), 705-708, 1982) reduces the calculation time from O(m2n) to O(mn) where m and n are the sequence sizes being compared and is more amendable to parallel processing. In the field of bioinformatics, it is Gotoh's modified algorithm that is often referred to as the Smith-Waterman algorithm. Smith-Waterman approaches are being used to align larger sequence sets against larger reference sequences as parallel computing resources become more widely and cheaply available. See, e.g., Amazon's cloud computing resources. All of the journal articles referenced herein are incorporated by reference in their entireties.
  • The original Smith-Waterman (SW) algorithm is a linear alignment algorithm that aligns linear sequences by rewarding overlap between bases in the sequences, and penalizing gaps between the sequences. Smith-Waterman also differs from Needleman-Wunsch, in that SW does not require the shorter sequence to span the string of letters describing the longer sequence. That is, SW does not assume that one sequence is a read of the entirety of the other sequence. Furthermore, because SW is not obligated to find an alignment that stretches across the entire length of the strings, a local alignment can begin and end anywhere within the two sequences.
  • The original SW algorithm is expressed for an n×m matrix H, representing the two strings of length n and m, in terms of equation (1):

  • H_k0=H_01=0 (for 0≦k≦n and 0≦l≦m)

  • H_ij=max{H_(i−1,j−1)+s(a_i,b_j),H_(i−1,j)−W_in,H_(i,j−1)−W_del,0}  (1)
  • (for 1≦i≦n and 1≦j≦m)
  • In the equations above, s(ai,bj) represents either a match bonus (when ai=bj) or a mismatch penalty (when ai≠bj), and insertions and deletions are given the penalties Win and Wdel, respectively. In most instances, the resulting matrix has many elements that are zero. This representation makes it easier to backtrace from high-to-low, right-to-left in the matrix, thus identifying the alignment.
  • Once the matrix has been fully populated with scores, the SW algorithm performs a backtrack to determine the alignment. Starting with the maximum value in the matrix, the algorithm will backtrack based on which of the three values (Hi−1,j−1, Hi−1,j, or Hi,j−1) was used to compute the final maximum value for each cell. The backtracking stops when a zero is reached. The optimal-scoring alignment may contain greater than the minimum possible number of insertions and deletions, while containing far fewer than the maximum possible number of substitutions.
  • SW or SW-Gotoh may be implemented using dynamic programming to perform local sequence alignment of the two strings, S and A, of sizes m and n, respectively. This dynamic programming employs tables or matrices to preserve match scores and avoid re-computation for successive cells. Each element of the string can be indexed with respect to a letter of the sequence, that is, if S is the string ATCGAA, S[1]=A.
  • Instead of representing the optimum alignment as Hi,j (above), the optimum alignment can be represented as B[j,k] in equation (2) below:

  • B[j,k]=max(p[j,k],i[j,k],d[j,k],0) (for 0<j≦m,0<k≦n)  (2)
  • The arguments of the maximum function, B[j,k], are outlined in equations (3)-(5) below, wherein MISMATCH_PEN, MATCH_BONUS, INSERTION_PEN, DELETION_PEN, and OPENING_PEN are all constants, and all negative except for MATCH_BONUS (PEN is short for PENALTY). The match argument, p[j,k], is given by equation (3), below:

  • p[j,k]=max(p[j−1,k−1],i[j−1,k−1],d[j−1,k−1])+MISMATCH_PEN, if S[j]A[k]=max(p[j−1,k−1],i[j−1,k−1],d[j−1,k−1])+MATCH_BONUS, if S[j]=A[k]  (3)
  • the insertion argument i[j,k], is given by equation (4), below:

  • i[j,k]=max(p[j−1,k]+OPENING_PEN,i[j−1,k],d[j−1,k]+OPENING_PEN)+INSERTION_PEN  (4)
  • and the deletion argument d[j,k], is given by equation (5), below:

  • d[j,k]=max(p[j,k−1]+OPENING_PEN,i[j,k−1]+OPENING_PEN,d[j,k−1])+DELETION_PEN  (5)
  • For all three arguments, the [0,0] element is set to zero to assure that the backtrack goes to completion, i.e., p[0,0]=i[0,0]=d[0,0]=0.
  • The scoring parameters are somewhat arbitrary, and can be adjusted to achieve the behavior of the computations. One example of the scoring parameter settings (Huang, Chapter 3: Bio-Sequence Comparison and Alignment, ser. Curr Top Comp Mol Biol. Cambridge, Mass.: The MIT Press, 2002) for DNA would be:
  • MATCH_BONUS: 10
  • MISMATCH_PEN: −20
  • INSERTION_PEN: −40
  • OPENING_PEN: −10
  • DELETION_PEN: −5
  • The relationship between the gap penalties (INSERTION_PEN, OPENING_PEN) above help limit the number of gap openings, i.e., favor grouping gaps together, by setting the gap insertion penalty higher than the gap opening cost. Of course, alternative relationships between MISMATCH_PEN, MATCH_BONUS, INSERTION_PEN, OPENING_PEN and DELETION_PEN are possible.
  • Smith-Waterman and Needleman-Wunsch are exact alignment algorithms and identify optimal alignments, including those with short insertions and deletions. However, it can be as computationally expensive as naïve pattern matching algorithms, and has a processing time typically on the order of O(mn). As a result, the use of the Smith-Waterman algorithm is primarily reserved for local alignment. However, when applied to a multi-dimensional data structure such as a reference DAG, these algorithms can take an excessive amount of time to complete. In these cases, more efficient pattern matching algorithms may be substituted.
  • Pattern Matching Algorithms
  • Pattern matching algorithms use mathematical techniques to reduce the total number of pairwise comparisons in order to improve efficiency. In a short read alignment, a short read P having n nucleotides is aligned with the left-most portion of a reference sequence T, creating a window of length n. After a match or mismatch determination, the window is shifted to the right for additional comparisons. Pattern matching algorithms increase local alignment efficiency by shifting the window farther to the right that a single character, i.e., by skipping a certain number of comparisons which would result in a mismatch. In particular, some pattern matching algorithms process a sequence read, or pattern, in order to create a “shift table” that indicates how many comparisons can safely be skipped depending on the mapping of the short read at the current position.
  • To create a shift table, many pattern matching algorithms operate in two phases: a pre-processing phase and a search phase. The pre-processing phase typically processes the pattern itself to generate information that can aid the search phase by providing a table of shift values after each comparison. Improvement in efficiency of a search can be achieved by altering the order in which the characters are compared at each attempt, and by choosing a shift value that permits the skipping of a predefined number of characters in the text after each attempt. The search phase then uses this information to ignore certain character comparisons, reducing the total number of comparisons and thus the overall execution time.
  • The Boyer-Moore algorithm is one of the most efficient. See Boyer, R. S., Moore, J. S., “A Fast String Searching Algorithm”, Comm. ACM 20(10), 762-772. Boyer-Moore preprocesses a pattern to create two tables: a Boyer-Moore bad character (bmBc) table and a Boyer-Moore good suffix (bmGs) table. For each character in the alphabet (e.g., “A”, “C”, G″, and “T”, representing the four possible nucleotides), the bad character table stores a shift value based on the occurrence of the character in the pattern (i.e., short read). The good-suffix table stores the matching shift value for each character in the read. The maximum of the shift value in either table is considered after each comparison during a search, which is then used to advance the window. Boyer-Moore does not allow any mismatches between the read and the reference sequence, and thus is a perfect-match search algorithm.
  • Boyer-Moore serves as the basis for several pattern matching algorithms that can be used for sequence alignment. One variation of Boyer-Moore is the Horspool algorithm. See Horspool, R. N., “Practical Fast Searching in Strings”, Software—Practice & Experience, 10, 501-506. Horspool simplifies Boyer-Moore by recognizing that the bad-character shift of the right-most character of the window can be sufficient to compute the value of the shift. These shift values are then determined in the pre-processing stage for all of the characters in the alphabet, i.e., the possible nucleotides. Horspool is particularly efficient in practical situations when the length of the pattern is small, such as for sequence reads generated by next-generation sequencing. Like Boyer-Moore, Horspool is a perfect-match search algorithm.
  • Another variation of Boyer-Moore that allows for a certain number of mismatches is the Tarhio-Ukkonen algorithm. See Tarhio, J., Ukkonen, E., “Approximate Boyer-Moore String Matching”, SIAM J. Comput., 22, 243-260. Because it allows for mismatches, Tarhio-Ukkonen can be used to provide approximate alignments for next generation sequencing reads to a reference sequence. During a pre-processing phase, Tarhio-Ukkonen calculates shift values for every character that take into account a desired allowable number of mismatches, k. Tarhio-Ukkonen is able to identify matches in expected time
  • ( kn ( 1 m - k + k c ) ) ,
  • where c is the size of the alphabet. The use of approximation in pattern matching algorithms such as Tarhio-Ukkonen is particularly useful for next generation sequencing alignments due to inherent noise present in sequence reads generated by most next generation sequencing technologies. Furthermore, because the primary goal of most genomic sequence analyses is to identify unknown variations (primarily, SNPs) that may be present in a sample, approximate pattern matching algorithms help to identify novel polymorphisms in sequence data. In contrast, the Boyer-Moore and Horspool algorithms would reject such reads if there were any differences between the read and the reference.
  • When used for next generation sequencing alignment, pattern matching algorithms can be used as a fast local alignment tool. In particular, pattern matching algorithms typically have performance that exceeds that of traditional local alignment algorithms for sequence data, such as Needleman-Wunsch and Smith-Waterman. However, when applied to a graph genome, combinatorial complexity can quickly increase the number of comparisons to a point in which improvements in efficiency are lost. In other words, for a local alignment to a subset of a reference DAG, the expected time for the Boyer-Moore family is multiplied by the number of linearized paths. Additionally, despite variants that are able to address mismatches and small insertions and deletions, the Boyer-Moore family of algorithms do not perform particularly well when handling alignments involving novel variations (that are not already present in the graph). Exact alignment algorithms, such as Needleman-Wunsch and Smith-Waterman are able to accurately identify variations; however, these algorithms run in O(mn) time and similarly suffer from combinatorial explosion when applied to a DAG.
  • One solution to this problem is to use a multi-dimensional or graph-aware algorithm, such as if the alignment is advanced. One example of a graph-aware alignment algorithm is a modified Smith-Waterman algorithm that considers the highest scoring paths through a DAG (“Graph Smith Waterman”). Like Smith-Waterman, Graph Smith-Waterman is an exact alignment algorithm and thus is able to identify optimal local alignments that include insertions and deletions. However, when applied to a linear reference or a DAG having low complexity, Graph Smith-Waterman can be slow relative to the alignment speed of a pattern matching algorithm.
  • Graph-Aware Alignment Algorithms
  • Certain alignment algorithms have been developed that are able to provide accurate and relatively fast local alignments when used with DAGs or other graph data structures. For example, a Graph Smith-Waterman algorithm as described below calculates a Smith-Waterman score matrix between the sequence read and each node in a DAG, such that the maximum scores are propagated across the matrices for each node. In this way, the linear portions of the DAG with the highest scores in relation to the sequence read are serialized and chained together. One need only look back through the chained Smith-Waterman matrices to identify both the optimal alignment and path of the DAG.
  • In some embodiments, the methods and systems of the invention use a modified Smith-Waterman operation that involves a multi-dimensional look-back through the reference DAG (such as the reference DAG 331 of FIG. 3). Multi-dimensional operations of the invention provide for a “look-back” type analysis of sequence information (as in Smith-Waterman), wherein the look back is conducted through a multi-dimensional space that includes multiple pathways and multiple nodes. The multi-dimensional algorithm can be used to align sequence reads against the graph-type reference. That alignment algorithm identifies the maximum value for Ci,j by identifying the maximum score with respect to each sequence contained at a position on the graph. In fact, by looking “backwards” at the preceding positions, it is possible to identify the optimum alignment across a plurality of possible paths.
  • The modified Smith-Waterman operation described here, aka the multi-dimensional alignment, provides exceptional speed when performed in a genomic graph system that employs physical memory addressing (e.g., through the use of native pointers or index free adjacency as discussed above). The combination of multi-dimensional alignment to the graph of reference DAG 331 with the use of spatial memory addresses (e.g., native pointers or index-free adjacency) improves what the computer system is capable of, facilitating whole genomic scale analysis to be performed using the methods described herein.
  • The operation includes aligning a sequence, or string, to a graph. For the purpose of defining the algorithm, let S be the string being aligned, and let D be the directed graph to which S is being aligned. The elements of the string, S, are bracketed with indices beginning at 1. Thus, if S is the string ATCGAA, S[1]=A, S[4]=G, etc.
  • In certain embodiments, for the graph, each letter of the sequence of a node will be represented as a separate element, d. In a preferred embodiment, node or edge objects contain the sequences and the sequences are stored as the longest-possible string in each object. A predecessor of d is defined as:
  • (i) If d is not the first letter of the sequence of its object, the letter preceding d in its object is its (only) predecessor;
  • (ii) If d is the first letter of the sequence of its object, the last letter of the sequence of any object that is a parent of d's object is a predecessor of d.
  • The set of all predecessors is, in turn, represented as P[d].
  • In order to find the “best” alignment, the algorithm seeks the value of M[j,d], the score of the optimal alignment of the first j elements of S with the portion of the graph preceding (and including) d. This step is similar to finding Hi,j in equation 1 above. Specifically, determining M[j,d] involves finding the maximum of a, i, e, and 0, as defined below:

  • M[j,d]=max{a,i,e,0}  (6)
  • where
  • e=max{M[j, p*]+DELETE_PEN} for p* in P[d]
  • i=M[j−1, d]+INSERT_PEN
  • a=max{M[j−1, p*]+MATCH_SCORE} for p* in P[d], if S[j]=d;
  • max{M[j−1, p*]+MISMATCH_PEN} for p* in P[d], if S[j]≠d
  • As described above, e is the highest of the alignments of the first j characters of S with the portions of the graph up to, but not including, d, plus an additional DELETE_PEN. Accordingly, if d is not the first letter of the sequence of the object, then there is only one predecessor, p, and the alignment score of the first j characters of S with the graph (up-to-and-including p) is equivalent to M[j,p]+DELETE_PEN. In the instance where d is the first letter of the sequence of its object, there can be multiple possible predecessors, and because the DELETE_PEN is constant, maximizing [M[j, p*]+DELETE_PEN] is the same as choosing the predecessor with the highest alignment score with the first j characters of S.
  • In equation (6), i is the alignment of the first j−1 characters of the string S with the graph up-to-and-including d, plus an INSERT_PEN, which is similar to the definition of the insertion argument in SW (see equation 1).
  • Additionally, a is the highest of the alignments of the first j characters of S with the portions of the graph up to, but not including d, plus either a MATCH_SCORE (if the jth character of S is the same as the character d) or a MISMATCH_PEN (if the jth character of S is not the same as the character d). As with e, this means that if d is not the first letter of the sequence of its object, then there is only one predecessor, i.e., p. That means a is the alignment score of the first j−1 characters of S with the graph (up-to-and-including p), i.e., M[j−1,p], with either a MISMATCH_PEN or MATCH_SCORE added, depending upon whether d and the jth character of S match. In the instance where d is the first letter of the sequence of its object, there can be multiple possible predecessors. In this case, maximizing {M[j,p*]+MISMATCH_PEN or MATCH_SCORE} is the same as choosing the predecessor with the highest alignment score with the first j−1 characters of S (i.e., the highest of the candidate M[j−1,p*] arguments) and adding either a MISMATCH_PEN or a MATCH_SCORE depending on whether d and the jth character of S match.
  • Again, as in the SW algorithm, the penalties, e.g., DELETE_PEN, INSERT_PEN, MATCH_SCORE and MISMATCH_PEN, can be adjusted to encourage alignment with fewer gaps, etc.
  • As described in the equations above, the operation finds the optimal (e.g., maximum) value for a sequence read 103 to the reference DAG 331 by calculating not only the insertion, deletion, and match scores for that element, but looking backward (against the direction of the graph) to any prior nodes on the graph to find a maximum score.
  • FIG. 3 shows matrices 301 that represent the comparison. The modified Smith-Waterman operation of the invention identifies the highest score and performs a backtrack to identify the proper alignment of the sequence. See, e.g., U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613, both incorporated by reference. Systems and methods of the invention can be used to provide a report that identifies a modified base at the position within the genome of the subject. Other information may be found in Kehr et al., 2014, Genome alignment with graph data structures: a comparison, BMC Bioinformatics 15:99 and Lee, 2003, Generating consensus sequences from partial order multiple sequence alignment graphs, Bioinformatics 19(8):999-1008, both incorporated by reference. Thus, identifying an alignment for the genomic information from the sample (e.g., the sequence read 103 of the embodiment of FIG. 4) may include aligning 129 (e.g., using the global alignment 106 or the local alignment 108) one or more of the sequence reads 103 to the reference DAG 331, as shown in FIG. 4. The branches to which the sequence reads 103 have been aligned are identified, along with the corresponding regions within the reference genome.
  • Graph Smith-Waterman is able to achieve similar computational speed as Smith-Waterman, yet does not suffer from the combinatorial explosion associated with linearizing complex regions of DAGs. In contrast, the performance of Graph Smith-Waterman scales linearly as the complexity of the DAG increases. Additionally, like Smith-Waterman, Graph Smith-Waterman is an exact alignment algorithm and thus is able to identify optimal local alignments that include insertions and deletions. However, when applied to a linear reference sequence or a DAG having low complexity, Graph Smith-Waterman can be slow compared to pattern matching algorithms.
  • FIG. 4 illustrates finding 129 alignments 401 between the sequence 103 (or a consensus sequence representing an assembly or one or more sequence reads 103) and the reference DAG 331. Alignments can be performed directly against the DAG using the Graph Smith-Waterman algorithm, for example; however, as will be discussed below, alignments may also be performed against the DAG using linear alignment or pattern matching algorithms by first linearizing the DAG.
  • FIG. 4 also illustrates an optional variant calling 807 step to identify a subject genotype 831. Using alignment operations of the invention, reads can be rapidly mapped to the reference DAG 331 despite their large numbers or short lengths. Numerous benefits obtain by using a graph as a reference. For example, aligning against a graph is more accurate than aligning against a linear reference and then attempting to adjust one's results in light of other extrinsic information. This is primarily because the latter approach enforces an unnatural asymmetry between the sequence used in the initial alignment and other information. Aligning against an object that potentially represents all the relevant physical possibilities is much more computationally efficient than attempting to align against a linear sequence for each physical possibility (the number of such possibilities will generally be exponential in the number of junctions).
  • Many implementations provide methods that rely on a directed acyclic data structure with the data in an annotated transcriptome. In such a data structure, features (e.g., exons and introns) from the annotated transcriptome are represented as nodes, which are connected by edges. An isoform from the real-world transcriptome is thus represented by a corresponding path through the DAG-based data structure.
  • Aspects of the described methods relate to the creation or use of a DAG that includes features such as introns and exons from one or more known references. A DAG is understood in the art to refer to data that can be presented as a graph as well as to a graph that presents that data. The DAG can be stored as data that can be read by a computer system for bioinformatic processing or for presentation as a graph. A DAG can be saved in any suitable format including, for example, a list of nodes and edges, a matrix or a table representing a matrix, an array of arrays or similar variable structure representing a matrix, in a language built with syntax for graphs, in a general markup language purposed for a graph, or others. Further, while the DAG is a directed acyclic data structure, in certain embodiments, a reference graph may lack direction or not be acyclic. Various embodiments are considered to be within the scope of the disclosure.
  • DAG Representation and Storage
  • FIG. 5 shows the use of an adjacency list 502 for each vertex 505 and edge 509. A computer system 901 (shown in FIG. 10) creates the reference DAG 531 using an adjacency list 502 for each vertex and edge, wherein the adjacency list 502 for a vertex 505 or edge 509 lists the edges or vertices to which that vertex or edge is adjacent. Each entry in adjacency list 502 is a pointer to the adjacent vertex or edge.
  • Preferably, each pointer identifies a physical location in the memory subsystem at which the adjacent object is stored. In the preferred embodiments, the pointer or native pointer is manipulatable as a memory address in that it points to a physical location on the memory and permits access to the intended data by means of pointer dereference. That is, a pointer is a reference to a datum stored somewhere in memory; to obtain that datum is to dereference the pointer. The feature that separates pointers from other kinds of reference is that a pointer's value is interpreted as a memory address, at a low-level or hardware level. The speed and efficiency of the described graph genome engine allows a sequence to be queried against a large-scale genomic reference graph (e.g., the reference DAG 531) representing millions or billions of bases, using the computer system 901. Such a graph representation provides means for fast random access, modification, and data retrieval.
  • In some embodiments, fast random access is supported and graph object storage are implemented with index-free adjacency in that every element contains a direct pointer to its adjacent elements (e.g., as described in U.S. Pub. 2014/0280360 and U.S. Pub. 2014/0278590, each incorporated by reference), which obviates the need for index look-ups, allowing traversals (e.g., as done in the modified SW alignment operation described herein) to be very rapid. Index-free adjacency is another example of low-level, or hardware-level, memory referencing for data retrieval (as required in alignment and as particularly pays off in terms of speed gains in the modified, multi-dimensional Smith-Waterman alignment described below). Specifically, index-free adjacency can be implemented such that the pointers contained within elements are references to a physical location in memory.
  • Since a technological implementation that uses physical memory addressing such as native pointers can access and use data in such a lightweight fashion without the requirement of separate index tables or other intervening lookup steps, the capabilities of a given computer, e.g., any modern consumer-grade desktop computer, are extended to allow for full operation of a genomic-scale graph (i.e., the reference DAG 531 as a graph that represents all loci in a substantial portion of the reference genome). Thus storing graph elements (e.g., nodes and edges) using a library of objects with native pointers or other implementation that provides index-free adjacency actually improves the ability of the technology to provide storage, retrieval, and alignment for genomic information since it uses the physical memory of a computer in a particular way.
  • While no specific format is required for storage of a graph, FIG. 5 is presented to illustrate a useful format. With reference back to FIG. 1, it is noted that methods of the invention use the stored graph (e.g., the reference genome) with sequence reads that are obtained from a subject. In some embodiments, sequence reads or obtained as an electronic article, e.g., uploaded, emailed, or FTP transferred from a lab to the computer system 901. In certain embodiments, sequence reads are obtained by sequencing.
  • In some embodiments, a DAG is stored as a list of nodes and edges. One such way is to create a text file that includes all nodes, with an ID assigned to each node, and all edges, each with the node ID of starting and ending node. Thus, for example, were a DAG to be created for two sentences, “See Jane run,” and “Run, Jane run,”, a case-insensitive text file could be created. Any suitable format could be used. For example, the text file could include comma-separated values. Naming this DAG “Jane” for future reference, in this format, the DAG “Jane” may read as follows: 1 see, 2 run, 3 jane, 4 run, 1-3, 2-3, 3-4. One of skill in the art will appreciate that this structure is applicable to the introns and exons represented in FIGS. 7 and 8, and the accompanying discussion below.
  • In certain embodiments, a DAG is stored as a table representing a matrix (or an array of arrays or similar variable structure representing a matrix) in which the (i,j) entry in the N×N matrix denotes that node i and node j are connected (where N is a vector containing the nodes in genomic order). For the DAG to be acyclic simply requires that all non-zero entries be above the diagonal (assuming nodes are represented in genomic order). In a binary case, a 0 entry represents that no edge is exists from node i to node j, and a 1 entry represents an edge from i to j. One of skill in the art will appreciate that a matrix structure allows values other than 0 to 1 to be associated with an edge. For example, any entry may be a numerical value indicating a weight, or a number of times used, reflecting some natural quality of observed data in the world. A matrix can be written to a text file as a table or a linear series of rows (e.g., row 1 first, followed by a separator, etc.), thus providing a simple serialization structure.
  • Embodiments of the invention include storing a DAG in a language built with syntax for graphs. For example, The DOT Language provided with the graph visualization software packages known as Graphviz provides a data structure that can be used to store a DAG with auxiliary information and that can be converted into graphic file formats using a number of tools available from the Graphviz web site. Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has applications in networking, bioinformatics, software engineering, database and web design, machine learning, and in visual interfaces for other technical domains. The Graphviz layout programs take descriptions of graphs in a simple text language, and make diagrams in useful formats, such as images and SVG for web pages; PDF or Postscript for inclusion in other documents; or display in an interactive graph browser.
  • In related embodiments, a DAG is stored in a general markup language purposed for a graph, or others. Following the descriptions of a linear text file, or a comma-separated matrix, above, one of skill in the art will recognize that a language such as XML can be used (extended) to create labels (markup) defining nodes and their headers or IDs, edges, weights, etc. However a DAG is structured and stored, embodiments of the invention involve using nodes to represent features such as exons and introns. This provides a useful tool for analyzing sequence reads and discovering, identifying, and representing isoforms.
  • If nodes represent features or fragments of features, then isoforms can be represented by paths through those fragments. An exon's being “skipped” is represented by an edge connecting some previous exon to some later one. Presented herein are techniques for constructing a DAG to represent alternative splicing or isoforms of genes. Alternative splicing is discussed in Lee and Wang, 2005, Bioinformatics analysis of alternative splicing, Brief Bioinf 6(1):23-33; Heber, et al., 2002, Splicing graphs and EST assembly problems, Bioinformatics 18Suppl:s181-188; Leipzig, et al., 2004, The alternative splicing gallery (ASG): Bridging the gap between genome and transcriptome, Nucl Ac Res 23(13):3977-2983; and LeGault and Dewey, 2013, Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs, Bioinformatics 29(18):2300-2310, the contents of each of which are incorporated by reference. Additional discussion may be found in Florea, et al., 2005, Gene and alternative splicing annotation with AIR, Genome Research 15:54-66; Kim, et al., 2005, ECgene: Genome-based EST clustering and gene modeling for alternative splicing, Genome Research 15:566-576; and Xing, et al., 2006, An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs, Nucleic Acids Research, 34, 3150-3160, the contents of each of which are incorporated by reference.
  • Linearizing a DAG for a Linear Alignment Tool
  • As previously noted, a DAG is a multi-dimensional structure and not immediately amenable to alignment with a linear alignment tool. To use an optimal alignment algorithm or pattern matching algorithm, a reference DAG can be linearized into a format appropriate for the algorithm. In certain embodiments, this can be performed by enumerating each of the possible paths in the subsection of a reference DAG. FIG. 6 illustrates an exemplary DAG 601 representing a subsection of the reference DAG for a candidate mapping position (e.g., a local alignment position). To use a linear alignment algorithm with such a DAG, a depth-first search can be performed by starting at the first node and taking the left-most path. Once the end of the graph is reached, the search returns to the previous node with an untaken path and attempts to take the left-most path again. Each path is associated with a sequence represented by the nodes and edges in the path. The sequence read can then be compared against each of the paths. For example, a depth-first search would identify the following paths and associated sequences from the subsection of the DAG 661 shown in the embodiment of FIG. 6:
  • (SEQ ID NO: 1)
    1. ATCGACGGCGTTTGCAT
    (SEQ ID NO: 2)
    2. ATCGACGGCGTTTAGCAT (insertion of A at pos. 13)
    (SEQ ID NO: 3)
    3. ATCGACGACGTTTGCAT (G−>A polymorphism at pos. 8)
    (SEQ ID NO: 4)
    4. ATCGACGACGTTTAGCAT (G−>A polymorphism at pos.
    8 + insertion of A at pos. 13)

    To perform a local alignment of the sequence read GTTTAG against this subsection of the reference DAG, these four sequences could be provided as individual reference sequences to a linear alignment or pattern matching algorithm. The linear alignment or pattern matching algorithm would subsequently identify a best-fit alignment for the sequence read with respect to the four sequences. In this way, systems and methods according to the disclosure may enumerate the possible paths for a reference DAG or subsection of a reference DAG, identify a sequence associated with each possible path, and provide each sequence to an alignment tool (such as those disclosed herein) for alignment with a sequence read.
  • In another embodiment, the subsection of the reference DAG could be linearized by walking the graph using a sliding window of n bases (wherein n is the length of the sequence read) and performing a comparison of the sequence read to the reference sequence in the window. (Of course, in certain embodiments, the size of the window can be increased to accommodate the possibility of small insertions and deletions in the sequence read.) The walk can similarly be performed using a depth-first search. For example, for the sequence read GTTTAG and subsection of the reference DAG 601 in the embodiment of FIG. 6, a local alignment could proceed as shown in Table 1.
  • TABLE 1
    Depth-First Search Results using a sliding window of n bases and performing a
    comparison of the sequence read to the reference sequence in the Window.
    # Path Sequences Read Comparison Note
     1 ATCGAC GTTTAG No Match Start of Graph
     2 TCGACG GTTTAG No Match Includes “G” Polymorphism
     3 CGACGG GTTTAG No Match Includes “G” Polymorphism
     4 GACGGC GTTTAG No Match Includes “G” Polymorphism
     5 ACGGCG GTTTAG No Match Includes “G” Polymorphism
     6 CGGCGT GTTTAG No Match Includes “G” Polymorphism
     7 GGCGTT GTTTAG No Match Includes “G” Polymorphism
     8 GCGTTT GTTTAG No Match Includes “G” Polymorphism
     9 CGTTTG GTTTAG No Match
    10 GTTTGC GTTTAG No Match
    11 TTTGCA GTTTAG No Match
    12 TTGCAT GTTTAG No Match End of Graph
    13 CGTTTA GTTTAG No Match Previous Unfollowed Path
    Includes “A” Insertion
    14 GTTTAG GTTTAG Match Includes “A” Insertion
    15 TTTAGC GTTTAG No Match Includes “A” Insertion
    16 TTAGCA GTTTAG No Match Includes “A” Insertion
    17 TAGCAT GTTTAG No Match Includes “A” Insertion
    End of Graph
    18 CGACGA GTTTAG No Match Previous Unfollowed Path
    Includes “A” Polymorphism
    19 GACGAC GTTTAG No Match Includes “A” Polymorphism
    20 ACGACG GTTTAG No Match Includes “A” Polymorphism
    21 CGACGT GTTTAG No Match Includes “A” Polymorphism
    22 GACGTT GTTTAG No Match Includes “A” Polymorphism
    23 ACGTTT GTTTAG No Match Includes “A” Polymorphism
    No remaining paths to follow
  • Once the walk has completed, the highest scoring matches can be identified to obtain the locally aligned position for the sequence read. As shown in Table 1, a match occurs at the 14th comparison. While a depth-first search is described for traversing the graph, various other searching algorithms or techniques may be used. For example, a breadth-first search could be substituted. Similarly, one could also employ an algorithm following the right-most path, or any previously unfollowed path. Various embodiments are considered to be within the scope of the disclosure.
  • As previously noted, pattern matching algorithms are particularly well suited for aligning sequence reads to DAGs. As known variations are already present in the graph, the pattern matching algorithm does not need to consider the presence of indels or other variations within a sequence read. Instead, the pattern matching algorithm simply considers each path available through the graph, such as in the manner described above. However, as noted above, while pattern matching algorithms are fast, the performance of such an algorithm (and other local alignment algorithms) can be affected based on the complexity of the DAG surrounding the local alignment position.
  • Considering DAG Complexity to Select a Local Alignment Tool
  • The complexity of a DAG at a given region indicates the amount of variation in the reference genome at that position. The complexity of a DAG can be determined in a variety of ways. For example, the complexity of a DAG can be the number of nodes, number of vertices, number of branches, number of possible paths, and the like. Complexity may be determined for a subsection or region of the DAG, or for the entire DAG.
  • For example, a DAG 701 shown in FIG. 7 includes information for a single nucleotide polymorphism (C or D) and an insertion (EG or EFG). There are a total of seven nodes and four paths through the DAG. In another example, a DAG 801 shown in FIG. 8 indicates a highly variable region of the genome that includes information regarding various structural variations, SNPs, and small insertions and deletions. In this DAG 801, there are 10 nodes and 21 possible paths.
  • If there are few paths (e.g., 10 or less, 9 or less, 8 or less, 7 or less, 6 or less, 5 or less, 4 or less, 3 or less 2 or less) through the DAG, then the DAG complexity is low. In these cases, a local linear sequence alignment or pattern matching algorithm may be faster and more efficient than a graph aware algorithm. In these cases, the additional number of paths that can be tested by the algorithm does not significantly impact the advantages in speed gained by using a pattern matching algorithm (e.g., by using a shift table to avoid unnecessary computations). Further, the required number of comparisons can be performed in less processing time than by generating a plurality of matrices for each node, as required by Graph Smith-Waterman. However, if the complexity is high (e.g., 10 or more paths, 10-15 paths, 15 or more paths, 15-20 paths, or 20 or more paths) then the number of paths that must be tested quickly increases. In these cases, Graph Smith-Waterman may be more efficient than a linear sequence alignment or pattern matching algorithm because it does not suffer from the combinatorial explosion problem associated with linearizing complex DAGs.
  • The relationship between the alignment processing time and local DAG complexity is shown in FIG. 9 as a graph 900. Graph 900 illustrates the relationship between processing time and DAG complexity for a pattern matching algorithm (such as Tarhio-Ukkonen) and a graph aware algorithm (such as Graph Smith-Waterman). As shown, at a level of complexity C, performance of Tarhio-Ukkonen and Graph Smith-Waterman is equal and either algorithm may be used without a deviation in performance. Above C, Tarhio-Ukkonen becomes more inefficient due to the large number of paths that must be individually tested, and Graph Smith-Waterman may be preferred. In contrast, below C, Tarhio-Ukkonen is more efficient and requires less processing time.
  • Even though performance of either algorithm may be comparable at a level of complexity C, other considerations may dictate usage of either algorithm depending on the circumstances. For example, in certain instances, switching local alignment algorithms at a level of complexity C′ may be preferable instead of C. As discussed elsewhere, Graph Smith-Waterman is an exact alignment algorithm and thus able to identify gapped alignments between a sequence read and a reference. Thus, even if an approximate pattern-matching algorithm would execute in less time, it may still be preferable to choose Graph Smith-Waterman because the resulting alignment can be superior. In these cases, the ideal level of complexity at which the method chooses either algorithm is based on the acceptable timeliness of performance. As shown in FIG. 9, above C, Graph Smith-Waterman is may be preferable. However, below C, one may also decide to use Graph Smith-Waterman. The consideration is the tradeoff between processing time and improved alignments. Given the scales typically associated with next generation sequencing (e.g., to cover the human genome at 30× coverage, approximately 600 million paired-end 100 bp sequence reads are required), a processing method may account for a level of complexity in which the processing time is reasonable, yet the alignments are able to yield useful information regarding unknown variants.
  • Price may also be a consideration. For example, cloud computing services such as Amazon AWS and Microsoft Azure offer cloud-based computational resources at a cost. When using these services to align next generation sequence reads against DAGs, one may decide to switch to Tarhio-Ukkonen only when its performance is less than Graph Smith-Waterman in order to reduce processing time and increase cost efficiency.
  • In certain embodiments, a sequence read could be realigned depending on the resulting quality of the local alignment. For example, if a sequence read has an unknown variation (such as an insertion or deletion), and the complexity of a DAG at a candidate mapping position is low and a pattern matching algorithm is used, the resulting alignment may have low quality. This is because pattern matching algorithms, while fast, may not gracefully handle unknown variations. In these cases, an optimal alignment algorithm (such as Smith Waterman) could be subsequently performed to accurately identify the variation and improve the local alignment. Similarly, in high entropy or high variability areas of the genome, an optimal alignment algorithm could be substituted for a pattern matching algorithm to improve the quality of the alignment. The quality of an alignment may be expressed according to the Phred scale, for example.
  • Additionally, in certain embodiments, more than two alignment algorithms may be selected based on the complexity. For example, one may consider using a pattern matching algorithm, an optimal alignment algorithm, and a graph-aware algorithm as complexity increases. In this way, systems and methods according to the invention may select from two, three, or more local alignment algorithms to significantly improve the performance of next-generation sequencing alignment using graph data structures.
  • Exemplary System
  • FIG. 10 illustrates a computer system 901 suitable for performing methods of the invention. The computer system 901 includes at least one computer 633. Optionally, the computer system 901 may further include one or more of a server computer 909 and a sequencer 955, which may be coupled to a sequencer computer 951. Each computer in the computer system 901 includes a processor 935, 936, or 937 coupled to a memory device and at least one input/output device. Thus the computer system 901 includes at least one processor 935 coupled to a memory subsystem 975 (e.g., a memory device or collection of memory devices). Using those mechanical components, the computer system 901 is operable to obtain a sequence generated by sequencing nucleic acid from a genome of a patient. The system uses the processor to transform the sequence read 103 information into the reference DAG 331.
  • Processor refers to any device or system of devices that performs processing operations. A processor will generally include a chip, such as a single core or multi-core chip, to provide a central processing unit (CPU). A processor may be provided by a chip from Intel or AMD. A processor may be any suitable processor such as the microprocessor sold under the trademark XEON E7 by Intel (Santa Clara, Calif.) or the microprocessor sold under the trademark OPTERON 6200 by AMD (Sunnyvale, Calif.).
  • The memory subsystem 975 contains one or any combination of memory devices. A memory device is a mechanical device that stores data or instructions in a machine-readable format. Memory may include one or more sets of instructions (e.g., software) which, when executed by one or more of the processors of the disclosed computers can accomplish some or all of the methods or functions described herein. Preferably, each computer includes a non-transitory memory device such as a solid state drive, flash drive, disk drive, hard drive, subscriber identity module (SIM) card, secure digital card (SD card), micro SD card, or solid-state drive (SSD), optical and magnetic media, others, or a combination thereof.
  • Using the described components, the computer system 901 is operable to produce a report and provide the report to a user via an input/output device. An input/output device is a mechanism or system for transferring data into or out of a computer. Exemplary input/output devices include a video display unit (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), a printer, an alphanumeric input device (e.g., a keyboard), a cursor control device (e.g., a mouse), a disk drive unit, a speaker, a touchscreen, an accelerometer, a microphone, a cellular radio frequency antenna, and a network interface device, which can be, for example, a network interface card (NIC), Wi-Fi card, or cellular modem.
  • Preferably the reference DAG is stored in the memory subsystem using adjacency techniques, which may include pointers to identify a physical location in the memory subsystem 675 where each vertex is stored. In a preferred embodiment, the graph is stored in the memory subsystem 675 using adjacency lists. In some embodiments, there is an adjacency list for each vertex. For discussion of implementations see ‘Chapter 4, Graphs’ at pages 515-693 of Sedgewick and Wayne, 2011, Algorithms, 4th Ed., Pearson Education, Inc., Upper Saddle River N.J., 955 pages, the contents of which are incorporated by reference and within which pages 524-527 illustrate adjacency lists.
  • INCORPORATION BY REFERENCE
  • References and citations to other documents, such as patents, patent applications, patent publications, journals, books, papers, web contents, have been made throughout this disclosure. All such documents are hereby incorporated herein by reference in their entirety for all purposes.
  • EQUIVALENTS
  • Various modifications of the invention and many further embodiments thereof, in addition to those shown and described herein, will become apparent to those skilled in the art from the full contents of this document, including references to the scientific and patent literature cited herein. The subject matter herein contains important information, exemplification and guidance that can be adapted to the practice of this invention in its various embodiments and equivalents thereof.
  • Other implementations are within the scope of the following claims.

Claims (20)

What is claimed is:
1. A method, comprising:
obtaining a sequence read including genetic information;
identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes;
determining, by means of a computer system, whether an alignment with the graph surrounding each of the plurality of candidate mapping positions is advanced or basic;
performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic, wherein:
the advanced local alignment includes a first-local-alignment algorithm, and
the basic local alignment includes a second-local-alignment algorithm; and
based on the local alignments, identifying an optimal mapped position of the sequence read within the reference genome.
2. The method of claim 1, wherein the first-local-alignment algorithm is different from the second-local-alignment algorithm.
3. The method of claim 1, further comprising determining whether the local alignment is advanced or basic based on at least one of: a length of the graph, a variability of the graph, a total processing time, a remaining processing time, and a number of repeating elements.
4. The method of claim 1, further comprising determining whether the local alignment is advanced or basic based, at least in part, on a complexity of the graph.
5. The method of claim 4, further comprising determining whether the local alignment is advanced or basic based, at least in part, on a complexity of a subset of the graph surrounding each candidate mapping position.
6. The method of claim 5, wherein the local alignment is advanced if the complexity of a subset of the graph surrounding a candidate mapping position is 10 or more nodes.
7. The method of claim 5, wherein the local alignment is basic if the complexity of a subset of the graph surrounding a candidate mapping position is 5 or fewer nodes.
8. The method of claim 1, wherein the second-local-alignment algorithm comprises a pattern matching algorithm.
9. The method of claim 8, wherein the pattern matching algorithm is selected from the group consisting of: a Boyer-Moore algorithm, a Horspool algorithm, and a Tarhio-Ukkonen algorithm.
10. The method of claim 1, wherein performing a basic local alignment comprises:
linearizing a subset of the graph surrounding each candidate mapping position into a plurality of linear sequences, and
performing a basic local alignment of the sequence read against each of the plurality of linear sequences using the second-local-alignment algorithm.
11. The method of claim 10, wherein linearizing a subset of the graph surrounding each candidate mapping position into a plurality of linear sequences comprises enumerating the number of unique paths through the subset of the graph, and associating a linear sequence with each enumerated path.
12. The method of claim 10, wherein linearizing a subset of the graph surrounding each candidate mapping position into a plurality of linear sequences comprises performing a depth first search of the subset of the graph.
13. The method of claim 1, wherein the first-local-alignment algorithm comprises a graph aware algorithm.
14. The method of claim 13, wherein the graph aware algorithm is a modified Smith Waterman algorithm.
15. The method of claim 1, further comprising ranking each of the candidate mapping positions based on a quality of the local alignment.
16. The method of claim 15, further comprising re-aligning the sequence read using a third-local-alignment algorithm if the quality of the highest ranking local alignment is low.
17. A system for determining a subject's genetic information, the system comprising:
a computer system comprising a processor coupled to memory and operable to:
receive identities of a plurality of nucleotides at known locations on a reference genome;
receive sequence reads from a sample from a subject; and
map the sequence reads to the reference genome, thereby identifying a corresponding location on the reference genome, the mapping comprising:
identifying, within a graph representing a reference genome, a plurality of candidate mapping positions that relate to the genetic information, the graph comprising nodes representing genetic sequences and edges connecting pairs of nodes;
determining, by means of a computer system, whether an alignment with the graph surrounding each of the identified plurality of candidate mapping positions is advanced or basic;
performing for each candidate mapping position, by means of the computer system, a local alignment based on whether the local alignment is advanced or basic, wherein:
the advanced local alignment includes a first-local-alignment algorithm, and
the basic local alignment includes a second-local-alignment algorithm; and
based on the local alignments, identify the mapped position of the sequence read within the reference genome.
18. The system of claim 17, wherein the computer system is further operable to determine whether the local alignment is advanced or basic based, at least in part, on a complexity of a subset of the graph surrounding each candidate mapping position.
19. The system of claim 17, wherein the first-local-alignment algorithm comprises a graph aware alignment algorithm, and the second-local-alignment algorithm comprises a linear alignment algorithm.
20. The system of claim 17, wherein performing a basic local alignment comprises:
linearizing a subset of the graph surrounding each candidate mapping position into a plurality of linear sequences, and
performing a basic local alignment of the sequence read against each of the plurality of linear sequences using the second-local-alignment algorithm.
US14/990,323 2016-01-07 2016-01-07 Systems and methods for adaptive local alignment for graph genomes Abandoned US20170199960A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US14/990,323 US20170199960A1 (en) 2016-01-07 2016-01-07 Systems and methods for adaptive local alignment for graph genomes
PCT/US2017/012015 WO2017120128A1 (en) 2016-01-07 2017-01-03 Systems and methods for adaptive local alignment for graph genomes
US16/663,243 US11810648B2 (en) 2016-01-07 2019-10-24 Systems and methods for adaptive local alignment for graph genomes
US18/464,597 US20240096450A1 (en) 2016-01-07 2023-09-11 Systems and methods for adaptive local alignment for graph genomes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US14/990,323 US20170199960A1 (en) 2016-01-07 2016-01-07 Systems and methods for adaptive local alignment for graph genomes

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/663,243 Continuation US11810648B2 (en) 2016-01-07 2019-10-24 Systems and methods for adaptive local alignment for graph genomes

Publications (1)

Publication Number Publication Date
US20170199960A1 true US20170199960A1 (en) 2017-07-13

Family

ID=57890907

Family Applications (3)

Application Number Title Priority Date Filing Date
US14/990,323 Abandoned US20170199960A1 (en) 2016-01-07 2016-01-07 Systems and methods for adaptive local alignment for graph genomes
US16/663,243 Active 2038-08-23 US11810648B2 (en) 2016-01-07 2019-10-24 Systems and methods for adaptive local alignment for graph genomes
US18/464,597 Pending US20240096450A1 (en) 2016-01-07 2023-09-11 Systems and methods for adaptive local alignment for graph genomes

Family Applications After (2)

Application Number Title Priority Date Filing Date
US16/663,243 Active 2038-08-23 US11810648B2 (en) 2016-01-07 2019-10-24 Systems and methods for adaptive local alignment for graph genomes
US18/464,597 Pending US20240096450A1 (en) 2016-01-07 2023-09-11 Systems and methods for adaptive local alignment for graph genomes

Country Status (2)

Country Link
US (3) US20170199960A1 (en)
WO (1) WO2017120128A1 (en)

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10053736B2 (en) 2013-10-18 2018-08-21 Seven Bridges Genomics Inc. Methods and systems for identifying disease-induced mutations
US10055539B2 (en) 2013-10-21 2018-08-21 Seven Bridges Genomics Inc. Systems and methods for using paired-end data in directed acyclic structure
US10078724B2 (en) 2013-10-18 2018-09-18 Seven Bridges Genomics Inc. Methods and systems for genotyping genetic samples
US10192026B2 (en) 2015-03-05 2019-01-29 Seven Bridges Genomics Inc. Systems and methods for genomic pattern analysis
US10262102B2 (en) 2016-02-24 2019-04-16 Seven Bridges Genomics Inc. Systems and methods for genotyping with graph reference
US10319465B2 (en) 2016-11-16 2019-06-11 Seven Bridges Genomics Inc. Systems and methods for aligning sequences to graph references
US10325675B2 (en) 2013-08-21 2019-06-18 Seven Bridges Genomics Inc. Methods and systems for detecting sequence variants
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
WO2019226976A1 (en) * 2018-05-25 2019-11-28 New York Institute Of Technology Method and system for use in direct sequencing of rna
US10540398B2 (en) * 2017-04-24 2020-01-21 Oracle International Corporation Multi-source breadth-first search (MS-BFS) technique and graph processing system that applies it
US10584380B2 (en) 2015-09-01 2020-03-10 Seven Bridges Genomics Inc. Systems and methods for mitochondrial analysis
WO2020123906A1 (en) * 2018-12-13 2020-06-18 The General Hospital Corporation Biologically informed and accurate sequence alignment
US10724110B2 (en) 2015-09-01 2020-07-28 Seven Bridges Genomics Inc. Systems and methods for analyzing viral nucleic acids
US10726110B2 (en) 2017-03-01 2020-07-28 Seven Bridges Genomics, Inc. Watermarking for data security in bioinformatic sequence analysis
US10790044B2 (en) 2016-05-19 2020-09-29 Seven Bridges Genomics Inc. Systems and methods for sequence encoding, storage, and compression
US10793895B2 (en) 2015-08-24 2020-10-06 Seven Bridges Genomics Inc. Systems and methods for epigenetic analysis
CN111798923A (en) * 2019-05-24 2020-10-20 中国科学院计算技术研究所 Fine-grained load characteristic analysis method and device for gene comparison and storage medium
US10832797B2 (en) 2013-10-18 2020-11-10 Seven Bridges Genomics Inc. Method and system for quantifying sequence alignment
US10867693B2 (en) 2014-01-10 2020-12-15 Seven Bridges Genomics Inc. Systems and methods for use of known alleles in read mapping
US10878938B2 (en) 2014-02-11 2020-12-29 Seven Bridges Genomics Inc. Systems and methods for analyzing sequence data
CN112236824A (en) * 2018-05-31 2021-01-15 皇家飞利浦有限公司 Systems and methods for allelic interpretation using map-based reference genomes
WO2021058683A1 (en) * 2019-09-25 2021-04-01 Koninklijke Philips N.V. Variant calling for multi-sample variation graph
US11049587B2 (en) 2013-10-18 2021-06-29 Seven Bridges Genomics Inc. Methods and systems for aligning sequences in the presence of repeating elements
US11211146B2 (en) 2013-08-21 2021-12-28 Seven Bridges Genomics Inc. Methods and systems for aligning sequences
US11250931B2 (en) 2016-09-01 2022-02-15 Seven Bridges Genomics Inc. Systems and methods for detecting recombination
US11289177B2 (en) 2016-08-08 2022-03-29 Seven Bridges Genomics, Inc. Computer method and system of identifying genomic mutations using graph-based local assembly
US11347704B2 (en) 2015-10-16 2022-05-31 Seven Bridges Genomics Inc. Biological graph or sequence serialization
US11347844B2 (en) 2017-03-01 2022-05-31 Seven Bridges Genomics, Inc. Data security in bioinformatic sequence analysis
US11435283B2 (en) 2018-04-17 2022-09-06 Sharif University Of Technology Optically detecting mutations in a sequence of DNA
US11560598B2 (en) 2016-01-13 2023-01-24 Seven Bridges Genomics Inc. Systems and methods for analyzing circulating tumor DNA
WO2023170091A1 (en) 2022-03-09 2023-09-14 Politecnico Di Milano Methods for the alignment of sequence reads to non-acyclic genome graphs on heterogeneous computing systems
US11810648B2 (en) 2016-01-07 2023-11-07 Seven Bridges Genomics Inc. Systems and methods for adaptive local alignment for graph genomes

Family Cites Families (108)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5583024A (en) 1985-12-02 1996-12-10 The Regents Of The University Of California Recombinant expression of Coleoptera luciferase
US5511158A (en) 1994-08-04 1996-04-23 Thinking Machines Corporation System and method for creating and evolving directed graphs
US5750341A (en) 1995-04-17 1998-05-12 Lynx Therapeutics, Inc. DNA sequencing by parallel oligonucleotide extensions
US5701256A (en) 1995-05-31 1997-12-23 Cold Spring Harbor Laboratory Method and apparatus for biological sequence comparison
GB9620209D0 (en) 1996-09-27 1996-11-13 Cemu Bioteknik Ab Method of sequencing DNA
US6054278A (en) 1997-05-05 2000-04-25 The Perkin-Elmer Corporation Ribosomal RNA gene polymorphism based microorganism identification
US6054276A (en) 1998-02-23 2000-04-25 Macevicz; Stephen C. DNA restriction site mapping
US6223128B1 (en) 1998-06-29 2001-04-24 Dnstar, Inc. DNA sequence assembly system
US6787308B2 (en) 1998-07-30 2004-09-07 Solexa Ltd. Arrayed biomolecules and their use in sequencing
GB9901475D0 (en) 1999-01-22 1999-03-17 Pyrosequencing Ab A method of DNA sequencing
US6818395B1 (en) 1999-06-28 2004-11-16 California Institute Of Technology Methods and apparatus for analyzing polynucleotide sequences
WO2001023610A2 (en) 1999-09-29 2001-04-05 Solexa Ltd. Polynucleotide sequencing
NZ524171A (en) 2000-07-18 2006-09-29 Correlogic Systems Inc A process for discriminating between biological states based on hidden patterns from biological data
NO20004869D0 (en) 2000-09-28 2000-09-28 Torbjoern Rognes Method for fast optimal local sequence alignment using parallel processing
WO2002072892A1 (en) 2001-03-12 2002-09-19 California Institute Of Technology Methods and apparatus for analyzing polynucleotide sequences by asynchronous base extension
US6890763B2 (en) 2001-04-30 2005-05-10 Syn X Pharma, Inc. Biopolymer marker indicative of disease state having a molecular weight of 1350 daltons
US7809509B2 (en) 2001-05-08 2010-10-05 Ip Genesis, Inc. Comparative mapping and assembly of nucleic acid sequences
US7577554B2 (en) 2001-07-03 2009-08-18 I2 Technologies Us, Inc. Workflow modeling using an acyclic directed graph data structure
US20040023209A1 (en) 2001-11-28 2004-02-05 Jon Jonasson Method for identifying microorganisms based on sequencing gene fragments
CA2484625A1 (en) 2002-05-09 2003-11-20 Surromed, Inc. Methods for time-alignment of liquid chromatography-mass spectrometry data
US7225324B2 (en) 2002-10-31 2007-05-29 Src Computers, Inc. Multi-adaptive processing systems and techniques for enhancing parallelism and performance of computational functions
WO2004061616A2 (en) 2002-12-27 2004-07-22 Rosetta Inpharmatics Llc Computer systems and methods for associating genes with traits using cross species data
US7885840B2 (en) 2003-01-07 2011-02-08 Sap Aktiengesellschaft System and method of flexible workflow management
US7575865B2 (en) 2003-01-29 2009-08-18 454 Life Sciences Corporation Methods of amplifying and sequencing nucleic acids
US7953891B2 (en) 2003-03-18 2011-05-31 Microsoft Corporation Systems and methods for scheduling data flow execution based on an arbitrary graph describing the desired data flow
JP2005092719A (en) 2003-09-19 2005-04-07 Nec Corp Haplotype estimation method, estimation system, and program
EP2202322A1 (en) 2003-10-31 2010-06-30 AB Advanced Genetic Analysis Corporation Methods for producing a paired tag from a nucleic acid sequence and methods of use thereof
US7169560B2 (en) 2003-11-12 2007-01-30 Helicos Biosciences Corporation Short cycle methods for sequencing polynucleotides
US20060195266A1 (en) 2005-02-25 2006-08-31 Yeatman Timothy J Methods for predicting cancer outcome and gene signatures for use therein
WO2005107412A2 (en) 2004-04-30 2005-11-17 Rosetta Inpharmatics Llc Systems and methods for reconstruction gene networks in segregating populations
US7642511B2 (en) 2004-09-30 2010-01-05 Ut-Battelle, Llc Ultra high mass range mass spectrometer systems
WO2006052242A1 (en) 2004-11-08 2006-05-18 Seirad, Inc. Methods and systems for compressing and comparing genomic data
EP1910537A1 (en) 2005-06-06 2008-04-16 454 Life Sciences Corporation Paired end sequencing
US7329860B2 (en) 2005-11-23 2008-02-12 Illumina, Inc. Confocal imaging methods and apparatus
US7580918B2 (en) 2006-03-03 2009-08-25 Adobe Systems Incorporated System and method of efficiently representing and searching directed acyclic graph structures in databases
GB2436564A (en) 2006-03-31 2007-10-03 Plant Bioscience Ltd Prediction of heterosis and other traits by transcriptome analysis
US7282337B1 (en) 2006-04-14 2007-10-16 Helicos Biosciences Corporation Methods for increasing accuracy of nucleic acid sequencing
US7754429B2 (en) 2006-10-06 2010-07-13 Illumina Cambridge Limited Method for pair-wise sequencing a plurity of target polynucleotides
US8262900B2 (en) 2006-12-14 2012-09-11 Life Technologies Corporation Methods and apparatus for measuring analytes using large scale FET arrays
US8349167B2 (en) 2006-12-14 2013-01-08 Life Technologies Corporation Methods and apparatus for detecting molecular interactions using FET arrays
JP5622392B2 (en) 2006-12-14 2014-11-12 ライフ テクノロジーズ コーポレーション Method and apparatus for analyte measurement using large-scale FET arrays
EP2374902B1 (en) 2007-01-26 2017-11-01 Illumina, Inc. Nucleic acid sequencing system and method
US8146099B2 (en) 2007-09-27 2012-03-27 Microsoft Corporation Service-oriented pipeline based architecture
US20090119313A1 (en) 2007-11-02 2009-05-07 Ioactive Inc. Determining structure of binary data using alignment algorithms
US8478544B2 (en) 2007-11-21 2013-07-02 Cosmosid Inc. Direct identification and measurement of relative populations of microorganisms with direct DNA sequencing and probabilistic methods
US20090164135A1 (en) 2007-12-21 2009-06-25 Brodzik Andrzej K Quaternionic algebra approach to dna and rna tandem repeat detection
US20100010992A1 (en) 2008-07-10 2010-01-14 Morris Robert P Methods And Systems For Resolving A Location Information To A Network Identifier
US20100041048A1 (en) 2008-07-31 2010-02-18 The Johns Hopkins University Circulating Mutant DNA to Assess Tumor Dynamics
US20100035252A1 (en) 2008-08-08 2010-02-11 Ion Torrent Systems Incorporated Methods for sequencing individual nucleic acids under tension
US9347089B2 (en) 2008-09-19 2016-05-24 Children's Medical Center Corporation Therapeutic and diagnostic strategies
US8546128B2 (en) 2008-10-22 2013-10-01 Life Technologies Corporation Fluidics system for sequential delivery of reagents
US20100137143A1 (en) 2008-10-22 2010-06-03 Ion Torrent Systems Incorporated Methods and apparatus for measuring analytes
US20100301398A1 (en) 2009-05-29 2010-12-02 Ion Torrent Systems Incorporated Methods and apparatus for measuring analytes
US8691510B2 (en) 2008-11-07 2014-04-08 Sequenta, Inc. Sequence analysis of complex amplicons
US8370079B2 (en) 2008-11-20 2013-02-05 Pacific Biosciences Of California, Inc. Algorithms for sequence determination
DE102008061772A1 (en) 2008-12-11 2010-06-17 Febit Holding Gmbh Method for studying nucleic acid populations
US10839940B2 (en) 2008-12-24 2020-11-17 New York University Method, computer-accessible medium and systems for score-driven whole-genome shotgun sequence assemble
US8352195B2 (en) 2009-03-20 2013-01-08 Siemens Corporation Methods and systems for identifying PCR primers specific to one or more target genomes
DK2511843T3 (en) 2009-04-29 2017-03-27 Complete Genomics Inc METHOD AND SYSTEM FOR DETERMINING VARIATIONS IN A SAMPLE POLYNUCLEOTIDE SEQUENCE IN TERMS OF A REFERENCE POLYNUCLEOTIDE SEQUENCE
US8673627B2 (en) 2009-05-29 2014-03-18 Life Technologies Corporation Apparatus and methods for performing electrochemical reactions
US8574835B2 (en) 2009-05-29 2013-11-05 Life Technologies Corporation Scaffolded nucleic acid polymer particles and methods of making and using
US9524369B2 (en) 2009-06-15 2016-12-20 Complete Genomics, Inc. Processing and analysis of complex nucleic acid sequence data
US20110098193A1 (en) 2009-10-22 2011-04-28 Kingsmore Stephen F Methods and Systems for Medical Sequencing Analysis
MX2012008233A (en) 2010-01-14 2012-08-17 Fluor Tech Corp 3d plant modeling systems and methods.
WO2011103236A2 (en) 2010-02-18 2011-08-25 The Johns Hopkins University Personalized tumor biomarkers
US9165109B2 (en) 2010-02-24 2015-10-20 Pacific Biosciences Of California, Inc. Sequence assembly and consensus sequence determination
US20120030566A1 (en) 2010-07-28 2012-02-02 Victor B Michael System with touch-based selection of data items
US20130289099A1 (en) 2010-12-17 2013-10-31 Universite Pierre Et Marie Curie (Paris 6) Abcg1 gene as a marker and a target gene for treating obesity
EP2663655B1 (en) 2011-01-14 2015-09-02 Keygene N.V. Paired end random sequence based genotyping
RU2013138422A (en) 2011-01-19 2015-02-27 Конинклейке Филипс Электроникс Н.В. METHOD FOR PROCESSING GENOMIC DATA
US20120239706A1 (en) 2011-03-18 2012-09-20 Los Alamos National Security, Llc Computer-facilitated parallel information alignment and analysis
JP2014516514A (en) 2011-04-14 2014-07-17 コンプリート・ジェノミックス・インコーポレイテッド Processing and analysis of complex nucleic acid sequence data
US20130059738A1 (en) 2011-04-28 2013-03-07 Life Technologies Corporation Methods and compositions for multiplex pcr
US9506167B2 (en) 2011-07-29 2016-11-29 Ginkgo Bioworks, Inc. Methods and systems for cell state quantification
EP2758908A1 (en) 2011-09-20 2014-07-30 Life Technologies Corporation Systems and methods for identifying sequence variation
WO2013097257A1 (en) 2011-12-31 2013-07-04 深圳华大基因科技有限公司 Method and system for testing fusion gene
US9552458B2 (en) 2012-03-16 2017-01-24 The Research Institute At Nationwide Children's Hospital Comprehensive analysis pipeline for discovery of human genetic variation
US8209130B1 (en) 2012-04-04 2012-06-26 Good Start Genetics, Inc. Sequence assembly
US9104656B2 (en) 2012-07-03 2015-08-11 International Business Machines Corporation Using lexical analysis and parsing in genome research
US10777301B2 (en) 2012-07-13 2020-09-15 Pacific Biosciences For California, Inc. Hierarchical genome assembly method using single long insert library
US20140066317A1 (en) 2012-09-04 2014-03-06 Guardant Health, Inc. Systems and methods to detect rare mutations and copy number variation
EP2918047A4 (en) 2012-11-06 2016-04-20 Hewlett Packard Development Co Enhanced graph traversal
CA2888779A1 (en) 2012-11-07 2014-05-15 Good Start Genetics, Inc. Validation of genetic tests
EP2946345B1 (en) 2013-01-17 2024-04-03 Personalis, Inc. Methods and systems for genetic analysis
WO2014116921A1 (en) 2013-01-24 2014-07-31 New York University Utilization of pattern matching in stringomes
US10032195B2 (en) 2013-03-13 2018-07-24 Airline Tariff Publishing Company System, method and computer program product for providing a fare analytic engine
US9477779B2 (en) 2013-03-15 2016-10-25 Neo Technology, Inc. Graph database devices and methods for partitioning graphs
US9898575B2 (en) 2013-08-21 2018-02-20 Seven Bridges Genomics Inc. Methods and systems for aligning sequences
US9116866B2 (en) 2013-08-21 2015-08-25 Seven Bridges Genomics Inc. Methods and systems for detecting sequence variants
WO2015048753A1 (en) 2013-09-30 2015-04-02 Seven Bridges Genomics Inc. Methods and system for detecting sequence variants
EP3052651B1 (en) 2013-10-01 2019-11-27 Life Technologies Corporation Systems and methods for detecting structural variants
KR102386134B1 (en) 2013-10-15 2022-04-12 리제너론 파마슈티칼스 인코포레이티드 High resolution allele identification
US10832797B2 (en) 2013-10-18 2020-11-10 Seven Bridges Genomics Inc. Method and system for quantifying sequence alignment
KR20160062763A (en) 2013-10-18 2016-06-02 세븐 브릿지스 지노믹스 인크. Methods and systems for genotyping genetic samples
AU2014337093B2 (en) 2013-10-18 2020-07-30 Seven Bridges Genomics Inc. Methods and systems for identifying disease-induced mutations
WO2015058120A1 (en) 2013-10-18 2015-04-23 Seven Bridges Genomics Inc. Methods and systems for aligning sequences in the presence of repeating elements
US9092402B2 (en) 2013-10-21 2015-07-28 Seven Bridges Genomics Inc. Systems and methods for using paired-end data in directed acyclic structure
EP3092317B1 (en) 2014-01-10 2021-04-21 Seven Bridges Genomics Inc. Systems and methods for use of known alleles in read mapping
US9817944B2 (en) 2014-02-11 2017-11-14 Seven Bridges Genomics Inc. Systems and methods for analyzing sequence data
WO2016141294A1 (en) 2015-03-05 2016-09-09 Seven Bridges Genomics Inc. Systems and methods for genomic pattern analysis
US20160364523A1 (en) 2015-06-11 2016-12-15 Seven Bridges Genomics Inc. Systems and methods for identifying microorganisms
US10793895B2 (en) 2015-08-24 2020-10-06 Seven Bridges Genomics Inc. Systems and methods for epigenetic analysis
US10724110B2 (en) 2015-09-01 2020-07-28 Seven Bridges Genomics Inc. Systems and methods for analyzing viral nucleic acids
US10584380B2 (en) 2015-09-01 2020-03-10 Seven Bridges Genomics Inc. Systems and methods for mitochondrial analysis
US20170199960A1 (en) 2016-01-07 2017-07-13 Seven Bridges Genomics Inc. Systems and methods for adaptive local alignment for graph genomes
US20170199959A1 (en) 2016-01-13 2017-07-13 Seven Bridges Genomics Inc. Genetic analysis systems and methods
US10364468B2 (en) 2016-01-13 2019-07-30 Seven Bridges Genomics Inc. Systems and methods for analyzing circulating tumor DNA
US10262102B2 (en) 2016-02-24 2019-04-16 Seven Bridges Genomics Inc. Systems and methods for genotyping with graph reference

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11488688B2 (en) 2013-08-21 2022-11-01 Seven Bridges Genomics Inc. Methods and systems for detecting sequence variants
US11211146B2 (en) 2013-08-21 2021-12-28 Seven Bridges Genomics Inc. Methods and systems for aligning sequences
US11837328B2 (en) 2013-08-21 2023-12-05 Seven Bridges Genomics Inc. Methods and systems for detecting sequence variants
US10325675B2 (en) 2013-08-21 2019-06-18 Seven Bridges Genomics Inc. Methods and systems for detecting sequence variants
US10053736B2 (en) 2013-10-18 2018-08-21 Seven Bridges Genomics Inc. Methods and systems for identifying disease-induced mutations
US10832797B2 (en) 2013-10-18 2020-11-10 Seven Bridges Genomics Inc. Method and system for quantifying sequence alignment
US11447828B2 (en) 2013-10-18 2022-09-20 Seven Bridges Genomics Inc. Methods and systems for detecting sequence variants
US10078724B2 (en) 2013-10-18 2018-09-18 Seven Bridges Genomics Inc. Methods and systems for genotyping genetic samples
US11049587B2 (en) 2013-10-18 2021-06-29 Seven Bridges Genomics Inc. Methods and systems for aligning sequences in the presence of repeating elements
US10204207B2 (en) 2013-10-21 2019-02-12 Seven Bridges Genomics Inc. Systems and methods for transcriptome analysis
US10055539B2 (en) 2013-10-21 2018-08-21 Seven Bridges Genomics Inc. Systems and methods for using paired-end data in directed acyclic structure
US10867693B2 (en) 2014-01-10 2020-12-15 Seven Bridges Genomics Inc. Systems and methods for use of known alleles in read mapping
US11756652B2 (en) 2014-02-11 2023-09-12 Seven Bridges Genomics Inc. Systems and methods for analyzing sequence data
US10878938B2 (en) 2014-02-11 2020-12-29 Seven Bridges Genomics Inc. Systems and methods for analyzing sequence data
US10192026B2 (en) 2015-03-05 2019-01-29 Seven Bridges Genomics Inc. Systems and methods for genomic pattern analysis
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
US11568957B2 (en) 2015-05-18 2023-01-31 Regeneron Pharmaceuticals Inc. Methods and systems for copy number variant detection
US11697835B2 (en) 2015-08-24 2023-07-11 Seven Bridges Genomics Inc. Systems and methods for epigenetic analysis
US10793895B2 (en) 2015-08-24 2020-10-06 Seven Bridges Genomics Inc. Systems and methods for epigenetic analysis
US10584380B2 (en) 2015-09-01 2020-03-10 Seven Bridges Genomics Inc. Systems and methods for mitochondrial analysis
US10724110B2 (en) 2015-09-01 2020-07-28 Seven Bridges Genomics Inc. Systems and methods for analyzing viral nucleic acids
US11649495B2 (en) 2015-09-01 2023-05-16 Seven Bridges Genomics Inc. Systems and methods for mitochondrial analysis
US11702708B2 (en) 2015-09-01 2023-07-18 Seven Bridges Genomics Inc. Systems and methods for analyzing viral nucleic acids
US11347704B2 (en) 2015-10-16 2022-05-31 Seven Bridges Genomics Inc. Biological graph or sequence serialization
US11810648B2 (en) 2016-01-07 2023-11-07 Seven Bridges Genomics Inc. Systems and methods for adaptive local alignment for graph genomes
US11560598B2 (en) 2016-01-13 2023-01-24 Seven Bridges Genomics Inc. Systems and methods for analyzing circulating tumor DNA
US10262102B2 (en) 2016-02-24 2019-04-16 Seven Bridges Genomics Inc. Systems and methods for genotyping with graph reference
US10790044B2 (en) 2016-05-19 2020-09-29 Seven Bridges Genomics Inc. Systems and methods for sequence encoding, storage, and compression
US11289177B2 (en) 2016-08-08 2022-03-29 Seven Bridges Genomics, Inc. Computer method and system of identifying genomic mutations using graph-based local assembly
US11250931B2 (en) 2016-09-01 2022-02-15 Seven Bridges Genomics Inc. Systems and methods for detecting recombination
US11062793B2 (en) 2016-11-16 2021-07-13 Seven Bridges Genomics Inc. Systems and methods for aligning sequences to graph references
US10319465B2 (en) 2016-11-16 2019-06-11 Seven Bridges Genomics Inc. Systems and methods for aligning sequences to graph references
US11347844B2 (en) 2017-03-01 2022-05-31 Seven Bridges Genomics, Inc. Data security in bioinformatic sequence analysis
US10726110B2 (en) 2017-03-01 2020-07-28 Seven Bridges Genomics, Inc. Watermarking for data security in bioinformatic sequence analysis
US10949466B2 (en) * 2017-04-24 2021-03-16 Oracle International Corporation Multi-source breadth-first search (Ms-Bfs) technique and graph processing system that applies it
US10540398B2 (en) * 2017-04-24 2020-01-21 Oracle International Corporation Multi-source breadth-first search (MS-BFS) technique and graph processing system that applies it
US11435283B2 (en) 2018-04-17 2022-09-06 Sharif University Of Technology Optically detecting mutations in a sequence of DNA
WO2019226976A1 (en) * 2018-05-25 2019-11-28 New York Institute Of Technology Method and system for use in direct sequencing of rna
CN112236824A (en) * 2018-05-31 2021-01-15 皇家飞利浦有限公司 Systems and methods for allelic interpretation using map-based reference genomes
WO2020123906A1 (en) * 2018-12-13 2020-06-18 The General Hospital Corporation Biologically informed and accurate sequence alignment
CN111798923A (en) * 2019-05-24 2020-10-20 中国科学院计算技术研究所 Fine-grained load characteristic analysis method and device for gene comparison and storage medium
WO2021058683A1 (en) * 2019-09-25 2021-04-01 Koninklijke Philips N.V. Variant calling for multi-sample variation graph
WO2023170091A1 (en) 2022-03-09 2023-09-14 Politecnico Di Milano Methods for the alignment of sequence reads to non-acyclic genome graphs on heterogeneous computing systems

Also Published As

Publication number Publication date
US20240096450A1 (en) 2024-03-21
WO2017120128A1 (en) 2017-07-13
US20200058374A1 (en) 2020-02-20
US11810648B2 (en) 2023-11-07

Similar Documents

Publication Publication Date Title
US11810648B2 (en) Systems and methods for adaptive local alignment for graph genomes
US11756652B2 (en) Systems and methods for analyzing sequence data
US10192026B2 (en) Systems and methods for genomic pattern analysis
US10204207B2 (en) Systems and methods for transcriptome analysis
Canzar et al. Short read mapping: an algorithmic tour
Heyne et al. GraphClust: alignment-free structural clustering of local RNA secondary structures
Boussau et al. Efficient likelihood computations with nonreversible models of evolution
Hoffmann et al. Fast mapping of short sequences with mismatches, insertions and deletions using index structures
Kamal et al. De-Bruijn graph with MapReduce framework towards metagenomic data classification
US20220261384A1 (en) Biological graph or sequence serialization
AU2014340461A1 (en) Systems and methods for using paired-end data in directed acyclic structure
Shaw et al. Theory of local k-mer selection with applications to long-read alignment
US20180247016A1 (en) Systems and methods for providing assisted local alignment
Wei et al. CALLR: a semi-supervised cell-type annotation method for single-cell RNA sequencing data
Vineetha et al. SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning
Ekim et al. Efficient mapping of accurate long reads in minimizer space with mapquik
Salmela et al. Fast and accurate correction of optical mapping data via spaced seeds
GB2567390A (en) Method for building character sequence dictionary, method for searching character sequence dictionary, and system for processing character sequence dictionary
Mahmud et al. Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees
Hudeček Improving and extending the multiple sequence alignment suite PRALINE
Vyverman ALFALFA: fast and accurate mapping of long next generation sequencing reads

Legal Events

Date Code Title Description
AS Assignment

Owner name: VENTURE LENDING & LEASING VII, INC., CALIFORNIA

Free format text: SECURITY INTEREST;ASSIGNOR:SEVEN BRIDGES GENOMICS INC.;REEL/FRAME:039038/0535

Effective date: 20160613

AS Assignment

Owner name: SEVEN BRIDGES GENOMICS INC., MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GHOSE, KAUSHIK;LEE, WAN-PING;REEL/FRAME:039921/0056

Effective date: 20160817

AS Assignment

Owner name: SEVEN BRIDGES GENOMICS D.O.O., SERBIA

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:VENTURE LENDING & LEASING VII, INC.;REEL/FRAME:044174/0050

Effective date: 20170925

Owner name: SEVEN BRIDGES GENOMICS UK LTD., UNITED KINGDOM

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:VENTURE LENDING & LEASING VII, INC.;REEL/FRAME:044174/0050

Effective date: 20170925

Owner name: SEVEN BRIDGES GENOMICS INC., MASSACHUSETTS

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:VENTURE LENDING & LEASING VII, INC.;REEL/FRAME:044174/0050

Effective date: 20170925

AS Assignment

Owner name: BROWN RUDNICK, MASSACHUSETTS

Free format text: NOTICE OF ATTORNEY'S LIEN;ASSIGNOR:SEVEN BRIDGES GENOMICS INC.;REEL/FRAME:044174/0113

Effective date: 20171011

AS Assignment

Owner name: MJOLK HOLDING BV, NETHERLANDS

Free format text: SECURITY INTEREST;ASSIGNOR:SEVEN BRIDGES GENOMICS INC.;REEL/FRAME:044305/0871

Effective date: 20171013

AS Assignment

Owner name: SEVEN BRIDGES GENOMICS INC., MASSACHUSETTS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:MJOLK HOLDING BV;REEL/FRAME:045928/0013

Effective date: 20180412

AS Assignment

Owner name: SEVEN BRIDGES GENOMICS INC., MASSACHUSETTS

Free format text: TERMINATION AND RELEASE OF NOTICE OF ATTORNEY'S LIEN;ASSIGNOR:BROWN RUDNICK LLP;REEL/FRAME:046943/0683

Effective date: 20180907

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION