WO2019017806A1

WO2019017806A1 - An apparatus and method for identifying haplotypes

Info

Publication number: WO2019017806A1
Application number: PCT/RU2017/000538
Authority: WO
Inventors: Dmitry Yurievich IGNATOV; Alexander Nikolaevich Filippov; Xuecang ZHANG
Original assignee: Huawei Technologies Co., Ltd
Priority date: 2017-07-20
Filing date: 2017-07-20
Publication date: 2019-01-24
Also published as: CN111344794B; CN111344794A

Abstract

The invention relates to an apparatus (400) for identifying haplotypes in a plurality of sample nucleotide sequences on the basis of a reference nucleotide sequence. The apparatus (400) comprises a processing unit (401) configured to generate an initial set of allele sequences by extracting a plurality of allele sequences from the plurality of sample nucleotide sequences on the basis of the reference nucleotide sequence, wherein each allele of each of the plurality of allele sequences is associated with a nucleotide site in the reference nucleotide sequence; generate a first aggregated set of allele sequences on the basis of the initial set of allele sequences by combining those allele sequences from the initial set of allele sequences, which have the same alleles in overlapping sequence portions and belong to the same haplotype, into an aggregated allele sequence, wherein the first aggregated set of allele sequences comprises the aggregated allele sequences and the allele sequences from the initial set of allele sequences that are not combined into an aggregated allele sequence; generate a second aggregated set of allele sequences on the basis of the first aggregated set of allele sequences by concatenating pairs of neighboring allele sequences from the first aggregated set of allele sequences, wherein neighboring allele sequences comprise alleles in neighboring nucleotide sites, but no overlapping alleles; and identify haplotypes in the plurality of sample nucleotide sequences on the basis of the second aggregated set of allele sequences.

Description

AN APPARATUS AND METHOD FOR IDENTIFYING HAPLOTYPES

TECHNICAL FIELD

In general, the present invention relates broadly to the field of genetics. More specifically, the present invention relates to an apparatus and method for identifying haplotypes in a plurality of sample nucleotide sequences.

BACKGROUND

In modern biology and medicine, there is a wide range of genetic tasks to be accomplished, such as identification of inherited diseases or investigation of genome variations in populations of different species. These tasks require identification of haplotypes, i.e., groups of alleles which tend to be inherited together. Despite the importance of haplotyping, the long duration of this process and its computational cost substantially restrict the application of haplotyping in medical practice and scientific research.

Generally, haplotype identification is performed for sequences of nucleotides which are mapped to regions of a reference sequence where the probability of nucleotide compliance is maximal (see figure 1). On the basis of this mapping, these regions are selected for haplotyping by de-novo haplotype assembly, which does not take into account the mapping of sequences and is executed within the selected regions. The de-novo reassembling method greatly increases the computational complexity and time of haplotyping, but it is still worth using because of the high frequency of repeated nucleotide sequences in a genome. As can be seen in figure 1, a sequence can change the place of its alignment after haplotype assembly if the reference has repetitive sequences. The sequence relocation is thus possible if most nucleotides in a sequence match repetitive subsequences in the reference and the others are mismatched. Obviously, the shorter the repetitive subsequences in the reference are, the smaller the possibility that the other nucleotides in the sequence do not match the reference.

The regions for haplotyping are usually quite short, with a length of, for example, 100 to 500 nucleotides. Considering the upper bound of this range, i.e. 500 nucleotides, and taking into account that the human nuclear genome consists of ~3 x 10⁹ base pairs, it is interesting that if we deny (for reassembling) that a sequence is located exactly in its current alignment, then the probability that the sequence belongs to the current region is less than 10⁻⁶ (500 divided by 3 x 10⁹). From this point of view, the reassembling within regions would not make sense.

Considering reassembling a sequence of 100 nucleotides with a random distribution of four different types of nucleotides, even if the sequence comprises many mismatches (e.g., 20) compared to the reference sequence, the probability of finding another alignment with the same or even better nucleotide compliance for this sequence is less than ~10⁻⁴⁷ (4⁻⁽¹⁰⁰⁻²⁰⁾ x 100). This means that haplotyping without reassembling yields 1 mistake per ~10³⁷ genomes (10³⁷ = 1/(3 x 10⁹ x 10⁻⁴⁷)). With increasing repeatability in a reference, the probability of sequence realignment rises. Therefore, it is important to have a method for quick identification of regions overloaded with repetitions, where reassembling is reasonable and practical.
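The order-of-magnitude estimates in the two preceding paragraphs can be checked with a few lines of arithmetic. The following is only a sketch under stated assumptions: the alternative-alignment estimate is taken to be 4^-(100-20) x 100, which is an assumed reading of the expression in the text, and all variable names are hypothetical. The exact values differ from the rounded figures in the text by small constant factors.

```python
# Back-of-the-envelope check of the figures quoted in the text.
GENOME_SIZE = 3e9      # approximate human nuclear genome size in base pairs
REGION_LENGTH = 500    # upper bound of a typical haplotyping region
READ_LENGTH = 100      # nucleotides in the example sequence
MISMATCHES = 20        # mismatches assumed in the example sequence

# Probability that a sequence with unknown position lies in one given region.
p_region = REGION_LENGTH / GENOME_SIZE

# Probability of an equally good alternative alignment under a uniform random
# nucleotide model. ASSUMPTION: the expression is 4^-(100-20) x 100.
p_realign = 4.0 ** -(READ_LENGTH - MISMATCHES) * 100

# Number of genomes processed per expected mistake when reassembly is skipped.
genomes_per_mistake = 1 / (GENOME_SIZE * p_realign)

print(f"p_region            = {p_region:.1e}")
print(f"p_realign           = {p_realign:.1e}")
print(f"genomes_per_mistake = {genomes_per_mistake:.1e}")
```

The printed values confirm that the region probability is below 10⁻⁶ and that the realignment probability is tens of orders of magnitude smaller, which is the point the argument relies on.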

It is understood that haplotyping without sequence reassembling can be faster than haplotyping with it, without loss of quality. However, previous works do not provide any efficient method for multi-genome haplotyping without reassembling. Thus, a method of haplotyping without reassembling is needed, wherein the method can use information about current alignments of sequences for quickly and efficiently aggregating these sequences into haplotypes. In addition, a method for quickly identifying regions where reassembling makes sense is needed.

One of the newest and the most efficient methods of haplotyping is a mixture model for single individual haplotyping (MixSIH), which provides a binary representation of two haplotypes as described in "MixSIH: a mixture model for single individual haplotyping" by Matsumoto H. and Kiryu H., BMC Genomics 14, S5, 2013. Based on a binary model and a 'minimum connectivity' score, it provides an accuracy measure of haplotype consistency. With this approach, MixSIH extracts highly accurate haplotype segments in the following steps, as illustrated in figure 2.

The MixSIH method begins with selection of different nucleotides, i.e. extraction of alleles (step 1). To improve performance, the alleles are subsequently transformed into a binary format (step 2). In step 3 the most probable alleles are selected with a proposed probability function. Finally, haplotypes are selected on the basis of a connectivity score in step 4, which comprises the substeps shown in figure 3. However, the state-of-the-art MixSIH method has several critical problems as follows: the MixSIH method performs merely single individual haplotyping and cannot be applied to multiple genomes; the MixSIH method specializes in single individual haplotyping and, thus, cannot produce more than two haplotypes; the MixSIH method uses complex formulas in the process of haplotype inference and thus cannot provide optimal performance; the MixSIH method does not support de-novo assembly of haplotypes and can lose haplotyping quality in regions with a high frequency of repetitions; the MixSIH method does not take into account the Phred quality of nucleotide identification and thus cannot produce results with the best precision.

In light of the above, there is a need for an improved apparatus and method, which provide haplotyping with high efficiency and precision, allow applying haplotyping to multiple genomes, and enable poly-haplotyping with generation of more than two haplotypes.

SUMMARY

It is an object of the invention to provide an improved apparatus and method, which guarantee haplotyping with high efficiency and precision, allow applying haplotyping to multiple genomes, and enable poly-haplotyping with generation of more than two haplotypes.

The foregoing and other objects are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

Generally, the invention relates to an apparatus and method for identifying haplotypes in a plurality of sample nucleotide sequences. More specifically, a novel apparatus and method are provided for overlapping haplotyping in the regions with a low frequency of repetitions of nucleotide subsequences in order to overcome the drawbacks of conventional haplotyping methods. The present invention offers several significant advantages compared to the prior art: firstly, the invention provides a method of identifying haplotypes in a sample of multiple genomes. In contrast to existing solutions, this method can take into account all available alleles and their possible combinations. Secondly, the invention develops a method for selection of an expected number of haplotypes. In contrast to existing solutions, this method can take into account the expected number of haplotypes in different steps of haplotyping. Thirdly, the invention provides a method of aggregating haplotypes efficiently, which results in an improved performance via the support of the simplest way of haplotyping in contrast to existing solutions. Fourthly, the invention provides a method of generating results with maximal precision by using all available information for clever assembly of haplotypes. Finally, the invention provides a method of applying appropriate assembly procedures to the regions with different frequencies of repetitions.

More specifically, according to a first aspect an apparatus for identifying haplotypes in a plurality of sample nucleotide sequences on the basis of a reference nucleotide sequence is provided, wherein the apparatus comprises a processing unit configured to generate an initial set of allele sequences by extracting a plurality of allele sequences from the plurality of sample nucleotide sequences on the basis of the reference nucleotide sequence, wherein each allele (represented by deletion, insertion or single-nucleotide polymorphism) of each of the plurality of allele sequences is associated with a nucleotide site in the reference nucleotide sequence; generate a first aggregated set of allele sequences on the basis of the initial set of allele sequences by combining those allele sequences from the initial set of allele sequences, which have the same alleles in overlapping sequence portions and belong to the same haplotype, into an aggregated allele sequence, wherein the first aggregated set of allele sequences comprises the aggregated allele sequences and the allele sequences from the initial set of allele sequences that are not combined into an aggregated allele sequence; generate a second aggregated set of allele sequences on the basis of the first aggregated set of allele sequences by concatenating pairs of neighboring allele sequences from the first aggregated set of allele sequences, wherein neighboring allele sequences comprise alleles in neighboring nucleotide sites, but no overlapping alleles; and identify haplotypes in the plurality of sample nucleotide sequences on the basis of the second aggregated set of allele sequences.

Thus, an improved apparatus for identifying haplotypes is provided, allowing haplotyping for multiple genomes, providing the results of haplotyping with high efficiency and precision, and enabling poly-haplotyping with generation of more than two haplotypes.

In a further possible implementation form of the first aspect, the processing unit is further configured to filter the initial set of allele sequences by removing accidental variants from the initial set of allele sequences.

In a further possible implementation form of the first aspect, the processing unit is configured to filter the initial set of allele sequences by removing the accidental variants from the initial set of allele sequences by removing those allele sequences from the initial set of allele sequences that have an appearance frequency below a filtering threshold value, wherein the appearance frequency indicates how many times an allele sequence repeats among the initial set of allele sequences.

In a further possible implementation form of the first aspect, the processing unit is further configured to remove those allele sequences from the first aggregated set of allele sequences that are portions of at least one other allele sequence of the first aggregated set of allele sequences.

In a further possible implementation form of the first aspect, the processing unit is configured to generate the initial set of allele sequences by extracting those allele sequences from the plurality of sample nucleotide sequences that have at least one nucleotide not matching the corresponding nucleotide of the reference nucleotide sequence at the corresponding nucleotide site.

In a further possible implementation form of the first aspect, the processing unit is configured to combine those allele sequences from the initial set of allele sequences that have an overlapping allele portion into an aggregated sequence, wherein the aggregated sequence comprises the overlapping sequence portion and non-overlapping alleles from those allele sequences in an order of nucleotide sites associated with alleles, i.e. according to the alignment of each of those allele sequences with the reference nucleotide sequence.

In a further possible implementation form of the first aspect, in case the number of allele sequences of the second aggregated set of allele sequences is larger than an expected value, the processing unit is further configured to identify haplotypes in the plurality of sample nucleotide sequences by calculating a probability measure for each allele sequence of the second aggregated set of allele sequences on the basis of a statistical method, wherein the probability measure indicates the probability that an allele sequence is a haplotype and by identifying haplotypes in the second aggregated set of allele sequences on the basis of the probability measure.

In a further possible implementation form of the first aspect, the statistical method comprises a Bayesian method on the basis of a Hidden Markov Model (HMM).

In a further possible implementation form of the first aspect, the processing unit is further configured to determine the number of repetitions in the reference nucleotide sequence and to identify the haplotypes in the plurality of sample nucleotide sequences on the basis of the reference nucleotide sequence, in case the number of repetitions is lower than a repetition threshold.

In a further possible implementation form of the first aspect, the processing unit is further configured to:

(i) generate, if the reference nucleotide sequence has a next nucleotide symbol, a hash code on the basis of the next nucleotide symbol;

(ii) increment a counter value, in case the generated hash code is already part of a set of generated hash codes, or add the generated hash code to the set of generated hash codes, in case the generated hash code is not part of the set of generated hash codes;

(iii) repeat (i) and (ii) as long as the counter value is smaller than a predefined threshold counter value; and

(iv) identify the haplotypes in the plurality of sample nucleotide sequences on the basis of the reference nucleotide sequence, in case the counter value is smaller than the predefined threshold counter value.

In a further possible implementation form of the first aspect, the processing unit is configured to generate the hash code on the basis of the next nucleotide symbol by: replacing the nucleotide symbol (A, C, G or T) with unique sequences of two bits; shifting the current value of the hash code by 2 bits left; applying the bitwise OR operation to the shifted hash code and the corresponding unique sequence of two bits; applying a binary mask to the result of the bitwise OR operation, wherein the first two bits of the binary mask are 0 and the remaining bits of the binary mask are 1.
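The rolling hash code and the repetition counter described in steps (i) to (iv) can be illustrated with a minimal Python sketch. It is not the claimed implementation. The concrete 2-bit encoding, the window length `k`, the function names, the choice to start counting only once a full window has been read, and the interpretation of the mask as keeping the lowest 2·k bits (so that, in a word of 2·k + 2 bits, the first two bits are 0 and the remaining bits are 1) are all assumptions made for illustration.

```python
# ASSUMPTION: one possible assignment of unique two-bit codes to nucleotides.
NUCLEOTIDE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def next_hash(current: int, symbol: str, mask: int) -> int:
    """Shift left by 2 bits, OR in the 2-bit code, then apply the mask."""
    return ((current << 2) | NUCLEOTIDE_BITS[symbol]) & mask


def is_low_repetition(reference: str, k: int = 16, max_repeats: int = 10) -> bool:
    """Return True if fewer than max_repeats repeated k-mers are seen.

    k and max_repeats are hypothetical parameters; max_repeats plays the
    role of the predefined threshold counter value.
    """
    mask = (1 << (2 * k)) - 1  # drops the two bits of the oldest nucleotide
    seen = set()               # set of generated hash codes
    repeats = 0                # counter value
    code = 0
    for index, symbol in enumerate(reference):
        code = next_hash(code, symbol, mask)
        if index + 1 < k:
            continue           # ASSUMPTION: skip until the window is full
        if code in seen:
            repeats += 1
            if repeats >= max_repeats:
                return False   # region is overloaded with repetitions
        else:
            seen.add(code)
    return True                # counter stayed below the threshold


print(is_low_repetition("ACGTTGCA", k=4, max_repeats=1))
print(is_low_repetition("ACGT" * 10, k=4, max_repeats=3))
```

Because each hash code is an exact 2-bit packing of the last `k` nucleotides, two equal codes correspond to two equal subsequences, so the set lookup detects repetitions without string comparison. Symbols other than A, C, G and T are not handled in this sketch.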

According to a second aspect the invention relates to a method for identifying haplotypes in a plurality of sample nucleotide sequences on the basis of a reference nucleotide sequence. The method comprises: generating an initial set of allele sequences by extracting a plurality of allele sequences from the plurality of sample nucleotide sequences on the basis of a reference nucleotide sequence, wherein each allele of each of the plurality of allele sequences is associated with a nucleotide site in the reference nucleotide sequence; generating a first aggregated set of allele sequences on the basis of the initial set of allele sequences by combining those allele sequences from the initial set of allele sequences, which have the same alleles in overlapping sequence portions and belong to the same haplotype, into an aggregated allele sequence, wherein the first aggregated set of allele sequences comprises the aggregated allele sequences and the allele sequences from the initial set of allele sequences that are not combined into an aggregated allele sequence; generating a second aggregated set of allele sequences on the basis of the first aggregated set of allele sequences by concatenating pairs of neighboring allele sequences from the first aggregated set of allele sequences, wherein neighboring allele sequences comprise alleles in neighboring nucleotide sites, but no overlapping alleles; and identifying haplotypes in the plurality of sample nucleotide sequences on the basis of the second aggregated set of allele sequences.

Thus, an improved method for identifying haplotypes is provided, allowing haplotyping for multiple genomes, providing the results of haplotyping with high efficiency and precision, and enabling poly-haplotyping with generation of more than two haplotypes.

In a further possible implementation form of the second aspect, the method further comprises filtering the initial set of allele sequences by removing accidental variants from the initial set of allele sequences.

In a further possible implementation form of the second aspect, the step of filtering the initial set of allele sequences by removing the accidental variants from the initial set of allele sequences comprises removing those allele sequences from the initial set of allele sequences that have an appearance frequency below a filtering threshold value, wherein the appearance frequency indicates how many times an allele sequence repeats among the initial set of allele sequences.

In a further possible implementation form of the second aspect, the method further comprises removing those allele sequences from the first aggregated set of allele sequences that are portions of at least one other allele sequence of the first aggregated set of allele sequences.

In a further possible implementation form of the second aspect, the step of generating the initial set of allele sequences comprises extracting those allele sequences from the plurality of sample nucleotide sequences that have at least one nucleotide not matching the corresponding nucleotide of the reference nucleotide sequence at the corresponding nucleotide site.

In a further possible implementation form of the second aspect, the step of combining comprises combining those allele sequences from the initial set of allele sequences that have an overlapping allele portion into an aggregated sequence, wherein the aggregated sequence comprises the overlapping sequence portion and non-overlapping alleles from those allele sequences in an order of nucleotide sites associated with alleles, i.e. according to the alignment of each of those allele sequences with the reference nucleotide sequence.

In a further possible implementation form of the second aspect, in case the number of allele sequences of the second aggregated set of allele sequences is larger than an expected value, the step of identifying haplotypes in the plurality of sample nucleotide sequences comprises calculating a probability measure for each allele sequence of the second aggregated set of allele sequences on the basis of a statistical method, wherein the probability measure indicates the probability that an allele sequence is a haplotype and identifying haplotypes in the second aggregated set of allele sequences on the basis of the probability measure.

In a further possible implementation form of the second aspect, the method comprises the further steps of:

(i) generating, if the reference nucleotide sequence has a next nucleotide symbol, a hash code on the basis of the next nucleotide symbol;

(ii) incrementing a counter value, in case the generated hash code is already part of a set of generated hash codes, or adding the generated hash code to the set of generated hash codes, in case the generated hash code is not part of the set of generated hash codes;

(iii) repeating steps (i) and (ii) as long as the counter value is smaller than a predefined threshold counter value; and

(iv) identifying the haplotypes in the plurality of sample nucleotide sequences on the basis of the reference nucleotide sequence, in case the counter value is smaller than the predefined threshold counter value.

In a further possible implementation form of the second aspect, the step of generating the hash code on the basis of the next nucleotide symbol comprises: replacing the nucleotide symbol (A, C, G or T) with unique sequences of two bits; shifting the current value of the hash code by 2 bits left; applying the bitwise OR operation to the shifted hash code and the corresponding unique sequence of two bits; applying a binary mask to the result of the bitwise OR operation, wherein the first two bits of the binary mask are 0 and the remaining bits of the binary mask are 1.

According to a third aspect the invention relates to a computer program comprising program code for performing the method according to the second aspect, when executed on a computer or a processor.

The invention can be implemented in hardware and/or software.

BRIEF DESCRIPTION OF THE DRAWINGS

Further embodiments of the invention will be described with respect to the following figures, wherein:

Fig. 1 shows a schematic diagram illustrating the local reassembling of nucleotide sequences on the reference/haplotype sequences;

Fig. 2 shows a schematic diagram of a mixture model for single individual haplotyping;

Fig. 3 shows a schematic diagram illustrating selection of allele sequences in a mixture model for single individual haplotyping;

Fig. 4 shows a schematic diagram of an apparatus for identifying haplotypes according to an embodiment;

Fig. 5 shows a schematic diagram illustrating a corresponding method of identifying haplotypes according to an embodiment;

Fig. 6 shows a schematic diagram illustrating a method for haplotyping implemented in an apparatus according to an embodiment;

Fig. 7 shows a schematic diagram illustrating different stages of a method for haplotyping implemented in an apparatus according to an embodiment;

Figs. 8A, 8B and 8C show schematic diagrams illustrating the identification of haplotypes as implemented in embodiments of the invention;

Fig. 9 shows a schematic diagram of an adaptive strategy of haplotyping implemented in an apparatus according to an embodiment;

Fig. 10 shows a diagram illustrating the generation of unique hash codes for nucleotide sequences as implemented in an apparatus according to an embodiment;

Fig. 11 shows a schematic diagram of a modified pipeline of a genome analysis toolkit as implemented in an apparatus according to an embodiment;

Fig. 12 shows a table of results of the De Bruijn graph reassembling (DBGR) and overlapping assembly (OA) implemented in embodiments of the invention;

Fig. 13 shows a schematic diagram visualizing the haplotypes (chromosome 4: 190610 - 190645 kb) generated with De Bruijn graph reassembling (DBGR) and overlapping assembly (OA) implemented in embodiments of the invention; and

Figs. 14A and 14B show schematic diagrams illustrating comparisons of precision and execution time between the haplotyping method implemented in embodiments of the invention and the conventional De Bruijn graph approach.

In the various figures, identical reference signs will be used for identical or at least functionally equivalent features.

DETAILED DESCRIPTION OF EMBODIMENTS

In the following description, reference is made to the accompanying drawings, which form part of the disclosure, and in which are shown, by way of illustration, specific aspects in which the present invention may be practiced. It will be appreciated that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, as the scope of the present invention is defined by the appended claims.

For instance, it will be appreciated that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if a specific method step is described, a corresponding device may include a unit to perform the described method step, even if such unit is not explicitly described or illustrated in the figures.

Moreover, in the following detailed description as well as in the claims embodiments with different functional blocks or processing units are described, which are connected with each other or exchange signals. It will be appreciated that the present invention covers embodiments as well, which include additional functional blocks or processing units that are arranged between the functional blocks or processing units of the embodiments described below.

Finally, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.

Figure 4 shows a schematic diagram of an apparatus 400 for identifying haplotypes in a plurality of sample nucleotide sequences on the basis of a reference nucleotide sequence. As will be described in more detail further below, the apparatus 400 comprises a processing unit 401 configured to: generate an initial set of allele sequences by extracting a plurality of allele sequences from the plurality of sample nucleotide sequences on the basis of the reference nucleotide sequence, wherein each allele (represented by deletion, insertion or single-nucleotide polymorphism) of each of the plurality of allele sequences is associated with a nucleotide site in the reference nucleotide sequence; generate a first aggregated set of allele sequences on the basis of the initial set of allele sequences by combining those allele sequences from the initial set of allele sequences, which have the same alleles in overlapping sequence portions and belong to the same haplotype, into an aggregated allele sequence, wherein the first aggregated set of allele sequences comprises the aggregated allele sequences and the allele sequences from the initial set of allele sequences that are not combined into an aggregated allele sequence; generate a second aggregated set of allele sequences on the basis of the first aggregated set of allele sequences by concatenating pairs of neighboring allele sequences from the first aggregated set of allele sequences, wherein neighboring allele sequences comprise alleles in neighboring nucleotide sites, but no overlapping alleles; and identify haplotypes in the plurality of sample nucleotide sequences on the basis of the second aggregated set of allele sequences.

Figure 5 illustrates the steps of a corresponding method 500 for identifying haplotypes in a plurality of sample nucleotide sequences on the basis of a reference nucleotide sequence. The method 500 comprises the following steps: generating 501 an initial set of allele sequences by extracting a plurality of allele sequences from the plurality of sample nucleotide sequences on the basis of a reference nucleotide sequence, wherein each allele (represented by deletion, insertion or single-nucleotide polymorphism) of each of the plurality of allele sequences is associated with a nucleotide site in the reference nucleotide sequence; generating 503 a first aggregated set of allele sequences on the basis of the initial set of allele sequences by combining those allele sequences from the initial set of allele sequences, which have the same alleles in overlapping sequence portions and belong to the same haplotype, into an aggregated allele sequence, wherein the first aggregated set of allele sequences comprises the aggregated allele sequences and the allele sequences from the initial set of allele sequences that are not combined into an aggregated allele sequence; generating 505 a second aggregated set of allele sequences on the basis of the first aggregated set of allele sequences by concatenating pairs of neighboring allele sequences from the first aggregated set of allele sequences, wherein neighboring allele sequences comprise alleles in neighboring nucleotide sites, but no overlapping alleles; and identifying 507 haplotypes in the plurality of sample nucleotide sequences on the basis of the second aggregated set of allele sequences.

Further embodiments, implementation forms and details of the apparatus 400 shown in figure 4 and the method 500 illustrated in figure 5 will be described in the following, where the method 500 will also be referred to as overlapping haplotyping.

A further embodiment of the method 500 (as well as the corresponding apparatus 400) is illustrated in figure 6 as the overlapping haplotyping method 600. The overlapping haplotyping method 600 comprises the following main steps: 601 extracting allele sequences from sequences of nucleotide symbols; 603 filtering out the rare alleles from the allele sequences using a predefined filtering threshold; 605 aggregating the allele sequences with the same alleles in overlaps; 607 removing allele sequences which are fragments of other allele sequences; 609 aggregating allele sequences without overlapping alleles; and 611 selecting the most probable of the assembled haplotypes if their number is larger than expected (i.e. larger than a predefined threshold).

Figure 7 shows a schematic diagram illustrating different overlapping haplotyping stages implemented in the apparatus 400 and the method 500.

On the basis of an alignment of a plurality of sample nucleotide sequences with the reference nucleotide sequence, the first stage is to compare them and to select alleles, such as nucleotide mismatches, missed or inserted nucleotides. Every allele comprises information about its location (i.e., absolute location within the reference sequence), symbols and/or type of changes, such as single nucleotide polymorphism, deletion, and insertion. A selected sequence of alleles comprises information about its starting and ending within the reference and a collection of bounded alleles. Extracted allele sequences can be used for haplotype aggregation in the following steps. Thus, as already described above, the processing unit 401 of the apparatus 400 is configured to generate an initial set of allele sequences by extracting a plurality of allele sequences from the plurality of sample nucleotide sequences on the basis of the reference nucleotide sequence, wherein each allele of each of the plurality of allele sequences is associated with a nucleotide site in the reference nucleotide sequence.
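The extraction stage described above can be sketched as follows. This is a simplified illustration rather than the described implementation: it handles only single nucleotide polymorphisms (deletions and insertions would require alignment information that is not modelled here), it uses 0-based reference positions, and the type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Allele:
    site: int      # absolute location within the reference (0-based here)
    kind: str      # type of change; only "SNP" is produced in this sketch
    symbols: str   # observed nucleotide symbol(s)


@dataclass(frozen=True)
class AlleleSequence:
    start: int                   # first reference site covered by the sequence
    end: int                     # one past the last covered reference site
    alleles: Tuple[Allele, ...]  # alleles bounded by [start, end)


def extract_allele_sequence(reference: str, read: str, start: int) -> Optional[AlleleSequence]:
    """Compare an aligned, gap-free read with the reference and collect mismatches.

    Returns None when the read matches the reference everywhere, mirroring the
    rule that only sequences with at least one mismatch are extracted.
    """
    alleles = tuple(
        Allele(start + offset, "SNP", base)
        for offset, base in enumerate(read)
        if reference[start + offset] != base
    )
    if not alleles:
        return None
    return AlleleSequence(start, start + len(read), alleles)


example = extract_allele_sequence("AAAAAAAA", "ACAG", start=2)
print(example)
```

Keeping the start and end of each allele sequence alongside its alleles is what later allows overlaps between sequences to be detected from positions alone.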

Moreover, as already described above, the processing unit 401 of the apparatus 400 is configured to generate the initial set of allele sequences by extracting those allele sequences from the plurality of sample nucleotide sequences that have at least one nucleotide not matching the corresponding nucleotide of the reference nucleotide sequence at the corresponding nucleotide site.

The second stage is to filter out all accidental and rare alleles based on the allele sequences, an input value of a filtering threshold per haplotype and an expected quantity of haplotypes. To apply filtering, the filtering threshold can first be calculated for all haplotypes as

$$\text{filtering threshold} = \frac{\text{threshold per haplotype}}{\text{expected quantity of haplotypes}}$$

and then the initial set of allele sequences can be filtered by removing the accidental variants from the initial set of allele sequences by removing those allele sequences from the initial set of allele sequences that have an appearance frequency below the filtering threshold value, wherein the appearance frequency indicates how many times an allele sequence repeats among the initial set of allele sequences. This stage reduces the computational complexity and increases the speed of the following operations.
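A minimal Python sketch of this filtering stage follows. It assumes that the appearance frequency is compared as a relative frequency (count divided by the size of the initial set) and that allele sequences are hashable tuples of (site, symbol) pairs; both choices, like the name `filter_rare`, are illustrative assumptions rather than details given in the description.

```python
from collections import Counter

def filter_rare(allele_sequences, threshold_per_haplotype, expected_haplotypes):
    """Drop allele sequences whose relative frequency is below the threshold."""
    # Threshold for all haplotypes: per-haplotype threshold / expected quantity.
    threshold = threshold_per_haplotype / expected_haplotypes
    counts = Counter(allele_sequences)
    total = len(allele_sequences)
    return [seq for seq in allele_sequences if counts[seq] / total >= threshold]

# Toy data: each allele sequence is a tuple of (site, symbol) pairs.
seq_a = ((3, "A"), (7, "G"))
seq_b = ((3, "A"), (7, "T"))
seq_c = ((5, "C"),)          # seen once in 100 observations
observed = [seq_a] * 60 + [seq_b] * 39 + [seq_c]

kept = filter_rare(observed, threshold_per_haplotype=0.03, expected_haplotypes=2)
```

With a threshold per haplotype of 3% and two expected haplotypes, the effective threshold is 1.5%, so the sequence observed in 1% of cases is removed.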

After filtering, the processing unit 401 of the apparatus 400 can start to aggregate allele sequences into haplotypes in stage 3 of figure 7. As already described above, to this end the processing unit 401 is configured to generate a first aggregated set of allele sequences on the basis of the initial set of allele sequences by combining those allele sequences from the initial set of allele sequences, which have the same alleles in overlapping sequence portions and belong to the same haplotype, into an aggregated allele sequence, wherein the first aggregated set of allele sequences comprises the aggregated allele sequences and the allele sequences from the initial set of allele sequences that are not combined into an aggregated allele sequence.

According to an embodiment, the processing unit 401 of the apparatus 400 is configured to combine those allele sequences from the initial set of allele sequences that have an overlapping allele portion into an aggregated sequence, wherein the aggregated sequence comprises the overlapping sequence portion and non-overlapping alleles from those allele sequences in an order of nucleotide sites associated with alleles, i.e. according to the alignment of each of those allele sequences with the reference nucleotide sequence. This is further illustrated in figures 8A-C. By way of example, there are two alleles GCC (at sites 1 - 3) and TA (at sites 6 - 7) in different sequences, as shown in figure 8A. It is questionable whether they are from the same or from different haplotypes. The reasons to move alleles into the same haplotype can be, for instance, that the alleles are in the same sequence or that different sequences have an overlapping sequence portion. According to the above reasons, alleles GCCCC (sites 1 - 5) and CTTA (sites 4 - 7), as shown in figure 8B, belong to different haplotypes as they are in different sequences and have different alleles C and T overlapping with each other at site 5. On the other hand, alleles GCCCC (sites 1 - 5) and CCAT (sites 4 - 7) are in the same haplotype as shown in figure 8C, since their overlapping sequence portion comprises the same alleles CC at sites 4 and 5. Therefore, different haplotypes can be found on the basis of identification of different alleles overlapping with each other, while sequences with the same overlapping alleles can be merged into one haplotype. Thus, to find all possible variants of allele aggregations, according to an embodiment they can be merged in repeated cycles until new allele aggregations cannot be merged with others anymore.
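The examples of figures 8B and 8C can be reproduced with the following Python sketch. It assumes that an allele sequence is represented as a mapping from nucleotide site to symbol; the helper names `to_sites`, `try_merge`, `aggregate` and `as_string` are hypothetical, and the loop in `aggregate` is only one possible way to realize the repeated merging cycles.

```python
def to_sites(start, symbols):
    """Map each symbol to its nucleotide site, e.g. (4, 'CCAT') -> {4:'C', ...}."""
    return {start + i: s for i, s in enumerate(symbols)}

def as_string(seq):
    return "".join(seq[site] for site in sorted(seq))

def try_merge(a, b):
    """Merge two allele sequences if they overlap and agree on every shared site."""
    shared = a.keys() & b.keys()
    if not shared or any(a[site] != b[site] for site in shared):
        return None
    return dict(sorted({**a, **b}.items()))

def aggregate(sequences):
    """Repeat pairwise merging until no new aggregated sequence appears."""
    pool = list(sequences)
    changed = True
    while changed:
        changed = False
        for i in range(len(pool)):
            for j in range(i + 1, len(pool)):
                merged = try_merge(pool[i], pool[j])
                if merged is not None and merged not in pool:
                    pool.append(merged)
                    changed = True
    return pool

first = to_sites(1, "GCCCC")        # sites 1-5
conflicting = to_sites(4, "CTTA")   # sites 4-7, differs from `first` at site 5
matching = to_sites(4, "CCAT")      # sites 4-7, agrees with `first` at sites 4-5

pool = aggregate([first, conflicting, matching])
```

Here `first` and `matching` are combined into one aggregated sequence spanning sites 1 - 7, whereas `first` and `conflicting` are kept apart because of the C/T conflict at site 5. The sketch keeps the merged inputs in the pool, leaving their removal to the subsequent fragment-removal stage.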

In stage 4 of figure 7, according to an embodiment the processing unit 401 of the apparatus 400 is further configured to remove those allele sequences from the first aggregated set of allele sequences that are portions, i.e. fragments of at least one other allele sequence of the first aggregated set of allele sequences.
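The fragment removal of stage 4 can be sketched in Python as follows, assuming the same site-to-symbol mapping as an illustrative representation. A sequence is treated as a fragment when all of its (site, symbol) pairs are contained in a strictly larger sequence; the function name `remove_fragments` is hypothetical.

```python
def to_sites(start, symbols):
    """Map each symbol to its nucleotide site."""
    return {start + i: s for i, s in enumerate(symbols)}

def remove_fragments(sequences):
    """Keep only sequences that are not a proper part of another sequence."""
    return [
        a for a in sequences
        # dict.items() views are set-like, so `<` tests for a proper subset.
        if not any(a.items() < b.items() for b in sequences)
    ]

first = to_sites(1, "GCCCC")
matching = to_sites(4, "CCAT")
aggregated = to_sites(1, "GCCCCAT")   # contains both sequences above
conflicting = to_sites(4, "CTTA")     # not contained in any other sequence

remaining = remove_fragments([first, matching, aggregated, conflicting])
```

Exact duplicates are not proper subsets of each other and would both survive this sketch; removing them would require a separate de-duplication step.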

When the aggregated allele sequences have no overlapping alleles among one another, it is possible that they are either in the same or in different haplotypes. Therefore, as already described above, in stage 5 of figure 7 the processing unit 401 of the apparatus 400 is configured to generate a second aggregated set of allele sequences on the basis of the first aggregated set of allele sequences by concatenating pairs of neighboring allele sequences from the first aggregated set of allele sequences, wherein neighboring allele sequences comprise alleles in neighboring nucleotide sites, but no overlapping alleles.
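A possible reading of stage 5 is sketched below in Python. The description does not define how close two sequences must be to count as neighboring, so the sketch introduces a hypothetical parameter `max_gap`: with the default of 1, the right sequence must start at the site directly after the last site of the left sequence. The site-to-symbol representation and the function names are likewise assumptions.

```python
from itertools import permutations

def to_sites(start, symbols):
    """Map each symbol to its nucleotide site."""
    return {start + i: s for i, s in enumerate(symbols)}

def as_string(seq):
    return "".join(seq[site] for site in sorted(seq))

def concatenate_neighbors(sequences, max_gap=1):
    """Concatenate ordered pairs that do not overlap and lie within max_gap sites."""
    result = []
    for left, right in permutations(sequences, 2):
        gap = min(right) - max(left)   # distance from last site of left to first of right
        if 1 <= gap <= max_gap:        # gap >= 1 guarantees there is no overlap
            result.append({**left, **right})
    return result

a = to_sites(1, "GCC")   # sites 1-3
b = to_sites(1, "GTC")   # sites 1-3, an alternative variant at the same sites
c = to_sites(4, "TA")    # sites 4-5, neighbor of both a and b

second_set = [as_string(s) for s in concatenate_neighbors([a, b, c])]
```

Because nothing in the data decides whether `c` belongs with `a` or with `b`, both concatenations are produced as candidate haplotypes, which is why a later selection step may be needed.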

In stage 6 of figure 7, in case the number of allele sequences of the second aggregated set of allele sequences is larger than an expected value (e.g. a predefined threshold), the processing unit 401 of the apparatus 400 is further configured to identify haplotypes in the plurality of sample nucleotide sequences by calculating a probability measure for each allele sequence of the second aggregated set of allele sequences on the basis of a statistical method, wherein the probability measure indicates the probability that an allele sequence is a haplotype and by identifying haplotypes in the second aggregated set of allele sequences on the basis of the probability measure.
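The selection logic of stage 6 can be outlined in Python as below. The probability measure itself is outside the scope of this sketch: `probability` is a hypothetical callable standing in for whatever statistical method is used, and the scores in the example are made up purely for illustration.

```python
def select_haplotypes(candidates, probability, expected_count):
    """Return all candidates, or the most probable ones if there are too many."""
    if len(candidates) <= expected_count:
        return list(candidates)
    ranked = sorted(candidates, key=probability, reverse=True)
    return ranked[:expected_count]

# Illustrative scores only; a real probability measure would come from
# the statistical method applied to the sample sequences.
scores = {"GCCTA": 0.55, "GTCTA": 0.40, "GCCTT": 0.05}
chosen = select_haplotypes(list(scores), scores.get, expected_count=2)
```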

The statistical method can comprise a Bayesian method on the basis of a Hidden Markov Model (HMM), i.e. a pair-HMM method described in "Haplotype inference using a Hidden Markov Model with efficient Markov Chain sampling" by Shuying S., a thesis for the degree of doctor of philosophy, Toronto, 2007. This method is implemented, for example, in the Genome Analysis Toolkit. As already described above, in a final stage the processing unit 401 of the apparatus 400 is configured to identify haplotypes in the plurality of sample nucleotide sequences on the basis of the second aggregated set of allele sequences, and these haplotypes are the output of the overlapping method.

Figure 9 shows a schematic diagram of an adaptive strategy of haplotyping implemented in the apparatus 400 and the method 500 according to an embodiment. In an embodiment, the apparatus 400 is configured to determine whether the reference sequence has any repetitions. If the number of repetitions (also referred to as frequency) is larger than a predefined threshold, the apparatus 400 can be configured to use a conventional de-novo assembly, in particular De Bruijn graph reassembling. Otherwise, i.e. if the number of repetitions is smaller than the predefined threshold, the apparatus 400 can be configured to use overlapping haplotyping as implemented by embodiments of the invention.

In an embodiment, the adaptive haplotyping method comprises a novel method of hash-code generation, which is illustrated in figure 10. The hash code generation implemented in the apparatus 400 according to an embodiment comprises the following main steps: firstly, initializing an integer Count and a Hash-Code with zeros, and an empty Set; secondly, if the reference sequence of nucleotide symbols has a next nucleotide symbol, selecting the next nucleotide symbol; on the basis of the selected nucleotide symbol generating a unique Hash-Code; if the Set contains the Hash-Code, then incrementing the value for the Count, or else adding the Hash-Code into the Set; if the Count value is equal to a predefined threshold, finishing the loop and using the de-novo assembly (e.g., the De Bruijn graph method); and thirdly, if after counting all identical Hash-Codes in the region the Count value is still lower than the predefined threshold, then generating haplotypes by the overlapping haplotyping method.

The method of adaptive haplotyping comprises three main stages. In stage 1, it begins with initialization of an integer Count and a Hash-Code with zero values and creation of an empty Set with integers, which are used in the next step. In stage 2, if the reference sequence within the current region has a next nucleotide symbol, this symbol is selected and used for generating a unique Hash-Code, which will be described further below. If the Set contains the generated Hash-Code, then the Count value is incremented, otherwise the Hash-Code is added into the Set. When the Count is incremented, it will be checked whether it is equal to a predefined threshold; if true, the cycle is finished and the de-novo assembly will be used for the current region, wherein the de-novo assembly method can comprise the known reassembling by the De Bruijn graph, which is implemented, for example, in the open-source Genome Analysis Toolkit. In stage 3, if the loop is finished and the Count value is still lower than the predefined threshold, the overlapping haplotyping method can be applied for generating haplotypes.
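The three stages can be summarized in a short Python sketch. To keep it independent of the Hash-Code computation, the sketch uses the subsequence string itself as a stand-in for the Hash-Code and only considers complete windows of `length` symbols; the function name `choose_assembler` and the returned labels are hypothetical.

```python
def choose_assembler(reference_region: str, length: int, threshold: int) -> str:
    """Decide between de-novo assembly and overlapping haplotyping for a region."""
    count = 0          # stage 1: Count starts at zero
    seen = set()       # stage 1: empty Set of codes already encountered
    for i in range(len(reference_region) - length + 1):   # stage 2
        code = reference_region[i:i + length]   # stand-in for the Hash-Code
        if code in seen:
            count += 1
            if count == threshold:
                return "de_novo"    # too many repetitions: e.g. De Bruijn graph
        else:
            seen.add(code)
    return "overlapping"            # stage 3: Count stayed below the threshold

repetitive = choose_assembler("ACACACACAC", length=2, threshold=3)
unique = choose_assembler("ACGTTGCA", length=2, threshold=3)
```

The loop stops as soon as the threshold is reached, so highly repetitive regions are classified without scanning the whole region.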

According to an embodiment, the efficiency of the adaptive haplotyping is mainly determined by the generation of the unique Hash-Code. This Hash-Code method is applied to the nucleotide subsequence of a predefined length and comprises the following steps illustrated in figure 10: a first step of replacing nucleotide symbols with corresponding values from 0 to 3 (i.e., A:0, C:1, G:2, T:3); a second step of shifting the current value of the Hash-Code left by 2 bits; a third step of applying the bitwise OR to the result of the previous step and the nucleotide value (from the first step); a fourth step of applying a binary mask to the result of the previous step, wherein the last 2 × (predefined subsequence length) bits are filled with 1 and the others with 0, and returning the result as a new value of the Hash-Code.
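The four steps translate almost directly into bit operations, as the following Python sketch shows. The names `CODES`, `next_hash` and `hash_of` are hypothetical; the 2-bit encoding and the mask of 2 × length one-bits follow the steps above.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}   # step 1: symbol -> value from 0 to 3

def next_hash(hash_code: int, symbol: str, length: int) -> int:
    """Update the Hash-Code with one more nucleotide symbol."""
    mask = (1 << (2 * length)) - 1          # last 2*length bits are 1, others 0
    shifted = hash_code << 2                # step 2: shift left by 2 bits
    combined = shifted | CODES[symbol]      # step 3: bitwise OR with the value
    return combined & mask                  # step 4: apply the binary mask

def hash_of(subsequence: str, length: int) -> int:
    """Hash-Code after feeding all symbols of a subsequence, starting from 0."""
    code = 0
    for symbol in subsequence:
        code = next_hash(code, symbol, length)
    return code

rolling = hash_of("ACGT", length=3)   # the mask drops the oldest symbol 'A'
direct = hash_of("CGT", length=3)
```

Because the mask discards everything older than the last `length` symbols, feeding one more symbol yields the code of the next window in constant time, and two windows share a code exactly when they contain the same subsequence.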

Thus, according to an embodiment, the method of adaptive haplotyping can efficiently perform haplotyping on the regions of a genome with different frequencies of repetitions of nucleotide sequences by generating a unique Hash-Code for quick identification of repetitive subsequences of a predefined length, and hence can determine to apply the novel method, i.e., overlapping haplotyping, which is applicable to the regions with a low frequency of repetitions, or to apply the de-novo assembly method, which is applicable to the regions with a high frequency of repetitions.

According to an embodiment, the method of overlapping haplotyping 500 is performed in a genome of esophageal squamous cell carcinoma with a high frequency of alleles. A modified version of the open-source software Genome Analysis Toolkit is used, which is provided by the Broad Institute. In figure 11, a schematic diagram of the modified pipeline of the Genome Analysis Toolkit is shown. The Genome Analysis Toolkit already provides implemented identification of active regions for haplotyping 1101, de-novo assembly of plausible haplotypes with De Bruijn graph 1102, and selection of haplotypes by Pair-HMM 1103. In this embodiment the assembly by De Bruijn graph is replaced with the assembly by the overlapping haplotyping (above steps 1 to 5). The peculiarities of step implementation, input parameters and results of haplotyping will be described further below.

To compare results of the original and modified implementation, an expected quantity of haplotypes, 2, and a filtering threshold per haplotype, 3%, are provided as inputs for overlapping haplotyping according to an embodiment. Four intervals of a genome of esophageal squamous cell carcinoma with a high frequency of mutations have been analyzed.

Figure 12 shows a table illustrating results and execution times of haplotyping with the De Bruijn graph reassembling (DBGR) and overlapping assembly (OA) for different intervals of the genome. As can be taken from the table of figure 12, the quality and quantity of alleles identified by the overlapping method according to embodiments of the invention are generally better than those identified by the conventional algorithm; in particular, the execution times of the overlapping haplotyping method are improved by 3 to 4 times.

The visualization of the haplotyping results is presented in figure 13, wherein the haplotypes of chromosome 4 in the interval 190610 - 190645 kb generated by the De Bruijn graph reassembling and overlapping assembly implemented in embodiments of the invention are shown. Figure 13 shows that the alleles identified by both methods are almost the same and the quantities of the identified alleles are also very similar, which confirms that the overlapping assembly (OA) implemented by embodiments of the invention can work as well as if not better than the conventional De Bruijn graph reassembling (DBGR).

According to an embodiment, the proposed method for adaptive haplotyping can efficiently identify haplotypes in a human genome, wherein the adaptive haplotyping can be performed by a modified version of the Genome Analysis Toolkit. Again, the input parameters comprise an expected quantity of haplotypes, 2, and a filtering threshold per haplotype, 3%. Haplotypes are identified in the 20th chromosome of the human genome NA12878 from the dataset provided by the University of California, Berkeley. An assessment of the haplotyping quality is conducted by the open-source software SMaSH tool provided by the University of California, Berkeley and is shown below in figures 14A and 14B.

Figure 14A shows a schematic diagram illustrating a comparison of precision between the adaptive haplotyping implemented in embodiments of the invention and the De Bruijn graph as a function of a repetition value (R), wherein the y-axis indicates a ratio of precision of the adaptive haplotyping to that of the De Bruijn graph and the x-axis indicates a level of repetition (R). Similarly, figure 14B shows a schematic diagram illustrating a comparison of execution time between the adaptive haplotyping implemented in embodiments of the invention and the De Bruijn graph as a function of a repetition value (R), where the y-axis indicates a ratio of execution time of the adaptive haplotyping to that of the De Bruijn graph and the x-axis indicates a level of repetition (R).

The predefined level of repetition (R) can be used as both the length and the quantity of repetitive subsequences in the following steps of the overlapping haplotyping method. During generation of a unique Hash-Code, R determines the binary mask: the last 2×R bits are filled with 1. During implementation of the adaptive strategy, the de-novo reassembling with the De Bruijn graph method can be used for haplotyping if the quantity of the identified repetitive subsequences is higher than R, or the overlapping assembly implemented in embodiments of the invention can be used if the quantity of the identified repetitive subsequences is lower than R. According to figures 14A and 14B, at the threshold of repetitions, R = 9, the execution time of the adaptive haplotyping implemented in embodiments of the invention is 2 times shorter than that of the De Bruijn graph reassembling and the precision is no less. The result of the best accuracy is shown at R = 8, with an execution time improved by 1.7 times. According to these results, the value of R = 8 can be recommended as the threshold for the length and for the quantity of repetitive subsequences when the adaptive haplotyping implemented in embodiments of the invention is applied to the human genome.

While a particular feature or aspect of the disclosure may have been disclosed with respect to only one of several implementations or embodiments, such feature or aspect may be combined with one or more other features or aspects of the other implementations or embodiments as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "include", "have", "with", or other variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprise". Also, the terms "exemplary", "for example" and "e.g." are merely meant as an example, rather than the best or optimal. The terms "coupled" and "connected", along with derivatives, may have been used. It should be understood that these terms may have been used to indicate that two elements cooperate or interact with each other regardless of whether they are in direct physical or electrical contact, or they are not in direct contact with each other.

Although specific aspects have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.

Although the elements in the following claims are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.

Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

Claims

1. An apparatus (400) for identifying haplotypes in a plurality of sample nucleotide sequences on the basis of a reference nucleotide sequence, the apparatus (400) comprising a processing unit (401) configured to: generate an initial set of allele sequences by extracting a plurality of allele sequences from the plurality of sample nucleotide sequences on the basis of the reference nucleotide sequence, wherein each allele of each of the plurality of allele sequences is associated with a nucleotide site in the reference nucleotide sequence; generate a first aggregated set of allele sequences on the basis of the initial set of allele sequences by combining those allele sequences from the initial set of allele sequences, which have the same alleles in overlapping sequence portions and belong to the same haplotype, into an aggregated allele sequence, wherein the first aggregated set of allele sequences comprises the aggregated allele sequences and the allele sequences from the initial set of allele sequences that are not combined into an aggregated allele sequence; generate a second aggregated set of allele sequences on the basis of the first aggregated set of allele sequences by concatenating pairs of neighboring allele sequences from the first aggregated set of allele sequences, wherein neighboring allele sequences comprise alleles in neighboring nucleotide sites, but no overlapping alleles; and identify haplotypes in the plurality of sample nucleotide sequences on the basis of the second aggregated set of allele sequences.

2. The apparatus (400) of claim 1, wherein the processing unit (401) is further configured to filter the initial set of allele sequences by removing accidental variants from the initial set of allele sequences.

3. The apparatus (400) of claim 2, wherein the processing unit (401) is configured to filter the initial set of allele sequences by removing the accidental variants from the initial set of allele sequences by removing those allele sequences from the initial set of allele sequences that have an appearance frequency below a filtering threshold value, wherein the appearance frequency indicates how many times an allele sequence repeats among the initial set of allele sequences.

4. The apparatus (400) of any one of the preceding claims, wherein the processing unit (401) is further configured to remove those allele sequences from the first aggregated set of allele sequences that are portions of at least one other allele sequence of the first aggregated set of allele sequences.

5. The apparatus (400) of any one of the preceding claims, wherein the processing unit (401) is configured to generate the initial set of allele sequences by extracting those allele sequences from the plurality of sample nucleotide sequences that have at least one nucleotide not matching the corresponding nucleotide of the reference nucleotide sequence at the corresponding nucleotide site.

6. The apparatus (400) of any one of the preceding claims, wherein the processing unit (401) is configured to combine those allele sequences from the initial set of allele sequences that have an overlapping allele portion into an aggregated sequence, wherein the aggregated sequence comprises the overlapping sequence portion and non-overlapping alleles from those allele sequences in an order of nucleotide sites associated with alleles, in particular according to the alignment of each of those allele sequences with the reference nucleotide sequence.

7. The apparatus (400) of any one of the preceding claims, wherein, in case the number of allele sequences of the second aggregated set of allele sequences is larger than an expected value, the processing unit (401) is further configured to identify haplotypes in the plurality of sample nucleotide sequences by calculating a probability measure for each allele sequence of the second aggregated set of allele sequences on the basis of a statistical method, wherein the probability measure indicates the probability that an allele sequence is a haplotype and by identifying haplotypes in the second aggregated set of allele sequences on the basis of the probability measure.

8. The apparatus (400) of claim 7, wherein the statistical method comprises a Bayesian method on the basis of a Hidden Markov Model (HMM).

9. The apparatus (400) of any one of the preceding claims, wherein the processing unit (401) is further configured to determine the number of repetitions in the reference nucleotide sequence and to identify the haplotypes in the plurality of sample nucleotide sequences on the basis of the reference nucleotide sequence, in case the number of repetitions is lower than a repetition threshold.

10. The apparatus (400) of any one of the preceding claims, wherein the processing unit (401) is further configured to:

(i) generate, if the reference nucleotide sequence has a next nucleotide symbol, a hash code on the basis of the next nucleotide symbol; (ii) increment a counter value, in case the generated hash code is already part of a set of generated hash codes or add, in case the generated hash code is not part of the set of generated hash codes, the generated hash code to the set of generated hash codes; (iii) repeat (i) and (ii) as long as the counter value is smaller than a predefined threshold counter value; and

(iv) identify the haplotypes in the plurality of sample nucleotide sequences on the basis of the reference nucleotide sequence, in case the counter value is smaller than the predefined threshold counter value.

11. The apparatus (400) of claim 10, wherein the processing unit (401) is configured to generate the hash code on the basis of the next nucleotide symbol by: replacing the nucleotide symbol (A, C, G or T) with unique sequences of two bits; shifting the current value of the hash code by 2 bits left; applying the bitwise OR operation to the shifted hash code and the corresponding unique sequence of two bits; applying a binary mask to the result of the bitwise OR operation, wherein the first two bits of the binary mask are 0 and the remaining bits of the binary mask are 1.

12. A method (500) for identifying haplotypes in a plurality of sample nucleotide sequences on the basis of a reference nucleotide sequence, the method (500) comprising: generating (501) an initial set of allele sequences by extracting a plurality of allele sequences from the plurality of sample nucleotide sequences on the basis of a reference nucleotide sequence, wherein each allele of each of the plurality of allele sequences is associated with a nucleotide site in the reference nucleotide sequence; generating (503) a first aggregated set of allele sequences on the basis of the initial set of allele sequences by combining those allele sequences from the initial set of allele sequences, which have the same alleles in overlapping sequence portions and belong to the same haplotype, into an aggregated allele sequence, wherein the first aggregated set of allele sequences comprises the aggregated allele sequences and the allele sequences from the initial set of allele sequences that are not combined into an aggregated allele sequence; generating (505) a second aggregated set of allele sequences on the basis of the first aggregated set of allele sequences by concatenating pairs of neighboring allele sequences from the first aggregated set of allele sequences, wherein neighboring allele sequences comprise alleles in neighboring nucleotide sites, but no overlapping alleles; and identifying (507) haplotypes in the plurality of sample nucleotide sequences on the basis of the second aggregated set of allele sequences.

13. The method (500) of claim 12, wherein the method (500) further comprises filtering the initial set of allele sequences by removing accidental variants from the initial set of allele sequences.

14. The method (500) of claim 13, wherein the step of filtering the initial set of allele sequences by removing the accidental variants from the initial set of allele sequences comprises removing those allele sequences from the initial set of allele sequences that have an appearance frequency below a filtering threshold value, wherein the appearance frequency indicates how many times an allele sequence repeats among the initial set of allele sequences.

15. The method (500) of any one of claims 12 to 14, wherein the method (500) further comprises removing those allele sequences from the first aggregated set of allele sequences that are portions of at least one other allele sequence of the first aggregated set of allele sequences.

16. The method (500) of any one of claims 12 to 15, wherein the step (501) of generating the initial set of allele sequences comprises extracting those allele sequences from the plurality of sample nucleotide sequences that have at least one nucleotide not matching the corresponding nucleotide of the reference nucleotide sequence at the corresponding nucleotide site.

17. The method (500) of any one of claims 12 to 16, wherein the step of combining comprises combining those allele sequences from the initial set of allele sequences that have an overlapping allele portion into an aggregated sequence, wherein the aggregated sequence comprises the overlapping sequence portion and non-overlapping alleles from those allele sequences in an order of nucleotide sites associated with alleles, in particular according to the alignment of each of those allele sequences with the reference nucleotide sequence.

18. The method (500) of any one of claims 12 to 17, wherein, in case the number of allele sequences of the second aggregated set of allele sequences is larger than an expected value, the step (507) of identifying haplotypes in the plurality of sample nucleotide sequences comprises calculating a probability measure for each allele sequence of the second aggregated set of allele sequences on the basis of a statistical method, wherein the probability measure indicates the probability that an allele sequence is a haplotype and identifying haplotypes in the second aggregated set of allele sequences on the basis of the probability measure.

19. The method (500) of any one of claims 12 to 18, wherein the method (500) comprises the further steps of:

(i) generating, if the reference nucleotide sequence has a next nucleotide symbol, a hash code on the basis of the next nucleotide symbol; (ii) incrementing a counter value, in case the generated hash code is already part of a set of generated hash codes or adding, in case the generated hash code is not part of the set of generated hash codes, the generated hash code to the set of generated hash codes; (iii) repeating steps (i) and (ii) as long as the counter value is smaller than a predefined threshold counter value; and

(iv) identifying the haplotypes in the plurality of sample nucleotide sequences on the basis of the reference nucleotide sequence, in case the counter value is smaller than the predefined threshold counter value.

20. The method (500) of claim 19, wherein the step of generating the hash code on the basis of the next nucleotide symbol comprises: replacing the nucleotide symbol (A, C, G or T) with unique sequences of two bits; shifting the current value of the hash code by 2 bits left; applying the bitwise OR operation to the shifted hash code and the corresponding unique sequence of two bits; applying a binary mask to the result of the bitwise OR operation, wherein the first two bits of the binary mask are 0 and the remaining bits of the binary mask are 1.

21. A computer program comprising program code for performing the method (500) according to claims 12 to 20, when executed on a computer or a processor.