WO2015081754A1

WO2015081754A1 - Genome compression and decompression

Info

Publication number: WO2015081754A1
Application number: PCT/CN2014/088400
Authority: WO
Inventors: Jiandong Ding; Junchi Yan; Yanan Zhang; Min GONG; Yunjie QIU
Original assignee: International Business Machines Corporation; Ibm (China) Co., Limited
Priority date: 2013-12-06
Filing date: 2014-10-11
Publication date: 2015-06-11
Also published as: CN104699998A; DE112014005580T5; US10679727B2; US20160306919A1

Abstract

The present invention relates to a method and apparatus for genome compression and decompression. In one embodiment of the present invention, there is provided a method for genome compression, comprising: selecting from a reference database a reference genome that matches the genome; building an index based on positions of the reference genome's multiple segments in the reference genome; aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and generating a compressed genome, the compressed genome comprising at least the index and the difference data. In other embodiments, there is provided an apparatus for genome compression. Further, there are provided a method and apparatus for decompressing the genome that has been compressed using the above method and apparatus. By means of the technical solution of the present invention, the data compression ratio can be enhanced, and a specified position in the genome can be accessed without a need to decompress the entire genome.

Description

GENOME COMPRESSION AND DECOMPRESSION

FIELD

Various embodiments of the present invention relate to data compression and decompression, and more specifically, to a method and apparatus for genome compression and decompression.

BACKGROUND

With the development of biology, research on biological genes has gone deeper and deeper, e.g., into various aspects such as human health, medicine research & development, new plant and animal species and microorganisms.

In short, sequencing a biological genome refers to recording the sequence of base pairs composing the chromosomes of an organism. Usually the process of measuring the genome of the first sample of a species is referred to as sequencing, while the process of measuring the genome of another sample of the species is referred to as re-sequencing. A breakthrough has been achieved in sequencing and re-sequencing technologies, with the costs involved steadily decreasing. More and more individuals and/or organizations have come to realize the significance of genomes, and so far genome data of a large number of species have been obtained through sequencing/re-sequencing.

The human genome comprises about 3 billion base pairs; according to existing representation modes, a human genome consists of about 6 billion characters (the characters A, G, T and C). Therefore, storing each genome takes up much storage space. When there is a need to store a large number of genomes or to copy and transmit genomes, the challenge arises of how to enhance data storage and data transmission efficiency.

SUMMARY

Biologists have found that there is a certain similarity among genomes of various samples of the same species. For example, the similarity among human genomes is much higher than the similarity between genomes of humans and other species; further, the similarity among genomes of the yellow race is usually higher than the similarity between genomes of the yellow race and the white race.

Therefore, it is desired to develop a technical solution for compressing/decompressing a genome based on the similarity among genomes. It is desired that the technical solution can be integrated with existing genome storage modes, make full use of the similarity among genomes, and thereby achieve efficient compression/decompression; in addition, while effectively enhancing the data compression ratio, it is further desired that decompression can be implemented with respect to only a portion of the genome rather than the entire genome.

In one embodiment of the present invention, there is provided a method for genome compression, comprising: selecting from a reference database a reference genome that matches the genome; building an index based on positions of the reference genome's multiple segments in the reference genome; aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and generating a compressed genome, the compressed genome comprising at least the index and the difference data.

In one embodiment of the present invention, the selecting from a reference database a reference genome that matches the genome comprises: selecting the reference genome based on at least one of at least one phenotypic trait characterizing reference genomes in the reference database and at least one predefined sequence in reference genomes in the reference database.

In one embodiment of the present invention, the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length. If annotation information associated with the reference genome can be obtained, then that information is used in preference.

In one embodiment of the present invention, there is provided a method for genome decompression, comprising: in response to receiving a compressed genome that has been compressed according to a method of the present invention, obtaining from a reference database a reference genome that matches the compressed genome; and decompressing, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.

In one embodiment of the present invention, there is provided an apparatus for genome compression, comprising: a selecting module configured to select from a reference database a reference genome that matches the genome; an indexing module configured to build an index based on positions of the reference genome's multiple segments in the reference genome; an aligning module configured to align the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and a generating module configured to generate a compressed genome, the compressed genome comprising at least the index and the difference data.

In one embodiment of the present invention, the selecting module comprises at least one of: a first selecting module configured to select the reference genome based on at least one phenotypic trait characterizing reference genomes in the reference database; and a second selecting module configured to select the reference genome based on at least one predefined sequence in reference genomes in the reference database.

In one embodiment of the present invention, the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.

In one embodiment of the present invention, there is provided an apparatus for genome decompression, comprising: an obtaining module configured to, in response to receiving a compressed genome that has been compressed according to a method of the present invention, obtain from a reference database a reference genome that matches the compressed genome; and a decompressing module configured to decompress, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.

By means of the technical solution according to the embodiments of the present invention, a representative genome may be used as a reference genome; when storing a new to-be-processed genome, only the difference between the to-be-processed genome and the reference genome is saved, thereby reducing the amount of data significantly. On the other hand, with the technical solution according to the embodiments of the present invention, where a compressed genome includes an index, any base pair in the genome can be found rapidly by querying the index, and further a gene segment desired to be accessed can be found rapidly without decompressing the entire compressed genome.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

Through the more detailed description of some embodiments of the present disclosure in the accompanying drawings, the above and other objects, features and advantages of the present disclosure will become more apparent, wherein the same reference generally refers to the same components in the embodiments of the present disclosure.

Fig. 1 schematically shows an exemplary computer system which is applicable to implement the embodiments of the present invention;

Fig. 2 schematically shows a diagram of the data structure of a genome obtained from sequencing an organism;

Fig. 3 schematically shows a schematic view of a method for genome compression according to one embodiment;

Fig. 4 schematically shows a schematic view of a method for genome compression according to one embodiment of the present invention;

Fig. 5 schematically shows a schematic view of the process for building an index according to the embodiments of the present invention;

Figs. 6A to 6C schematically show respective schematic views for identifying difference data between a genome and a reference genome according to one embodiment of the present invention;

Fig. 7 schematically shows a flowchart of a method for decompressing a compressed genome according to one embodiment of the present invention; and

Fig. 8A schematically shows a block diagram of an apparatus for genome compression according to one embodiment of the present invention, and Fig. 8B schematically shows a block diagram of an apparatus for decompressing a compressed genome according to one embodiment of the present invention.

DETAILED DESCRIPTION

Some preferable embodiments will be described in more detail with reference to the accompanying drawings, in which the preferable embodiments of the present disclosure have been illustrated. However, the present disclosure can be implemented in various manners, and thus should not be construed to be limited to the embodiments disclosed herein. On the contrary, those embodiments are provided for the thorough and complete understanding of the present disclosure, and for completely conveying the scope of the present disclosure to those skilled in the art.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, in some embodiments, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electro-magnetic signal, optical signal, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instruction means which implements the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

Referring now to Fig. 1, a block diagram of an exemplary computer system/server 12 which is applicable to implement the embodiments of the present invention is illustrated. Computer system/server 12 illustrated in Fig. 1 is only illustrative and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention described herein.

As illustrated in Fig. 1, computer system/server 12 is illustrated in the form of a general-purpose computing device. The components of computer system/server 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and processing units 16.

Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.

Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12, and it includes both volatile and non-volatile media, removable and non-removable media.

System memory 28 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not illustrated in Fig. 1 and typically called a "hard drive"). Although not illustrated in Fig. 1, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each drive can be connected to bus 18 by one or more data media interfaces. As will be further depicted and described below, memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the present invention.

Program/utility 40, having a set (at least one) of program modules 42, may be stored in memory 28 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments of the present invention as described herein.

Computer system/server 12 may also communicate with one or more external devices 14 such as a keyboard, a pointing device, a display 24, etc.; one or more devices that enable a user to interact with computer system/server 12; and/or any devices (e.g., network card, modem, etc.) that enable computer system/server 12 to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces 22. Still yet, computer system/server 12 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 20. As depicted, network adapter 20 communicates with the other components of computer system/server 12 via bus 18. It should be understood that although not illustrated, other hardware and/or software components could be used in conjunction with computer system/server 12. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.

Fig. 2 schematically shows a diagram 200 of the data structure of a genome obtained from sequencing an organism. In this figure, reference numeral 210 shows a schematic view of a chromosome, and reference numeral 220 shows a schematic view of a genome. In short, a genome of an organism may be described by the exact arrangement of base pairs of deoxyribonucleic acid (DNA). In other words, the genome may be represented by an ordered sequence constructed from the four bases A, G, T and C. Genomes of different organisms have different lengths. For example, human genomes consist of about 3 billion base pairs (i.e., 6 billion characters), while genomes of other organisms may have different lengths.

Fig. 3 schematically shows a schematic view 300 of a method for genome compression according to one embodiment. Methods have already been proposed for genome compression by looking for differences between a current genome and a reference genome. As shown in Fig. 3, a genome 310 is a to-be-compressed genome, while a reference genome 320 is a "standard genome" serving as the alignment basis. An alignment may be made between the to-be-compressed genome 310 and the reference genome 320, and only difference data 330 between genome 310 and reference genome 320 are saved in a compressed genome.

With the development of network technologies, there already exist many organizations that can provide reference genomes, and these reference genomes can be accessed conveniently via networks. According to the genome compression method as shown in Fig. 3, only difference data (e.g., difference data 330) between the genome and the reference genome are transmitted during genome transmission, and the raw data of genome 310 can then be obtained based on the transmitted difference data 330 and reference genome 320 obtained from network access.

Although the above method can enhance the data compression efficiency to a given extent, there still exist the following drawbacks: on one hand, it is difficult to effectively select from multiple existing reference genomes a reference genome that best matches the to-be-compressed genome; on the other hand, difference data are compressed as a whole in order to achieve a higher compression ratio, so that when it is desired to access only a base pair at a specific position in the genome, the entire genome must be decompressed before the specific base pair can be located.

In view of these drawbacks in the above technical solution, the present invention proposes a method for genome compression. The method comprises: selecting from a reference database a reference genome that matches the genome; building an index based on positions of the reference genome's multiple segments in the reference genome; aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and generating a compressed genome, the compressed genome at least comprising the index and the difference data.

Fig. 4 schematically shows a schematic view 400 of a method for genome compression according to one embodiment of the present invention. First of all, in step S402 a reference genome that matches the genome is selected from a reference database. Note that multiple reference genomes are stored in the reference database here, and these reference genomes may come from multiple samples of multiple species, such as multiple reference genomes from different races (the white race, the yellow race, the brown race and the black race), and multiple reference genomes from various refined categories of other creatures. Since genomes of the same species have a higher similarity (i.e., text similarity among base characters in genomes), providing a reference database comprising abundant reference genomes helps to find a reference genome that better matches a to-be-compressed genome, so as to further enhance the data compression ratio. In the context of the present invention, "to match" means that two genomes have a higher similarity.

In addition, the reference database mentioned in the present invention may further be enriched as new to-be-compressed genomes are processed. Detailed description will be presented below in this regard.

In step S404, an index is built based on positions of the reference genome's multiple segments in the reference genome. Since a genome usually consists of billions of characters, an index may be built in order to locate specific positions in the genome more quickly. The index may be built according to multiple segments in the reference genome. In the context of the present invention, a segment refers to the bases between a starting position and an ending position in the genome. For example, at1g33500: 1-10000 represents that the segment is named at1g33500, and the starting and ending positions of bases in the segment are 1 and 10000 respectively.

In the context of the present invention, for the sake of convenience, segments are defined according to biological functions of various bases in the genome, or segments are defined in other manners. Detailed description will be presented below in this regard.
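By way of illustration only, the index of step S404 might be sketched in Python as follows. The function name build_index, the generated segment names seg0, seg1, ... and the default step-length of 10000 are assumptions introduced for this sketch and are not taken from the embodiments. Positions are 1-based and inclusive, as in the at1g33500: 1-10000 example, and annotation is preferred over the step-length when it is available.

```python
from typing import Dict, Optional, Tuple

# A segment is a 1-based, inclusive (start, end) pair of base positions.
Segment = Tuple[int, int]


def build_index(reference_length: int,
                annotations: Optional[Dict[str, Segment]] = None,
                step: int = 10000) -> Dict[str, Segment]:
    """Map segment names to their positions in the reference genome.

    If annotation is supplied it is used as-is (ordered by start position);
    otherwise the reference is cut into fixed-length segments of `step` bases.
    """
    if annotations:
        return dict(sorted(annotations.items(), key=lambda kv: kv[1][0]))

    index: Dict[str, Segment] = {}
    for i, start in enumerate(range(1, reference_length + 1, step)):
        end = min(start + step - 1, reference_length)
        index[f"seg{i}"] = (start, end)  # hypothetical naming scheme
    return index
```

With a reference of 25000 bases and no annotation, this sketch yields three segments covering 1-10000, 10001-20000 and 20001-25000.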

In step S406, the genome is aligned with the reference genome based on the multiple segments, so as to identify difference data between the genome and the reference genome. Since a genome consists of a huge number of bases, each segment among the multiple segments is taken as a unit, and the base sequence in each segment of the reference genome is aligned with the to-be-compressed genome; when a portion that matches the segment is found in the to-be-compressed genome, only differences between the portion and the character sequence in the segment are recorded.
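As a minimal sketch of recording only the differences for one segment, the following Python code uses the standard-library difflib.SequenceMatcher as a stand-in for a real genome aligner; it is not the alignment procedure of the embodiments and would be far too slow for whole genomes. The record layout (start, end, replacement), with 0-based half-open offsets into the reference segment, and the function names segment_diff and apply_diff are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (start, end, replacement): replace ref_segment[start:end] by `replacement`.
Diff = Tuple[int, int, str]


def segment_diff(ref_segment: str, target_portion: str) -> List[Diff]:
    """Record only where the matched portion differs from the reference segment."""
    matcher = SequenceMatcher(None, ref_segment, target_portion, autojunk=False)
    return [(i1, i2, target_portion[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_diff(ref_segment: str, diffs: List[Diff]) -> str:
    """Rebuild the original portion from the reference segment and its diffs."""
    out, cursor = [], 0
    for start, end, replacement in diffs:
        out.append(ref_segment[cursor:start])  # unchanged reference bases
        out.append(replacement)                # substituted or inserted bases
        cursor = end                           # skip replaced or deleted bases
    out.append(ref_segment[cursor:])
    return "".join(out)
```

Identical sequences produce an empty difference list, and applying the recorded differences to the reference segment restores the original portion.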

Finally, in step S408, a compressed genome is generated, the compressed genome comprising at least the index and the difference data. Since the compressed genome does not include base sequences that are the same as in the reference genome, the space occupied by the compressed genome can be reduced greatly. When the reference database consists of only one reference genome, the compressed genome does not have to include an identifier of the reference genome; when the reference database consists of multiple reference genomes, the compressed genome should include an identifier of the reference genome, so that the reference genome used in compression can be found through the identifier.

In addition, new reference genomes may be added to the reference database gradually; for example, the reference database may be gradually updated during genome compression. Specifically, with respect to a newly inputted genome A, when no reference genome with a higher similarity can be found in the reference database, it may be considered that genome A may belong to a new species, and thus genome A may be added to a candidate list. When genomes in the candidate list amount to a certain number, a clustering method may be used and the most representative to-be-compressed genome obtained from clustering may be added to the reference database.
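The update of the reference database could be sketched as below. No particular clustering method is named in the description, so this sketch substitutes a simple medoid selection (the candidate with the smallest total distance to all other candidates) for "the most representative genome obtained from clustering". The distance function, the min_candidates value, the function names and the emptying of the candidate list after promotion are all assumptions made for illustration.

```python
from typing import Callable, List, Optional


def hamming(a: str, b: str) -> int:
    """Toy distance: mismatching positions plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def most_representative(candidates: List[str],
                        distance: Callable[[str, str], int] = hamming) -> str:
    """Return the medoid: the candidate closest, in total, to all the others."""
    return min(candidates,
               key=lambda g: sum(distance(g, other) for other in candidates))


def maybe_promote(candidates: List[str],
                  reference_db: List[str],
                  min_candidates: int = 3) -> Optional[str]:
    """Once enough unmatched genomes have accumulated, add one to the database."""
    if len(candidates) < min_candidates:
        return None
    chosen = most_representative(candidates)
    reference_db.append(chosen)
    candidates.clear()  # assumption: the candidate list is emptied afterwards
    return chosen
```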

In addition, the purpose of including the index in the compressed genome is as follows: when there is a need to access only bases in a specific position range in the compressed genome, the portion corresponding to that position range can be quickly found among the difference data by means of the index, and partial decompression is then conducted based on the reference genome and the corresponding portion of the difference data, rather than decompressing the whole genome and then finding the specified position range therein.
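A minimal sketch of such partial decompression is given below, under the assumption that the compressed genome stores, per indexed segment, a list of (start, end, replacement) records with 0-based half-open offsets into the reference segment. The CompressedGenome class, its field names and the function decompress_segment are hypothetical and serve only to illustrate that a single segment can be restored without touching the rest of the difference data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Diff = Tuple[int, int, str]  # (start, end, replacement) within one segment


@dataclass
class CompressedGenome:
    reference_id: str                         # which reference genome was used
    index: Dict[str, Tuple[int, int]]         # name -> 1-based inclusive (start, end)
    diffs: Dict[str, List[Diff]] = field(default_factory=dict)


def decompress_segment(cg: CompressedGenome, reference: str, name: str) -> str:
    """Restore one segment using only its own index entry and difference data."""
    start, end = cg.index[name]
    ref_segment = reference[start - 1:end]    # convert to 0-based slicing
    out, cursor = [], 0
    for d_start, d_end, replacement in cg.diffs.get(name, []):
        out.append(ref_segment[cursor:d_start])
        out.append(replacement)
        cursor = d_end
    out.append(ref_segment[cursor:])
    return "".join(out)
```

A segment with no recorded differences is simply read back from the reference genome.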

Various approaches may be used to find in a reference database a reference genome that matches the genome. Specifically, the reference database may further include additional information describing at least one phenotypic trait of each reference genome, and the phenotypic trait may include multiple aspects, such as skin color and hair color for humans. Therefore, the phenotypic trait characterizing each reference genome may be described by a multi-dimensional vector V_PT = (pt1, pt2, ...). In addition, 10 levels from 1 to 10 may be set to describe colors from white to black. Therefore, the multi-dimensional vector may be represented as V_PT = (2, 3, ...). Phenotypic traits in the reference database may be stored in a format as shown in Table 1 below.

Table 1 Phenotypic Trait

Reference Genome No.	Skin Color	Hair Color	...
1	2	3	...
2	3	9	...
...	...	...	...

Since phenotypic traits of the to-be-compressed genome can be collected, a reference genome that is similar to the to-be-compressed genome can be selected by comparing phenotypic traits of the to-be-compressed genome and each reference genome. In one embodiment of the present invention, the selecting the reference genome comprises: calculating a first similarity between the at least one phenotypic trait characterizing the genome and at least one phenotypic trait characterizing a reference genome in the reference database; and selecting the reference genome with the first similarity larger than a first threshold.

Those skilled in the art may adopt various approaches to calculating the similarity. For example, a Euclidean distance between a vector V1 describing the phenotypic trait of the to-be-compressed genome and a vector V2 describing a phenotypic trait of a reference genome in the reference database is calculated and used as the first similarity. Alternatively, if the importance of a certain phenotypic trait is considered to be higher, a higher weight may be assigned to the phenotypic trait.

The reference genome with the first similarity larger than a first threshold may be selected; or when there exist multiple reference genomes each having a similarity larger than the first threshold, then the reference genome with the highest similarity may be selected. Those skilled in the art may further adopt other approaches to selecting the reference genome.
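The phenotype-based selection might be sketched as follows. Because a Euclidean distance grows as two vectors become less alike, this sketch converts the (optionally weighted) distance d into a similarity 1 / (1 + d) before comparing it with the threshold; that conversion, the function names and the example threshold are assumptions of the sketch and are not stated in the embodiments.

```python
import math
from typing import Dict, Optional, Sequence


def phenotype_similarity(v1: Sequence[float],
                         v2: Sequence[float],
                         weights: Optional[Sequence[float]] = None) -> float:
    """Similarity in (0, 1] derived from a weighted Euclidean distance."""
    if len(v1) != len(v2):
        raise ValueError("phenotype vectors must have the same dimension")
    w = weights if weights is not None else [1.0] * len(v1)
    d = math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, v1, v2)))
    return 1.0 / (1.0 + d)  # assumption: map distance to similarity


def select_by_phenotype(target: Sequence[float],
                        references: Dict[str, Sequence[float]],
                        threshold: float,
                        weights: Optional[Sequence[float]] = None) -> Optional[str]:
    """Return the id of the most similar reference above the threshold, if any."""
    scores = {rid: phenotype_similarity(target, v, weights)
              for rid, v in references.items()}
    above = {rid: s for rid, s in scores.items() if s > threshold}
    return max(above, key=above.get) if above else None
```

Using the two rows of Table 1 as reference vectors, (2, 3) and (3, 9), a to-be-compressed genome with traits (2, 4) would be matched to reference genome 1 for a threshold of 0.3, and to no reference genome for a threshold of 0.9.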

In one embodiment of the present invention, the selecting the reference genome comprises: with respect to a current reference genome in the reference database, determining a first position set of the at least one predefined sequence in the genome, and determining a second position set of the at least one predefined sequence in the current reference genome; calculating a second similarity between the first position set and the second position set; and selecting the reference genome based on the second similarity.

If it is impossible to select the reference genome based on phenotypic traits, then the reference genome may be selected based on the similarity between positions of the predefined sequence in the to-be-compressed genome and the reference genome. In the context of the present invention, the predefined sequence may be a base sequence that exerts only little impact on the division of species. For example, since humans belong to mammals, human genomes include some conserved base sequence segments that are the same as those of lower mammals; although humans can further be categorized into the white race, the yellow race and other races, genomes of each race include these conserved base sequence segments.

Nowadays biologists have successfully identified conserved base sequence segments that are correlated to each species. By comparing the similarity between positions of these conserved base sequence segments in the to-be-compressed genome and the reference genome, the species to which the to-be-compressed genome belongs can be inferred approximately, which further helps to select a reference genome that is more similar to the to-be-compressed genome.

For humans, supposing multiple conserved base sequence segments have been identified, positions of these conserved base sequences in various reference genomes can be stored in a structure as shown in Table 2 below.

Table 2 Conserved Base Sequence Segments

As in the concrete example above for phenotypic traits, positions of multiple conserved base sequence segments in one genome may be described by a vector, e.g., V_SM = (position 1, position 2, ...). A first position set (e.g., represented as a vector V_SM1) of the multiple conserved base sequence segments in the genome is determined, a second position set (e.g., represented as a vector V_SM2) of the multiple conserved base sequence segments in the reference genome is also determined, and a similarity between the two vectors is calculated, whereby the reference genome is selected.

As with the above approach to selecting the reference genome based on phenotypic traits, in one embodiment of the present invention, the reference genome may be selected by calculating a Euclidean distance or by other approaches.
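The comparison of the two position vectors can be illustrated with a minimal Python sketch. The function names, the sample positions and the mapping from distance to a similarity value are assumptions made for illustration only; the specification does not fix any of them.

```python
import math


def euclidean_distance(v1, v2):
    """Euclidean distance between two equal-length position vectors."""
    if len(v1) != len(v2):
        raise ValueError("position vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def position_similarity(v1, v2):
    """Map a distance to a similarity in (0, 1]; 1.0 means identical positions.

    This mapping is an assumption of the sketch, not part of the specification.
    """
    return 1.0 / (1.0 + euclidean_distance(v1, v2))


# Hypothetical positions of three conserved base sequence segments.
v_sm1 = [100, 2500, 9000]  # to-be-compressed genome
v_sm2 = [103, 2504, 9000]  # candidate reference genome

distance = euclidean_distance(v_sm1, v_sm2)
similarity = position_similarity(v_sm1, v_sm2)
```

A smaller distance (larger similarity) suggests that the conserved segments sit at nearly the same positions in both genomes.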

The accuracy of selecting the reference genome based on positions of conserved base sequence segments alone is limited, so multiple reference genomes each having a similarity larger than a specific threshold may first be selected from the reference database, and then the most appropriate reference genome is selected from the multiple reference genomes.

In one embodiment of the present invention, the selecting the reference genome based on the second similarity comprises: adding to a candidate list a reference genome with the second similarity larger than a second threshold; and comparing the genome with each reference genome in the candidate list so as to select from the candidate list a reference genome with a minimal difference from the genome.

In one embodiment of the present invention, a Multiple Sequence Alignment (MSA) may be used to compare the to-be-compressed genome with multiple candidate reference genomes. The Multiple Sequence Alignment is an alignment of three or more biological sequences (protein, DNA, etc.). For more details of the Multiple Sequence Alignment, reference may be made to http://en.wikipedia.org/wiki/Multiple_sequence_alignment; they are not detailed in this specification.
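The two-stage selection (candidate list, then closest candidate) may be sketched as follows. Note that this sketch does not perform a Multiple Sequence Alignment: it substitutes a simple position-wise mismatch count as a stand-in difference measure. All names and sample data are hypothetical.

```python
def mismatch_count(a, b):
    """Position-wise mismatches plus the length gap.

    A crude stand-in for a real alignment; it is NOT a Multiple Sequence Alignment.
    """
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def select_reference(genome, references, similarities, second_threshold):
    """Pick the candidate reference with the minimal difference from the genome.

    references:   dict mapping a reference name to its sequence
    similarities: dict mapping a reference name to its second similarity
    Returns None when no reference exceeds the threshold.
    """
    # Stage 1: candidate list of references above the second threshold.
    candidates = [name for name, s in similarities.items() if s > second_threshold]
    if not candidates:
        return None
    # Stage 2: the candidate that differs least from the genome.
    return min(candidates, key=lambda name: mismatch_count(genome, references[name]))


references = {"ref_a": "ATGCGTAC", "ref_b": "ATGCGTTC", "ref_c": "TTTTTTTT"}
similarities = {"ref_a": 0.9, "ref_b": 0.8, "ref_c": 0.2}
chosen = select_reference("ATGCGTTC", references, similarities, 0.5)
```

Here `ref_a` has the higher second similarity, yet `ref_b` is chosen because it differs least from the genome, which is the reason for the second stage.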

In one embodiment of the present invention， multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.

In addition to the above phenotypic traits and conserved base sequence segments， the reference database may further include annotation information associated with each reference genome. The annotation information mentioned here may be， for example， annotation information that describes functions of a base sequence between certain starting and ending positions. For example， suppose a base sequence between a starting position of 1 and an ending position of 10000 is correlated to human skin color， then annotation may be added with respect to the base sequence between positions 1-10000， indicating this base portion is correlated to human skin color. In addition， other types of annotation may be added to base sequences at other positions in the genome.

So far biologists have determined the functions of some base sequences and added many annotations to genomes. Therefore, segments may be defined based on starting and ending positions of base sequences associated with these annotations. In addition, since annotations are added to only part of the base sequences in genomes, the remaining parts without annotation information may be divided according to a predefined step-length. For example, division may be conducted in units of 1000 bases, or another predefined step-length may be set.

Fig. 5 schematically shows a schematic view 500 of the process for building an index according to one embodiment of the present invention. As shown in Fig. 5, an annotation 1 520 and an annotation 2 522 represent two annotations of a reference genome 510 in a reference database. Starting and ending positions of a base sequence corresponding to annotation 1 520 in the entire genome are position 1 540 and position 2 542, respectively. Therefore, the portion between position 1 540 and position 2 542 may act as one segment (e.g., a segment N 530). In addition, starting and ending positions of a base sequence corresponding to annotation 2 522 in the entire genome are position 2 542 and position 3 544, respectively. Therefore, the portion between position 2 542 and position 3 544 may act as another segment (e.g., a segment N+1 532). Similarly, other portions without annotation information in reference genome 510 may further be divided into segments according to a predefined step 524, whereby a segment N+2 534 is obtained. In this manner, the entire reference genome 510 may be divided into multiple segments.
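The division into annotation-based segments and fixed-step segments may be sketched in Python as below. The sketch assumes 1-based, inclusive, non-overlapping annotation ranges (so adjacent annotations do not share a boundary base); the function names and the sample annotations are hypothetical.

```python
def _fixed_step(start, end, step):
    """Cut the unannotated range start..end into segments of at most `step` bases."""
    return [(s, min(s + step - 1, end), None) for s in range(start, end + 1, step)]


def build_segments(genome_length, annotations, step):
    """Return (start, end, label) segments covering positions 1..genome_length.

    annotations: list of (start, end, label) tuples, 1-based inclusive,
                 assumed not to overlap.
    Unannotated gaps are divided according to the predefined step-length.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    segments = []
    cursor = 1
    for start, end, label in sorted(annotations):
        segments.extend(_fixed_step(cursor, start - 1, step))  # gap before annotation
        segments.append((start, end, label))                   # annotated segment
        cursor = end + 1
    segments.extend(_fixed_step(cursor, genome_length, step))  # tail after last annotation
    return segments


segments = build_segments(
    3500,
    [(1, 1000, "annotation 1"), (1001, 2000, "annotation 2")],
    step=1000,
)
```

The resulting list of start and end positions is the kind of information an index over the reference genome's segments could record.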

In one embodiment of the present invention， one of the multiple segments in the reference genome may act as a basic unit for alignment with the to-be-compressed genome； further， one sub-segment in a segment may act as a basic unit for alignment. Alignment with a sub-segment as the basic unit possibly helps to enhance the probability of matching but also might complicate the index.

In one embodiment of the present invention， the aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome comprises： with respect to a sub-segment of a current segment among the multiple segments， looking up in the genome a core area that is similar to text of the sub-segment； taking text difference between the core area and the sub-segment as at least one part of the difference data； and adding to the difference data other part than the core area in the genome.

Figs. 6A to 6C schematically show respective schematic views 600A to 600C for identifying difference data between the genome and the reference genome according to one embodiment of the present invention. Each of the multiple segments may be aligned with the to-be-compressed genome. The description below covers only how to align one segment with the to-be-compressed genome.

Specifically, the comparison may be made based on an n-gram and using a sliding window. Since the genome is a character sequence consisting of the four bases A, G, T and C, with a length on the order of billions of bases, an analysis may be conducted by means of n-grams from probabilistic language models. For more details of the n-gram, reference may be made to http://en.wikipedia.org/wiki/N-gram; they are not detailed in this specification.

In one embodiment of the present invention， based on a sum of scores corresponding to multiple n-grams in a current segment， an area whose sum is larger than a predefined threshold is used as the core area. Suppose a 3-gram (i. e. ， alignment is made with 3 bases as the basic unit) is used in one embodiment， and a score of each n-gram is calculated based on a BLOSUM matrix in this embodiment. Fig. 6A shows a score calculated with respect to each of 3-grams in a base sequence “ATGCGT....” Specifically， scores of these four basic units 3-gram 1 to 3-gram 4 are 13， 16， 14 and 18， respectively.
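As a small illustration of the sliding window, the following Python sketch lists the 3-grams of the first six bases of the example sequence. It assumes that windows overlap and advance one base at a time; the per-unit scoring based on a BLOSUM matrix is not reproduced here.

```python
def ngrams(sequence, n):
    """Overlapping n-grams obtained by sliding a window one base at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


units = ngrams("ATGCGT", 3)
```

Under this assumption, six bases yield four 3-grams (ATG, TGC, GCG, CGT), which is consistent with the four basic units mentioned for Fig. 6A.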

Fig. 6B shows how to calculate a score indicating whether the to-be-compressed genome is similar to a sub-segment in the current segment. Take 3-grams for example. When the scores of corresponding 3-grams in the to-be-compressed genome and in the sub-segment are the same, 2 is added to the total score; when the scores are different, 3 is subtracted from the total score. By comparing a to-be-compressed genome 610B with a sub-segment in a current segment 612B in Fig. 6B, the total score = 2+2+2+2+2+2+2-3+2+2+2+2+2+2+2-3+2 = 24. When the total score exceeds a predefined threshold, it may be considered that the base sequence in the to-be-compressed genome is a core area that is similar to text of the sub-segment.
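The total-score arithmetic above can be reproduced with a short Python sketch. The unit values below are placeholders (not the scores of Fig. 6A or 6B); only the pattern of 15 matching and 2 differing units is taken from the example, and the threshold value is hypothetical.

```python
MATCH_SCORE = 2         # added when two corresponding unit scores are the same
MISMATCH_PENALTY = -3   # added when they differ


def total_score(genome_units, segment_units):
    """Sum +2 / -3 over corresponding n-gram unit scores."""
    if len(genome_units) != len(segment_units):
        raise ValueError("both sequences of units must have the same length")
    return sum(
        MATCH_SCORE if g == s else MISMATCH_PENALTY
        for g, s in zip(genome_units, segment_units)
    )


def is_core_area(genome_units, segment_units, threshold):
    """A window counts as a core area when its total score exceeds the threshold."""
    return total_score(genome_units, segment_units) > threshold


# 17 placeholder unit scores; the 8th and 16th units differ, as in the example sum.
segment_units = list(range(17))
genome_units = list(segment_units)
genome_units[7] = -1
genome_units[15] = -1

score = total_score(genome_units, segment_units)   # 15 * 2 + 2 * (-3)
core = is_core_area(genome_units, segment_units, threshold=20)  # hypothetical threshold
```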

After finding the core area that is similar to text of the sub-segment of the current segment, the text difference between the core area and the sub-segment of the current segment is identified, and the identified text difference acts as one part of the difference data. Specifically, Fig. 6C shows a concrete example of the difference data. As shown in block 620C in Fig. 6C, the base in a to-be-compressed genome 610C differs from the base in a current segment 620C (i.e., there exists text difference), and the difference may be recorded as (c, A, 15), where “c” represents a change-type difference, “A” represents changing the base in the reference genome to a base “A”, and the difference appears at the 15th base.

Similarly, a difference shown in block 622C in Fig. 6C may be represented as (d, T, 9), where “d” represents a delete-type difference, “T” represents deleting a base “T”, and the difference appears at the 9th base after the last difference. Those skilled in the art may further define an insert-type difference in the same way.

Note that the difference data may be saved in a manner associated with the current segment. For example, the index may include an association between the difference data and the current segment, i.e., the association may indicate which segment in the reference genome the difference data corresponds to. Specifically, an identifier of the segment associated with difference data may be added to the header of the difference data. For example, suppose difference data (d, T, 9) shown in block 622C in Fig. 6C are associated with a segment “seg1” in the reference genome; then the difference data may be recorded as “seg1 (d, T, 9)”. Note that this is only one example of how difference data may be represented, and those skilled in the art may use other data structures to record difference data, e.g., recording in a 4-tuple form.
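One possible 4-tuple representation may be sketched in Python as follows. The sketch handles change-type differences only and assumes that the reference sub-segment and the core area have the same length; delete-type and insert-type records would need additional handling. The tuple layout (segment identifier, type, base, offset) and the function names are assumptions, with the offset counted from the previous difference, or from the start of the segment for the first record.

```python
def change_records(segment_id, reference, target):
    """Emit (segment_id, 'c', base, offset) for every differing base."""
    if len(reference) != len(target):
        raise ValueError("this sketch only handles equal-length sequences")
    records = []
    last = 0  # 1-based position of the previous difference; 0 means segment start
    for pos, (r, t) in enumerate(zip(reference, target), start=1):
        if r != t:
            records.append((segment_id, "c", t, pos - last))
            last = pos
    return records


def apply_changes(reference, records):
    """Restore the target sequence by applying change-type records to the reference."""
    bases = list(reference)
    pos = 0
    for _segment_id, kind, base, offset in records:
        if kind != "c":
            raise ValueError("only change-type records are handled in this sketch")
        pos += offset
        bases[pos - 1] = base
    return "".join(bases)


reference = "ATGCGTACGT"   # hypothetical sub-segment of segment "seg1"
target = "ATGAGTACCT"      # hypothetical core area in the to-be-compressed genome
records = change_records("seg1", reference, target)
restored = apply_changes(reference, records)
```

Storing offsets relative to the previous difference keeps the numbers small when differences are close together, which may help a later entropy-coding stage.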

In one embodiment of the present invention， when difference data correspond to a core area in a to-be-processed genome which is similar to text of a sub-segment of a current segment (i. e. ， difference data inside a core area) ， the current segment may be used as a segment associated with the difference data. In addition， when difference data are other data than various core areas in a to-be-processed genome (i. e. ， difference data outside core areas) ， a segment corresponding to a core area preceding (or following) the difference data may be used as a segment associated with the difference data. In this manner， a correspondence relationship among difference data and segments in the reference genome may be recorded explicitly. Based on the correspondence relationship and index， a specific portion in the compressed genome may be decompressed conveniently.

With the example described above， each segment (or each sub-segment of a segment) in the reference genome may be aligned with the to-be-compressed genome so as to find a corresponding core area and record text difference between each core area and a corresponding current segment (or sub-segment of a segment) . For other portions than core areas in the to-be-compressed genome， it may be considered that no base sequence that is similar to these portions exists in the reference genome， so these portions may be added to the difference data directly.

In one embodiment of the present invention, with respect to the sub-segment of the current segment among the multiple segments, the core area is expanded forward and/or backward in the genome; and in response to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area being lower than a third threshold, the expanded core area is used as the final matching area.

The above describes how to find a core area. Optionally, the core area may further be expanded forward and/or backward in the to-be-compressed genome. For example, the core area is expanded with one base as a step-length at a time. After each step, the text difference between the expanded core area and the area in the reference genome corresponding to the expanded core area is compared; when the difference is lower than a predefined threshold, the expansion is kept. Note that the expansion should not continue without limit; its purpose is to enhance the compression ratio.
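The expansion step may be sketched in Python as below. The sketch assumes an ungapped match (no insertions or deletions), 0-based start indices, and a threshold expressed as a mismatch ratio; these choices, the function names and the sample sequences are all assumptions for illustration.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def expand_core(genome, reference, g_start, r_start, length, max_mismatch_ratio):
    """Grow an ungapped core area one base at a time, backward and forward.

    An expansion step is kept only while the mismatch ratio of the expanded
    area stays below max_mismatch_ratio. Returns (g_start, r_start, length).
    """
    def acceptable(gs, rs, n):
        diff = mismatches(genome[gs:gs + n], reference[rs:rs + n])
        return diff / n < max_mismatch_ratio

    grew = True
    while grew:
        grew = False
        # Try one base backward.
        if g_start > 0 and r_start > 0 and acceptable(g_start - 1, r_start - 1, length + 1):
            g_start, r_start, length = g_start - 1, r_start - 1, length + 1
            grew = True
        # Try one base forward.
        if (g_start + length < len(genome)
                and r_start + length < len(reference)
                and acceptable(g_start, r_start, length + 1)):
            length += 1
            grew = True
    return g_start, r_start, length


genome = "CCATGCGTAA"
reference = "GGATGCGTAT"
# Core area "TGCG" found at index 3 in both sequences.
expanded = expand_core(genome, reference, 3, 3, 4, max_mismatch_ratio=0.1)
```

The loop ends as soon as neither direction can be extended, so the expansion is bounded by the threshold and by the ends of the two sequences.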

Fig. 7 schematically shows a flowchart 700 of a method for decompressing a compressed genome according to one embodiment of the present invention. Specifically， in step S702， in response to receiving a compressed genome that has been compressed according to a method of the present invention， a reference genome that matches the compressed genome is obtained from a reference database. Since an index of the compressed genome saves information of the reference genome similar to the genome， the reference genome may be obtained via the information from the reference database.

Next in step S704, the compressed genome is decompressed, according to the index in the compressed genome, based on difference data between the reference genome and the compressed genome. Since the compressed genome stores the difference data between the genome and the reference genome, the difference data may be applied to the reference genome so as to restore the raw genome from the compressed genome.

In one embodiment of the present invention， there is further comprised： in response to a request for access to a specified portion in the compressed genome， searching for difference data corresponding to the specified portion in the difference data according to the index； and decompressing the specified portion based on the difference information and the reference genome.

As described in the above procedure for building the index, one skilled in the art will understand that the index indicates which segment or segments in the reference genome each portion of the difference data corresponds to. In this manner, a certain portion in the compressed genome may be decompressed conveniently.
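Access to a specified portion may be sketched in Python as follows. The layout of the index (segment identifier mapped to 1-based inclusive positions in the reference genome) and of the difference data (change-type records grouped by segment, with offsets counted from the previous difference) are assumptions of this sketch, as are all names and sample values. Differences outside core areas, deletions and insertions are not handled.

```python
def decompress_segment(reference, index, diffs, segment_id):
    """Restore one segment from the reference genome and its difference records.

    index: dict mapping segment_id to (start, end), 1-based inclusive
    diffs: dict mapping segment_id to a list of (kind, base, offset) records
    """
    start, end = index[segment_id]
    bases = list(reference[start - 1:end])
    pos = 0
    for kind, base, offset in diffs.get(segment_id, []):
        if kind != "c":
            raise ValueError("only change-type records are handled in this sketch")
        pos += offset
        bases[pos - 1] = base
    return "".join(bases)


reference = "ATGCGTACGTTTGACC"
index = {"seg1": (1, 8), "seg2": (9, 16)}
diffs = {"seg2": [("c", "A", 3)]}

# Only seg2 is restored; seg1 and its records are never touched.
part = decompress_segment(reference, index, diffs, "seg2")

# Full decompression is the concatenation of all segments in index order.
whole = "".join(decompress_segment(reference, index, diffs, s) for s in ("seg1", "seg2"))
```

Because each lookup touches only one segment and its records, the cost of partial access depends on the segment size rather than on the length of the whole genome.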

Fig. 8A schematically shows a block diagram 800A of an apparatus for genome compression according to one embodiment of the present invention. Specifically， there is provided an apparatus for genome compression， comprising： a selecting module 810A configured to select from a reference database a reference genome that matches the genome； an indexing module 820A configured to build an index based on positions of the reference genome’s multiple segments in the reference genome； an aligning module 830A configured to align the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome； and a generating module 840A configured to generate a compressed genome， the compressed genome comprising at least the index and the difference data.

In one embodiment of the present invention, selecting module 810A comprises at least one of: a first selecting module configured to select the reference genome based on at least one phenotypic trait characterizing reference genomes in the reference database; and a second selecting module configured to select the reference genome based on at least one predefined sequence included in reference genomes in the reference database.

In one embodiment of the present invention， the first selecting module comprises： a calculating module configured to calculate a first similarity between the at least one phenotypic trait characterizing the genome and at least one phenotypic trait characterizing a reference genome in the reference database； and a first selecting unit configured to select the reference genome with the first similarity larger than a first threshold.

In one embodiment of the present invention, the second selecting module comprises: a position determining module configured to, with respect to a current reference genome in the reference database, determine a first position set of the at least one predefined sequence in the genome, and determine a second position set of the at least one predefined sequence in the current reference genome; a position similarity calculating module configured to calculate a second similarity between the first position set and the second position set; and a second selecting unit configured to select the reference genome based on the second similarity.

In one embodiment of the present invention, the second selecting unit comprises: a candidate list generating module configured to add to a candidate list a reference genome with the second similarity larger than a second threshold; and a multiple sequence comparing module configured to compare the genome with each reference genome in the candidate list so as to select from the candidate list a reference genome with a minimal difference from the genome.

In one embodiment of the present invention， aligning module 830A comprises： a core area generating module configured to， with respect to a sub-segment of a current segment among the multiple segments， look up in the genome a core area that is similar to text of the sub-segment； a first difference data generating module configured to take text difference between the core area and the sub-segment as at least one part of the difference data； and a second difference data generating module configured to add to the difference data other part than the core area in the genome.

In one embodiment of the present invention， the core area generating module further comprises： a first expanding module configured to， with respect to the sub-segment of the current segment among the multiple segments， expand the core area forward and/or backward in the genome； and a second expanding module configured to， in response to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area being lower than a third threshold， use the expanded core area as an expanded core area.

Fig. 8B schematically shows a block diagram 800B of an apparatus for decompressing a compressed genome according to one embodiment of the present invention. Specifically， there is provided an apparatus for genome decompression， comprising： an obtaining module 810B configured to， in response to receiving a compressed genome that has been compressed according to a method of the present invention， obtain from a reference database a reference genome that matches the compressed genome； and a decompressing module 820B configured to decompress， according to an index in the compressed genome， the compressed genome based on difference data between the reference genome and the compressed genome.

In one embodiment of the present invention， there is further comprised： a locating module configured to， in response to a request for access to a specified portion in the compressed genome， search for difference data corresponding to the specified portion in the difference data according to the index； and a partial decompressing module configured to decompress the specified portion based on the difference information and the reference genome.

The flowchart and block diagrams in the Figures illustrate the architecture， functionality， and operation of possible implementations of systems， methods and computer program products according to various embodiments of the present invention. In this regard， each block in the flowchart or block diagrams may represent a module， segment， or portion of code， which comprises one or more executable instructions for implementing the specified logical function (s) . It should also be noted that， in some alternative implementations， the functions noted in the block may occur out of the order noted in the figures. For example， two blocks illustrated in succession may， in fact， be executed substantially concurrently， or the blocks may sometimes be executed in the reverse order， depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration， and combinations of blocks in the block diagrams and/or flowchart illustration， can be implemented by special purpose hardware-based systems that perform the specified functions or acts， or combinations of special purpose hardware and computer instructions.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration， but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments， the practical application or technical improvement over technologies found in the marketplace， or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

A method for genome compression， comprising：

selecting from a reference database a reference genome that matches the genome；

building an index based on positions of the reference genome’s multiple segments in the reference genome；

aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome； and

generating a compressed genome， the compressed genome comprising at least the index and the difference data.
The method according to Claim 1， wherein the selecting from a reference database a reference genome that matches the genome comprises：

selecting the reference genome based on at least one of at least one phenotypic trait characterizing reference genomes in the reference database and at least one predefined sequence included in reference genomes in the reference database.
The method according to Claim 2， wherein the selecting the reference genome comprises：

calculating a first similarity between the at least one phenotypic trait characterizing the genome and at least one phenotypic trait characterizing a reference genome in the reference database； and

selecting the reference genome with the first similarity larger than a first threshold.
The method according to Claim 2， wherein the selecting the reference genome comprises： with respect to a current reference genome in the reference database，

determining a first position set of the at least one predefined sequence in the genome， and determining a second position set of the at least one predefined sequence in the current reference genome；

calculating a second similarity between the first position set and the second position set； and

selecting the reference genome based on the second similarity.
The method according to Claim 4， wherein the selecting the reference genome based on the second similarity comprises：

adding to a candidate list a reference genome with the second similarity larger than a second threshold; and

comparing the genome with each reference genome in the candidate list so as to select from the candidate list a reference genome with a minimal difference from the genome.
The method according to any of Claims 1-5， wherein the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.
The method according to any of Claims 1-5， wherein the aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome comprises：

with respect to a sub-segment of a current segment among the multiple segments，

looking up in the genome a core area that is similar to text of the sub-segment；

taking text difference between the core area and the sub-segment as at least one part of the difference data； and

adding to the difference data other part than the core area in the genome.
The method according to Claim 7， further comprising： with respect to the sub-segment of the current segment among the multiple segments，

expanding the core area forward and/or backward in the genome； and

in response to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area being lower than a third threshold， using the expanded core area as an expanded core area.
A method for genome decompression， comprising：

in response to receiving a compressed genome that has been compressed according to a method according to any of Claims 1-8， obtaining from a reference database a reference genome that matches the compressed genome； and

decompressing， according to an index in the compressed genome， the compressed genome based on difference data between the reference genome and the compressed genome.
The method according to Claim 9， further comprising：

in response to a request for access to a specified portion in the compressed genome， searching for difference data corresponding to the specified portion in the difference data according to the index； and

decompressing the specified portion based on the difference information and the reference genome.
An apparatus for genome compression， comprising：

a selecting module configured to select from a reference database a reference genome that matches the genome；

an indexing module configured to build an index based on positions of the reference genome’s multiple segments in the reference genome；

an aligning module configured to align the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome； and

a generating module configured to generate a compressed genome， the compressed genome comprising at least the index and the difference data.
The apparatus according to Claim 11， wherein the selecting module comprises at least one of：

a first selecting module configured to select the reference genome based on at least one phenotypic trait characterizing reference genomes in the reference database； and

a second selecting module configured to select the reference genome based on at least one predefined sequence included in reference genomes in the reference database.
The apparatus according to Claim 12， wherein the first selecting module comprises：

a calculating module configured to calculate a first similarity between the at least one phenotypic trait characterizing the genome and at least one phenotypic trait characterizing a reference genome in the reference database； and

a first selecting unit configured to select the reference genome with the first similarity larger than a first threshold.
The apparatus according to Claim 12， wherein the second selecting module comprises：

a position determining module configured to， with respect to a current reference genome in the reference database， determine a first position set of the at least one predefined sequence in the genome， and determine a second position set of the at least one predefined sequence in the current reference genome；

a position similarity calculating module configured to calculate a second similarity between the first position set and the second position set； and

a second selecting unit configured to select the reference genome based on the second similarity.
The apparatus according to Claim 14， wherein the second selecting unit comprises：

a candidate list generating module configured to add to a candidate list a reference genome with the second similarity larger than a second threshold; and

a multiple sequence comparing module configured to compare the genome with each reference genome in the candidate list so as to select from the candidate list a reference genome with a minimal difference from the genome.
The apparatus according to any of Claims 11-15， wherein the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.
The apparatus according to any of Claims 11-15， wherein the aligning module comprises：

a core area generating module configured to， with respect to a sub-segment of a current segment among the multiple segments， look up in the genome a core area that is similar to text of the sub-segment；

a first difference data generating module configured to take text difference between the core area and the sub-segment as at least one part of the difference data； and

a second difference data generating module configured to add to the difference data other part than the core area in the genome.
The apparatus according to Claim 17， wherein the core area generating module further comprises：

a first expanding module configured to， with respect to the sub-segment of the current segment among the multiple segments， expand the core area forward and/or backward in the genome； and

a second expanding module configured to， in response to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area being lower than a third threshold， use the expanded core area as an expanded core area.
An apparatus for genome decompression， comprising：

an obtaining module configured to， in response to receiving a compressed genome that has been compressed according to a method according to any of Claims 1-8， obtain from a reference database a reference genome that matches the compressed genome； and

a decompressing module configured to decompress， according to an index in the compressed genome， the compressed genome based on difference data between the reference genome and the compressed genome.
The apparatus according to Claim 19， further comprising：

a locating module configured to， in response to a request for access to a specified portion in the compressed genome， search for difference data corresponding to the specified portion in the difference data according to the index； and

a partial decompressing module configured to decompress the specified portion based on the difference information and the reference genome.
A computer program comprising program code adapted to perform the method steps of any of claims 1 to 10 when said program is run on a computer.