WO2015081754A1 - Genome compression and decompression - Google Patents

Genome compression and decompression

Info

Publication number
WO2015081754A1
Authority
WO
WIPO (PCT)
Prior art keywords
genome
reference genome
compressed
module configured
difference data
Prior art date
Application number
PCT/CN2014/088400
Other languages
French (fr)
Inventor
Jiandong Ding
Junchi Yan
Yanan Zhang
Min GONG
Yunjie QIU
Original Assignee
International Business Machines Corporation
Ibm (China) Co., Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corporation, Ibm (China) Co., Limited
Priority to DE112014005580.8T (DE112014005580T5)
Priority to US15/101,946 (US10679727B2)
Publication of WO2015081754A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Definitions

  • Various embodiments of the present invention relate to data compression and decompression, and more specifically, to a method and apparatus for genome compression and decompression.
  • sequencing biological genomes refers to recording a sequence of base pairs composing the chromosome of the organism.
  • sequencing: the process of measuring the genome of the first sample of a species
  • re-sequencing: the process of measuring the genome of other samples of the species.
  • The human genome comprises about 3 billion base pairs; according to existing representation modes, a human genome consists of about 6 billion characters (characters A, G, T and C). Therefore, storing each genome takes up much storage space. When a large number of genomes must be stored, copied or transmitted, improving the efficiency of data storage and data transmission becomes a challenge.
  • Biologists have found there is certain similarity among genomes of various samples of the same species. For example, the similarity among human genomes is much higher than the similarity between genomes of humans and other species; further, the similarity among genomes of the yellow race is usually higher than the similarity between genomes of the yellow race and the white race.
  • a method for genome compression comprising: selecting from a reference database a reference genome that matches the genome; building an index based on positions of the reference genome’s multiple segments in the reference genome; aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and generating a compressed genome, the compressed genome comprising at least the index and the difference data.
  • the selecting from a reference database a reference genome that matches the genome comprises: selecting the reference genome based on at least one of at least one phenotypic trait characterizing reference genomes in the reference database and at least one predefined sequence in reference genomes in the reference database.
  • the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length. If annotation information associated with the reference genome can be obtained, that information takes precedence.
  • a method for genome decompression comprising: in response to receiving a compressed genome that has been compressed according to a method of the present invention, obtaining from a reference database a reference genome that matches the compressed genome; and decompressing, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.
  • an apparatus for genome compression comprising: a selecting module configured to select from a reference database a reference genome that matches the genome; an indexing module configured to build an index based on positions of the reference genome’s multiple segments in the reference genome; an aligning module configured to align the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and a generating module configured to generate a compressed genome, the compressed genome comprising at least the index and the difference data.
  • the selecting module comprises at least one of: a first selecting module configured to select the reference genome based on at least one phenotypic trait characterizing reference genomes in the reference database; and a second selecting module configured to select the reference genome based on at least one predefined sequence in reference genomes in the reference database.
  • the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.
  • an apparatus for genome decompression comprising: an obtaining module configured to, in response to receiving a compressed genome that has been compressed according to a method of the present invention, obtain from a reference database a reference genome that matches the compressed genome; and a decompressing module configured to decompress, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.
  • a representative genome may be used as a reference genome; when storing a new to-be-processed genome, only the difference between the to-be-processed genome and the reference genome is saved, thereby reducing the amount of data significantly.
  • a compressed genome includes an index
  • any base pair in the genome can be found rapidly by querying the index, and further a gene segment desired to be accessed can be found rapidly without decompressing the entire compressed genome.
  • Fig. 1 schematically shows an exemplary computer system which is applicable to implement the embodiments of the present invention
  • Fig. 2 schematically shows a diagram of the data structure of a genome obtained from sequencing an organism
  • Fig. 3 schematically shows a schematic view of a method for genome compression according to one embodiment
  • Fig. 4 schematically shows a schematic view of a method for genome compression according to one embodiment of the present invention
  • Fig. 5 schematically shows a schematic view of the process for building an index according to the embodiments of the present invention
  • Figs. 6A to 6C schematically show views for identifying difference data between a genome and a reference genome according to one embodiment of the present invention
  • Fig. 7 schematically shows a flowchart of a method for decompressing a compressed genome according to one embodiment of the present invention.
  • Fig. 8A schematically shows a block diagram of an apparatus for genome compression according to one embodiment of the present invention
  • Fig. 8B schematically shows a block diagram of an apparatus for decompressing a compressed genome according to one embodiment of the present invention.
  • aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or one embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, in some embodiments, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
  • a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electro-magnetic signal, optical signal, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc. , or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the program code may execute entirely on the user’s computer, partly on the user’s computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instruction means which implements the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • Referring to Fig. 1, a block diagram of an exemplary computer system/server 12 which is applicable to implement the embodiments of the present invention is illustrated.
  • Computer system/server 12 illustrated in Fig. 1 is only illustrative and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention described herein.
  • computer system/server 12 is illustrated in the form of a general-purpose computing device.
  • the components of computer system/server 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and processing units 16.
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
  • bus architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
  • Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12, and it includes both volatile and non-volatile media, removable and non-removable media.
  • System memory 28 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32.
  • Computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 34 can be provided for reading from and writing to non-removable, non-volatile magnetic media (not illustrated in Fig. 1 and typically called a “hard drive”).
  • a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e. g.
  • memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the present invention.
  • Program/utility 40 having a set (at least one) of program modules 42, may be stored in memory 28 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment.
  • Program modules 42 generally carry out the functions and/or methodologies of embodiments of the present invention as described herein.
  • Computer system/server 12 may also communicate with one or more external devices 14 such as a keyboard, a pointing device, a display 24, etc.; one or more devices that enable a user to interact with computer system/server 12; and/or any devices (e.g., network card, modem, etc.) that enable computer system/server 12 to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces 22. Still yet, computer system/server 12 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 20.
  • network adapter 20 communicates with the other components of computer system/server 12 via bus 18. It should be understood that although not illustrated, other hardware and/or software components could be used in conjunction with computer system/server 12. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.
  • Fig. 2 schematically shows a diagram 200 of the data structure of a genome obtained from sequencing an organism.
  • reference numeral 210 shows a schematic view of a chromosome
  • reference numeral 220 shows a schematic view of a genome.
  • a genome of an organism may be described by accurate arrangement of base pairs of deoxyribonucleic acid (DNA) .
  • the genome may be represented by an ordered sequence constructed from the four bases A, G, T and C.
  • Genomes of different organisms have different lengths.
  • human genomes consist of about 3 billion base pairs (i.e., 6 billion characters), while genomes of other organisms may have different lengths.
  • Fig. 3 schematically shows a schematic view 300 of a method for genome compression according to one embodiment.
  • a genome 310 is a to-be-compressed genome
  • a reference genome 320 is a “standard genome” serving as alignment basis.
  • An alignment may be made between the to-be-compressed genome 310 and the reference genome 320, and only difference data 330 between genome 310 and reference genome 320 are saved in a compressed genome.
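The idea in Fig. 3 of saving only difference data can be illustrated with a minimal sketch. It assumes, for simplicity, that the genome and the reference have equal length and differ only by substitutions; the function names `substitution_diff` and `restore` are hypothetical and are not taken from the patent.

```python
def substitution_diff(genome: str, reference: str) -> list[tuple[int, str]]:
    """Record (position, base) wherever the genome differs from the reference."""
    if len(genome) != len(reference):
        # Assumption of this sketch: no insertions or deletions.
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, g) for i, (g, r) in enumerate(zip(genome, reference)) if g != r]


def restore(reference: str, diff: list[tuple[int, str]]) -> str:
    """Rebuild the genome from the reference plus the recorded differences."""
    bases = list(reference)
    for position, base in diff:
        bases[position] = base
    return "".join(bases)


reference = "ATGCGTACGT"
genome = "ATGCGAACGT"
diff = substitution_diff(genome, reference)  # a single substitution is stored
```

Only the one-entry `diff` list needs to be stored next to a pointer to the reference; `restore(reference, diff)` returns the original genome.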
  • the present invention proposes a method for genome compression.
  • the method comprises: selecting from a reference database a reference genome that matches the genome; building an index based on positions of the reference genome’s multiple segments in the reference genome; aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and generating a compressed genome, the compressed genome at least comprising the index and the difference data.
  • Fig. 4 schematically shows a schematic view 400 of a method for genome compression according to one embodiment of the present invention.
  • a reference genome that matches the genome is selected from a reference database.
  • multiple reference genomes are stored in the reference database here, and these reference genomes may come from multiple samples of multiple species, such as multiple reference genomes from different races (the white race, the yellow race, the brown race and the black race), and multiple reference genomes from various refined categories of other creatures. Since genomes of the same species have a higher similarity (i. e.
  • reference database mentioned in the present invention may further be enriched as new to-be-compressed genomes are processed. Detailed description will be presented below in this regard.
  • an index is built based on positions of the reference genome’s multiple segments in the reference genome. Since a genome usually consists of billions of characters, an index may further be built in order to locate specific positions in the genome much more quickly. An index may be built according to multiple segments in the reference genome.
  • a segment refers to the bases between a starting position and an ending position in the genome. For example, at1g33500: 1-10000 indicates that the segment is named at1g33500, and that the starting and ending positions of bases in the segment are 1 and 10000, respectively.
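A segment descriptor of the form shown above can be turned into a small record. The following is a sketch under stated assumptions: the `name: start-end` text layout is taken from the example, positions are treated as 1-based and inclusive, and the names `Segment`, `parse_segment` and `bases_of` are hypothetical.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    name: str
    start: int  # 1-based, inclusive (assumption)
    end: int    # 1-based, inclusive (assumption)


def parse_segment(text: str) -> Segment:
    """Parse a descriptor such as 'at1g33500: 1-10000'."""
    match = re.fullmatch(r"\s*(\w+)\s*:\s*(\d+)\s*-\s*(\d+)\s*", text)
    if match is None:
        raise ValueError(f"not a segment descriptor: {text!r}")
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    return Segment(match.group(1), start, end)


def bases_of(genome: str, segment: Segment) -> str:
    """Return the bases covered by the segment."""
    return genome[segment.start - 1 : segment.end]
```

For instance, `parse_segment("at1g33500: 1-10000")` yields a record whose `start` is 1 and whose `end` is 10000.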
  • segments are defined according to biological functions of various bases in the genome, or segments are defined in other manners. Detailed description will be presented below in this regard.
  • In step S406, the genome is aligned with the reference genome based on the multiple segments, so as to identify difference data between the genome and the reference genome. Since a genome consists of a huge number of bases, each segment among the multiple segments is taken as a unit, and the base sequence in each segment of the reference genome is aligned with the to-be-compressed genome; when a portion that matches the segment is found in the to-be-compressed genome, only differences between the portion and the character sequence in the segment are recorded.
  • a compressed genome is generated, the compressed genome comprising at least the index and the difference data. Since the compressed genome does not include base sequences that are the same as in the reference genome, the space occupied by the compressed genome can be reduced greatly. When the reference database consists of only one reference genome, the compressed genome does not have to include an identifier of the reference genome; when the reference database consists of multiple reference genomes, the compressed genome should include an identifier of the reference genome, so that the reference genome used in compression can be found through the identifier.
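One possible in-memory layout for such a compressed genome is sketched below. The field names, the `(kind, base, offset)` shape of a difference entry and the `resolve_reference` helper are illustrative assumptions; the text only requires that the index and the difference data be present, with a reference identifier when the database holds several references.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CompressedGenome:
    # Segment name -> (start, end) positions in the reference genome.
    index: dict[str, tuple[int, int]]
    # Segment name -> difference entries recorded for that segment.
    differences: dict[str, list[tuple[str, str, int]]] = field(default_factory=dict)
    # Optional: may be omitted when the database holds a single reference.
    reference_id: Optional[str] = None


def resolve_reference(compressed: CompressedGenome, database: dict[str, str]) -> str:
    """Find the reference genome that was used during compression."""
    if compressed.reference_id is not None:
        return database[compressed.reference_id]
    if len(database) == 1:
        return next(iter(database.values()))
    raise LookupError("an identifier is needed when several references exist")
```

Keeping the identifier optional mirrors the two cases described above: a single-reference database needs no identifier, while a multi-reference database does.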
  • new reference genomes may be added to the reference database gradually; for example, the reference database may be gradually updated during genome compression.
  • When no reference genome with a sufficiently high similarity to a genome A can be found in the reference database, genome A may be considered to belong to a new species, and thus genome A may be added to a candidate list.
  • a clustering method may be used and the most representative to-be-compressed genome obtained from clustering may be added to the reference database.
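The text does not name a clustering method. As a stand-in, the sketch below treats the whole candidate list as one cluster and picks its medoid, that is, the candidate with the smallest total distance to all others. The Hamming distance on equal-length sequences and the names `hamming` and `most_representative` are assumptions made for illustration only.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def most_representative(candidates: list[str]) -> str:
    """Return the medoid: the candidate closest, in total, to all the others."""
    if not candidates:
        raise ValueError("candidate list is empty")
    return min(
        candidates,
        key=lambda c: sum(hamming(c, other) for other in candidates),
    )
```

With the candidates `"AAAA"`, `"AAAT"` and `"AATT"`, the middle sequence has the smallest total distance and would be the one promoted to the reference database.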
  • The index is included in the compressed genome so that, when only bases in a specific position range need to be accessed, the portion of the difference data corresponding to that range can be found quickly through the index. Partial decompression is then conducted based on the reference genome and that portion of the difference data, rather than decompressing the whole genome and then locating the specified position range in it.
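Partial decompression through the index can be sketched as follows. The sketch assumes a sorted index of `(start, end, name)` entries with 1-based inclusive positions and substitution-only difference data stored as `(offset_in_segment, base)` pairs, so that positions in the genome and the reference coincide; these layouts and the names `find_segment` and `base_at` are hypothetical.

```python
import bisect


def find_segment(index: list[tuple[int, int, str]], position: int) -> tuple[int, int, str]:
    """Locate the segment covering a 1-based position with a binary search."""
    starts = [start for start, _, _ in index]
    i = bisect.bisect_right(starts, position) - 1
    if i < 0 or position > index[i][1]:
        raise KeyError(f"position {position} is not covered by the index")
    return index[i]


def base_at(position: int, reference: str, index: list[tuple[int, int, str]],
            differences: dict[str, list[tuple[int, str]]]) -> str:
    """Decompress only the segment that contains the requested position."""
    start, end, name = find_segment(index, position)
    segment = list(reference[start - 1 : end])
    for offset, base in differences.get(name, []):
        segment[offset] = base  # substitution-only assumption
    return segment[position - start]
```

Only one segment of the reference is touched per query, which is the point of storing the index with the difference data.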
  • the selecting from a reference database a reference genome that matches the genome comprises: selecting the reference genome based on at least one of at least one phenotypic trait characterizing reference genomes in the reference database and at least one predefined sequence in reference genomes in the reference database.
  • a reference genome that is similar to the to-be-compressed genome can be selected by comparing phenotypic traits of the to-be-compressed genome and each reference genome.
  • the selecting the reference genome comprises: calculating a first similarity between the at least one phenotypic trait characterizing the genome and at least one phenotypic trait characterizing a reference genome in the reference database; and selecting the reference genome with the first similarity larger than a first threshold.
  • a Euclidean distance between a vector V1 describing the phenotypic trait of the to-be-compressed genome and a vector V2 describing a phenotypic trait of a reference genome in the reference database is calculated and used as the first similarity.
  • a higher weight may be assigned to the phenotypic trait.
  • the reference genome with the first similarity larger than a first threshold may be selected; or when there exist multiple reference genomes each having a similarity larger than the first threshold, the reference genome with the highest similarity may be selected. Those skilled in the art may further adopt other approaches to selecting the reference genome.
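The phenotype-based selection can be sketched as below. Because a larger Euclidean distance means a less similar pair, this sketch converts the distance d into a similarity 1 / (1 + d) before comparing it with the threshold; that conversion, the per-trait weights and the function names are assumptions, not details given in the text.

```python
import math
from typing import Optional


def weighted_distance(v1: list[float], v2: list[float], weights: list[float]) -> float:
    """Weighted Euclidean distance between two phenotypic trait vectors."""
    if not (len(v1) == len(v2) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(v1, v2, weights)))


def similarity(v1: list[float], v2: list[float], weights: list[float]) -> float:
    """Map a distance to (0, 1]; identical vectors give 1.0 (assumed mapping)."""
    return 1.0 / (1.0 + weighted_distance(v1, v2, weights))


def select_reference(target: list[float], references: dict[str, list[float]],
                     weights: list[float], threshold: float) -> Optional[str]:
    """Return the most similar reference above the threshold, or None."""
    best_id, best_similarity = None, threshold
    for reference_id, traits in references.items():
        s = similarity(target, traits, weights)
        if s > best_similarity:
            best_id, best_similarity = reference_id, s
    return best_id
```

Returning `None` when nothing exceeds the threshold leaves room for the candidate-list handling of genomes that match no existing reference.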
  • the selecting the reference genome comprises: with respect to a current reference genome in the reference database, determining a first position set of the at least one predefined sequence in the genome, and determining a second position set of the at least one predefined sequence in the current reference genome; calculating a second similarity between the first position set and the second position set; and selecting the reference genome based on the second similarity.
  • the reference genome may be selected based on the similarity between positions of the predefined sequence in the to-be-compressed genome and the reference genome.
  • the predefined sequence may be a base sequence that only exerts little impact on the division of species. For example, since humans belong to mammals, human genomes include some conserved base sequence segments that are the same as lower mammals; although humans can further be categorized into the white race, the yellow race and other races, genomes of each race include these conserved base sequence segments.
  • positions of these conserved base sequences in various reference genomes can be stored in a structure as shown in Table 2 below.
  • a first position set (e.g., represented as a vector V_SM1)
  • a second position set (e.g., represented as a vector V_SM2)
  • a similarity between the two vectors is calculated whereby the reference genome is selected.
  • the reference genome may be selected based on an approach to calculating a Euclidean distance or other approaches.
  • the accuracy of selecting the reference genome based on positions of conserved base sequence segments is yet to be improved, so first multiple reference genomes each having a similarity larger than a specific threshold may be selected from the reference database, and then the most appropriate reference genome is selected from the multiple reference genomes.
  • the selecting the reference genome based on the second similarity comprises: adding to a candidate list a reference genome with the second similarity larger than a second threshold; and comparing the genome with each reference genome in the candidate list so as to select from the candidate list a reference genome with a minimal difference from the genome.
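The two-stage selection can be sketched as follows. The position of each predefined sequence is taken with `str.find`, the similarity is derived from a Euclidean distance as 1 / (1 + d), and the final comparison uses a plain mismatch count; all three choices, along with the function names, are simplifying assumptions rather than the method of the text.

```python
import math
from typing import Optional


def position_vector(genome: str, markers: list[str]) -> list[int]:
    """First position of each predefined sequence (-1 when it is absent)."""
    return [genome.find(marker) for marker in markers]


def position_similarity(p1: list[int], p2: list[int]) -> float:
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))
    return 1.0 / (1.0 + distance)


def mismatch_count(a: str, b: str) -> int:
    """Crude difference measure: mismatches plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def choose_reference(genome: str, references: dict[str, str],
                     markers: list[str], threshold: float) -> Optional[str]:
    target = position_vector(genome, markers)
    # Stage 1: candidate list of references whose similarity exceeds the threshold.
    candidates = [
        reference_id for reference_id, reference in references.items()
        if position_similarity(target, position_vector(reference, markers)) > threshold
    ]
    if not candidates:
        return None
    # Stage 2: the candidate with the minimal difference from the genome.
    return min(candidates, key=lambda rid: mismatch_count(genome, references[rid]))
```

The cheap position comparison prunes the database first, so the more expensive sequence comparison runs only on the short candidate list.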
  • a Multiple Sequence Alignment may be used to compare the to-be-compressed genome with multiple candidate reference genomes.
  • the Multiple Sequence Alignment is an alignment of three or more biological sequences (protein, DNA, etc. ) .
  • multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.
  • the reference database may further include annotation information associated with each reference genome.
  • the annotation information mentioned here may be, for example, annotation information that describes functions of a base sequence between certain starting and ending positions. For example, suppose a base sequence between a starting position of 1 and an ending position of 10000 is correlated to human skin color, then annotation may be added with respect to the base sequence between positions 1-10000, indicating this base portion is correlated to human skin color.
  • other types of annotation may be added to base sequences at other positions in the genome.
  • segments may be defined based on starting and ending positions of base sequences associated with these annotations.
  • other parts without annotation information may be divided according to a predefined step-length. For example, division may be conducted in units of 1000 bases, or another predefined step-length may be set.
  • Fig. 5 schematically shows a schematic view 500 of the process for building an index according to one embodiment of the present invention.
  • an annotation 1 520 and an annotation 2 522 represent two annotations of a reference genome 510 in a reference database.
  • Starting and ending positions of a base sequence corresponding to annotation 1 520 in the entire genome are position 1 540 and position 2 542, respectively. Therefore, the portion between position 1 540 and position 2 542 may act as one segment (e. g. , a segment N 530) .
  • starting and ending positions of a base sequence corresponding to annotation 2 522 in the entire genome are position 2 542 and position 3 544, respectively. Therefore, the portion between position 2 542 and position 3 544 may act as another segment (e.
  • reference genome 510 may further be divided into segments according to a predefined step 524, whereby a segment N+2 534 is obtained. In this manner, entire reference genome 510 may be divided into multiple segments.
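The division into segments described for Fig. 5 can be sketched as below: annotated ranges become segments first, and the remaining gaps are cut by a predefined step-length. The sketch assumes 1-based inclusive, non-overlapping annotation ranges; the generated `step_<start>` names and the function names are hypothetical.

```python
def fill_by_step(segments: list[tuple[str, int, int]],
                 first: int, last: int, step: int) -> None:
    """Cut the un-annotated range [first, last] into pieces of at most `step` bases."""
    cursor = first
    while cursor <= last:
        end = min(cursor + step - 1, last)
        segments.append((f"step_{cursor}", cursor, end))
        cursor = end + 1


def build_segments(genome_length: int,
                   annotations: list[tuple[str, int, int]],
                   step: int) -> list[tuple[str, int, int]]:
    """Annotated ranges take precedence; gaps are divided by the step-length."""
    segments: list[tuple[str, int, int]] = []
    cursor = 1
    for name, start, end in sorted(annotations, key=lambda a: a[1]):
        fill_by_step(segments, cursor, start - 1, step)
        segments.append((name, start, end))
        cursor = end + 1
    fill_by_step(segments, cursor, genome_length, step)
    return segments
```

For a 25-base reference with one annotation covering positions 1 to 10 and a step-length of 10, the result is the annotated segment followed by two step-based segments covering positions 11 to 20 and 21 to 25.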
  • one of the multiple segments in the reference genome may act as a basic unit for alignment with the to-be-compressed genome; further, one sub-segment in a segment may act as a basic unit for alignment. Alignment with a sub-segment as the basic unit possibly helps to enhance the probability of matching but also might complicate the index.
  • the aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome comprises: with respect to a sub-segment of a current segment among the multiple segments, looking up in the genome a core area that is similar to text of the sub-segment; taking text difference between the core area and the sub-segment as at least one part of the difference data; and adding to the difference data other part than the core area in the genome.
  • Figs. 6A to 6C schematically show respective schematic views 600A to 600C for identifying difference data between the genome and the reference genome according to one embodiment of the present invention.
  • Each of the multiple segments may be aligned with the to-be-compressed genome. The description below covers only how one segment is aligned with the to-be-compressed genome.
  • the comparison may be made based on an n-gram and using a sliding window. Since the genome is a character sequence consisting of A, G, T and C four bases and with billions of magnitude orders of length, an analysis may be conducted by means of n-gram in a Probabilistic Language Model. For more details of the n-gram, reference may be made to http: //en. wikipedia. org/wiki/N-gram, which is not detailed in this specification.
  • an area whose sum is larger than a predefined threshold is used as the core area.
  • a 3-gram, i.e., alignment is made with 3 bases as the basic unit
  • a score of each n-gram is calculated based on a BLOSUM matrix in this embodiment.
  • Fig. 6A shows a score calculated with respect to each of 3-grams in a base sequence “ATGCGT....”
  • scores of these four basic units 3-gram 1 to 3-gram 4 are 13, 16, 14 and 18, respectively.
  • Fig. 6C shows a concrete example of the difference data.
  • the base in a to-be-compressed genome 610C differs from the base in a current segment 620C (i. e.
  • a difference shown in block 622C in Fig. 6C may be represented as (d, T, 9), where “d” represents a delete-type difference, “T” represents deleting a base “T,” and the difference appears in the 9th base after the last difference.
  • those skilled in the art may further define an insert-type difference.
  • the difference data may be saved in a manner associated with the current segment.
  • the index may include an association between the difference data and the current segment, i.e., the association may represent to which segment in the reference genome the difference data corresponds.
  • an identifier of a segment associated with difference data may be added to the header of the difference data. For example, suppose difference data (d, T, 9) shown in block 622C in Fig. 6C are associated with a segment “seg1” in the reference genome, then the difference data may be recorded as “seg1 (d, T, 9)”.
  • Note that this is only one example of how difference data may be represented, and those skilled in the art may further use other data structures to record difference data, e.g., recording in a 4-tuple form.
  • when difference data correspond to a core area in a to-be-processed genome which is similar to text of a sub-segment of a current segment (i.e., difference data inside a core area), the current segment may be used as a segment associated with the difference data.
  • difference data are other data than various core areas in a to-be-processed genome (i.e., difference data outside core areas)
  • a segment corresponding to a core area preceding (or following) the difference data may be used as a segment associated with the difference data.
  • a correspondence relationship among difference data and segments in the reference genome may be recorded explicitly. Based on the correspondence relationship and index, a specific portion in the compressed genome may be decompressed conveniently.
  • each segment (or each sub-segment of a segment) in the reference genome may be aligned with the to-be-compressed genome so as to find a corresponding core area and record text difference between each core area and a corresponding current segment (or sub-segment of a segment) .
  • the core area is expanded forward and/or backward in the genome; and in response to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area being lower than a third threshold, the expanded core area is used as the final matching area.
  • the core area may further be expanded forward and/or backward in the to-be-compressed genome.
  • the core area is expanded with one base as a step-length at a time.
  • a comparison is made as to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area; when the difference is lower than a predefined threshold, the core area is expanded. Note the expansion should not be implemented without limit but aims to enhance the compression ratio.
  • a method for genome decompression comprising: in response to receiving a compressed genome that has been compressed according to a method of the present invention, obtaining from a reference database a reference genome that matches the compressed genome; and decompressing, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.
  • Fig. 7 schematically shows a flowchart 700 of a method for decompressing a compressed genome according to one embodiment of the present invention.
  • step S702 in response to receiving a compressed genome that has been compressed according to a method of the present invention, a reference genome that matches the compressed genome is obtained from a reference database. Since an index of the compressed genome saves information of the reference genome similar to the genome, the reference genome may be obtained via the information from the reference database.
  • step S704 the compressed genome is decompressed, according to the index in the compressed genome, based on difference data between the reference genome and the compressed genome.
  • the difference data may be applied to the reference genome so as to restore a raw genome from the compressed genome.
  • the index indicates which segment or segments in the reference genome the respective portions of the difference data correspond to. In this manner, a certain portion in the compressed genome may be decompressed conveniently.
  • Fig. 8A schematically shows a block diagram 800A of an apparatus for genome compression according to one embodiment of the present invention.
  • an apparatus for genome compression comprising: a selecting module 810A configured to select from a reference database a reference genome that matches the genome; an indexing module 820A configured to build an index based on positions of the reference genome’s multiple segments in the reference genome; an aligning module 830A configured to align the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and a generating module 840A configured to generate a compressed genome, the compressed genome comprising at least the index and the difference data.
  • selecting module 810A comprises at least one of: a first selecting module configured to select the reference genome based on at least one phenotypic trait characterizing reference genomes in the reference database; and a second selecting module configured to select the reference genome based on at least one predefined sequence included in reference genomes in the reference database.
  • the first selecting module comprises: a calculating module configured to calculate a first similarity between the at least one phenotypic trait characterizing the genome and at least one phenotypic trait characterizing a reference genome in the reference database; and a first selecting unit configured to select the reference genome with the first similarity larger than a first threshold.
  • the second selecting module comprises: a position determining module configured to, with respect to a current reference genome in the reference database, determine a first position set of the at least one predefined sequence in the genome, and determine a second position set of the at least one predefined sequence in the current reference genome; a position similarity calculating module configured to calculate a second similarity between the first position set and the second position set; and a second selecting module configured to select the reference genome based on the second similarity.
  • the second selecting module comprises: a candidate list generating module configured to add to a candidate list a reference genome with the second similarity larger than a second threshold; and a multiple sequence comparing module configured to compare the genome with each reference genome in the candidate list so as to select from the candidate list a reference genome with a minimal difference from the genome.
  • the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.
  • aligning module 830A comprises: a core area generating module configured to, with respect to a sub-segment of a current segment among the multiple segments, look up in the genome a core area that is similar to text of the sub-segment; a first difference data generating module configured to take text difference between the core area and the sub-segment as at least one part of the difference data; and a second difference data generating module configured to add to the difference data other part than the core area in the genome.
  • the core area generating module further comprises: a first expanding module configured to, with respect to the sub-segment of the current segment among the multiple segments, expand the core area forward and/or backward in the genome; and a second expanding module configured to, in response to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area being lower than a third threshold, use the expanded core area as an expanded core area.
  • Fig. 8B schematically shows a block diagram 800B of an apparatus for decompressing a compressed genome according to one embodiment of the present invention.
  • an apparatus for genome decompression comprising: an obtaining module 810B configured to, in response to receiving a compressed genome that has been compressed according to a method of the present invention, obtain from a reference database a reference genome that matches the compressed genome; and a decompressing module 820B configured to decompress, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.
  • a locating module configured to, in response to a request for access to a specified portion in the compressed genome, search for difference data corresponding to the specified portion in the difference data according to the index; and a partial decompressing module configured to decompress the specified portion based on the difference information and the reference genome.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks illustrated in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
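
The sliding-window n-gram comparison described in the list above (score each n-gram, sum the scores over a window, keep an area whose sum exceeds a predefined threshold) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the source scores n-grams with a BLOSUM matrix, whereas the hypothetical `ngram_score` below simply counts agreeing positions, and the function names, the fixed window width and the choice of the single best window are all assumptions.

```python
from typing import List, Optional, Tuple


def ngrams(text: str, n: int) -> List[str]:
    """Return every overlapping n-gram of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_score(a: str, b: str) -> int:
    """Stand-in score: number of positions at which two n-grams agree.

    Assumption: the source uses a BLOSUM matrix here; a plain match
    count keeps the sketch self-contained.
    """
    return sum(1 for x, y in zip(a, b) if x == y)


def find_core_area(genome: str, sub_segment: str, n: int = 3,
                   threshold: int = 0) -> Optional[Tuple[int, int, int]]:
    """Slide a window of len(sub_segment) over genome.

    Returns (start, end, score) of the best window whose summed n-gram
    score is larger than threshold, or None if no window qualifies.
    """
    width = len(sub_segment)
    target = ngrams(sub_segment, n)
    best = None
    for start in range(len(genome) - width + 1):
        window = genome[start:start + width]
        total = sum(ngram_score(g, t)
                    for g, t in zip(ngrams(window, n), target))
        if total > threshold and (best is None or total > best[2]):
            best = (start, start + width, total)
    return best
```

A real aligner would avoid rescoring every window from scratch; the quadratic scan here is only meant to make the threshold test on the summed score concrete.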

Abstract

The present invention relates to a method and apparatus for genome compression and decompression. In one embodiment of the present invention, there is provided a method for genome compression, comprising: selecting from a reference database a reference genome that matches the genome; building an index based on positions of the reference genome's multiple segments in the reference genome; aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and generating a compressed genome, the compressed genome comprising at least the index and the difference data. In other embodiments, there is provided an apparatus for genome compression. Further, there are provided a method and apparatus for decompressing the genome that has been compressed using the above method and apparatus. By means of the technical solution of the present invention, the data compression ratio can be enhanced, and a specified position in the genome can be accessed without a need to decompress the entire genome.

Description

GENOME COMPRESSION AND DECOMPRESSION
FIELD
Various embodiments of the present invention relate to data compression and decompression, and more specifically, to a method and apparatus for genome compression and decompression.
BACKGROUND
With the development of biology, research on biological genes has gone deeper and deeper, e.g., into various aspects such as human health, medicine research & development, new plant and animal species, and microorganisms.
In short, sequencing a biological genome refers to recording the sequence of base pairs composing the chromosomes of the organism. Usually the process of measuring the genome of the first sample of a species is referred to as sequencing, while the process of measuring the genome of another sample of the species is referred to as re-sequencing. Breakthroughs have been achieved in sequencing and re-sequencing technologies, and the costs involved keep falling. More and more individuals and organizations have come to realize the significance of genomes, and genome data for a large number of species have so far been obtained through sequencing and re-sequencing.
The human genome comprises about 3 billion base pairs; according to existing representation modes, a human genome consists of about 6 billion characters (characters A, G, T and C). Therefore, storing each genome takes up much storage space. When a large number of genomes must be stored, copied or transmitted, the challenge arises of how to enhance the efficiency of data storage and transmission.
SUMMARY
Biologists have found there is certain similarity among genomes of various samples of the same species. For example, the similarity among human genomes is much higher than the similarity between genomes of humans and other species; further, the similarity among genomes of the yellow race is usually higher than the similarity between genomes of the yellow race and the white race.
Therefore, it is desired to develop a technical solution for compressing/decompressing a genome based on the similarity among genomes. It is desired that the technical solution can be integrated with existing genome storage modes and make full use of the similarity among genomes and further achieve efficient compression/decompression; in addition, while effectively enhancing the data compression ratio, it is further desired that decompression can be implemented with respect to only a portion of the genome rather than decompressing the entire genome.
In one embodiment of the present invention, there is provided a method for genome compression, comprising: selecting from a reference database a reference genome that matches the genome; building an index based on positions of the reference genome’s multiple segments in the reference genome; aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and generating a compressed genome, the compressed genome comprising at least the index and the difference data.
In one embodiment of the present invention, the selecting from a reference database a reference genome that matches the genome comprises: selecting the reference genome based on at least one of at least one phenotypic trait characterizing reference genomes in the reference database and at least one predefined sequence in reference genomes in the reference database.
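As a rough illustration of selection by predefined sequence, the sketch below collects the positions at which a predefined sequence occurs in the genome and in each reference genome, compares the two position sets, and keeps references whose similarity exceeds a threshold. The source does not say how the similarity between position sets is computed; the Jaccard measure used here, and the names `positions`, `position_similarity` and `candidate_references`, are assumptions made for the example.

```python
from typing import Dict, List, Set


def positions(sequence: str, marker: str) -> Set[int]:
    """Start offsets of every (possibly overlapping) occurrence of marker."""
    found = set()
    i = sequence.find(marker)
    while i != -1:
        found.add(i)
        i = sequence.find(marker, i + 1)
    return found


def position_similarity(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two position sets (an assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def candidate_references(genome: str, references: Dict[str, str],
                         marker: str, threshold: float) -> List[str]:
    """Names of reference genomes whose marker positions resemble the genome's."""
    first = positions(genome, marker)
    candidates = []
    for name, reference in references.items():
        second = positions(reference, marker)
        if position_similarity(first, second) > threshold:
            candidates.append(name)
    return candidates
```

Exact position equality is a crude criterion, since a single insertion shifts every later position; a practical measure would presumably tolerate such shifts.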
In one embodiment of the present invention, the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length. If annotation information associated with the reference genome can be obtained, then the information is considered in preference.
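The segmentation rule can be sketched as follows: annotated ranges become segments first, and any stretch not covered by an annotation is cut according to the predefined step-length. This is a minimal sketch under stated assumptions: annotations are taken to be non-overlapping half-open `(start, end)` pairs, and the function name `build_index` and the `seg1`, `seg2`, ... numbering scheme are illustrative rather than taken from the source.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (segment id, start, end), end exclusive


def build_index(ref_length: int, annotations: List[Tuple[int, int]],
                step: int) -> List[Segment]:
    """Divide [0, ref_length) into segments.

    Annotated ranges are used in preference; every stretch not covered
    by an annotation is cut into pieces of at most `step` bases.
    """
    bounds = []
    cursor = 0
    for start, end in sorted(annotations):
        # Fill the gap before this annotation using the predefined step.
        for s in range(cursor, start, step):
            bounds.append((s, min(s + step, start)))
        bounds.append((start, end))
        cursor = end
    # Cut whatever remains after the last annotation.
    for s in range(cursor, ref_length, step):
        bounds.append((s, min(s + step, ref_length)))
    return [(f"seg{i + 1}", s, e) for i, (s, e) in enumerate(bounds)]
```

With no annotations at all, the function falls back to fixed-length segments, which matches the case where only the step-length is available.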
In one embodiment of the present invention, there is provided a method for genome decompression, comprising: in response to receiving a compressed genome that has been compressed according to a method of the present invention, obtaining from a reference database a reference genome that matches the compressed genome; and decompressing, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.
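To make the decompression step concrete, the sketch below applies difference records of the form (d, T, 9) to one reference segment. Several points are assumptions rather than statements from the source: the records are read as instructions for turning the reference segment into the original genome, the offset is counted in reference bases (1-based) from just after the previous difference, and the insert-type record `"i"` is a hypothetical extension of the kind the source leaves to those skilled in the art.

```python
from typing import List, Tuple

Diff = Tuple[str, str, int]  # (kind, base, offset since the last difference)


def apply_diffs(segment: str, diffs: List[Diff]) -> str:
    """Rebuild the genome text for one reference segment from its records."""
    out = []
    cursor = 0  # index of the next unread base in the reference segment
    for kind, base, offset in diffs:
        pos = cursor + offset - 1        # offset is assumed to be 1-based
        out.append(segment[cursor:pos])  # bases shared with the reference
        if kind == "d":
            if segment[pos] != base:
                raise ValueError(
                    f"expected {base!r} at {pos}, found {segment[pos]!r}")
            cursor = pos + 1             # skip the deleted base
        elif kind == "i":
            out.append(base)             # base present only in the genome
            cursor = pos                 # no reference base is consumed
        else:
            raise ValueError(f"unknown difference type {kind!r}")
    out.append(segment[cursor:])
    return "".join(out)


def format_record(segment_id: str, diff: Diff) -> str:
    """Prefix a record with the identifier of its associated segment."""
    kind, base, offset = diff
    return f"{segment_id} ({kind}, {base}, {offset})"
```

For example, applying `("d", "T", 9)` to the segment `ACGACGACTGCA` drops the ninth base and yields `ACGACGACGCA`.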
In one embodiment of the present invention, there is provided an apparatus for genome compression, comprising: a selecting module configured to select from a reference database a reference genome that matches the genome; an indexing module configured to build an index based on positions of the reference genome’s multiple segments in the reference genome; an aligning module configured to align the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and a generating module configured to generate a compressed genome, the compressed genome comprising at least the index and the difference data.
In one embodiment of the present invention, the selecting module comprises at least one of: a first selecting module configured to select the reference genome based on at least one phenotypic trait characterizing reference genomes in the reference database; and a second selecting module configured to select the reference genome based on at least one predefined sequence in reference genomes in the reference database.
In one embodiment of the present invention, the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.
In one embodiment of the present invention, there is provided an apparatus for genome decompression, comprising: an obtaining module configured to, in response to receiving a compressed genome that has been compressed according to a method of the present invention, obtain from a reference database a reference genome that matches the compressed genome; and a decompressing module configured to decompress, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.
By means of the technical solution according to the embodiments of the present invention, a representative genome may be used as a reference genome; when storing a new to-be-processed genome, only difference between the to-be-processed genome and the reference genome is saved, thereby reducing the amount of data significantly. On the other hand, with the technical solution according to the embodiments of the present invention, where a compressed genome includes an index, any base pair in the genome can be found rapidly by querying the index, and further a gene segment desired to be accessed can be found rapidly without decompressing the entire compressed genome.
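The random-access property can be illustrated with a small sketch: the index is searched for the segment covering a requested position, and only that segment is restored from the reference genome and its difference records. The layout of the index as a sorted list of `(id, start, end)` tuples, the dictionary `diffs_by_segment`, the interpretation of `position` as a reference-genome coordinate, and the restriction to delete-type records are all assumptions made for the example.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Segment = Tuple[str, int, int]  # (segment id, start, end), end exclusive
Diff = Tuple[str, str, int]     # (kind, base, offset since the last difference)


def locate(index: List[Segment], position: int) -> Segment:
    """Find the segment whose [start, end) range contains position.

    index must be sorted by start.
    """
    starts = [start for _, start, _ in index]
    i = bisect_right(starts, position) - 1
    if i < 0 or position >= index[i][2]:
        raise KeyError(position)
    return index[i]


def restore_segment(reference: str, segment: Segment,
                    diffs: List[Diff]) -> str:
    """Apply delete-type records to one reference segment only."""
    _, start, end = segment
    text, out, cursor = reference[start:end], [], 0
    for kind, _base, offset in diffs:
        if kind != "d":
            raise ValueError("only delete-type records are handled here")
        pos = cursor + offset - 1   # offset assumed 1-based
        out.append(text[cursor:pos])
        cursor = pos + 1            # skip the deleted base
    out.append(text[cursor:])
    return "".join(out)


def decompress_at(reference: str, index: List[Segment],
                  diffs_by_segment: Dict[str, List[Diff]],
                  position: int) -> str:
    """Restore only the segment that covers the requested position."""
    segment = locate(index, position)
    return restore_segment(reference, segment,
                           diffs_by_segment.get(segment[0], []))
```

Because the lookup touches one index entry and one segment's records, the cost of a partial read depends on the segment length rather than on the length of the whole genome.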
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Through the more detailed description of some embodiments of the present disclosure in the accompanying drawings, the above and other objects, features and advantages of the present disclosure will become more apparent, wherein the same reference generally refers to the same components in the embodiments of the present disclosure.
Fig. 1 schematically shows an exemplary computer system which is applicable to implement the embodiments of the present invention;
Fig. 2 schematically shows a diagram of the data structure of a genome obtained from sequencing an organism;
Fig. 3 schematically shows a schematic view of a method for genome compression according to one embodiment;
Fig. 4 schematically shows a schematic view of a method for genome compression according to one embodiment of the present invention;
Fig. 5 schematically shows a schematic view of the process for building an index according to the embodiments of the present invention;
Figs. 6A to 6C schematically show respective views for identifying difference data between a genome and a reference genome according to one embodiment of the present invention;
Fig. 7 schematically shows a flowchart of a method for decompressing a compressed genome according to one embodiment of the present invention; and
Fig. 8A schematically shows a block diagram of an apparatus for genome compression according to one embodiment of the present invention, and Fig. 8B schematically shows a block diagram of an apparatus for decompressing a compressed genome according to one embodiment of the present invention.
DETAILED DESCRIPTION
Some preferable embodiments will be described in more detail with reference to the accompanying drawings, in which the preferable embodiments of the present disclosure have been illustrated. However, the present disclosure can be implemented in various manners, and thus should not be construed to be limited to the embodiments disclosed herein. On the contrary, those embodiments are provided for the thorough and complete understanding of the present disclosure, and completely conveying the scope of the present disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or one embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, in some embodiments, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electro-magnetic signal, optical signal, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user’s computer, partly on the user’s computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instruction means which implements the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Referring now to Fig. 1, in which a block diagram of an exemplary computer system/server 12 which is applicable to implement the embodiments of the present invention is illustrated. Computer system/server 12 illustrated in Fig. 1 is only illustrative and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention described herein.
As illustrated in Fig. 1, computer system/server 12 is illustrated in the form of a general-purpose computing device. The components of computer system/server 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12, and it includes both volatile and non-volatile media, removable and non-removable media.
System memory 28 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not illustrated in Fig. 1 and typically called a “hard drive”). Although not illustrated in Fig. 1, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each drive can be connected to bus 18 by one or more data media interfaces. As will be further depicted and described below, memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the present invention.
Program/utility 40, having a set (at least one) of program modules 42, may be stored in memory 28 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments of the present invention as described herein.
Computer system/server 12 may also communicate with one or more external devices 14 such as a keyboard, a pointing device, a display 24, etc. ; one or more devices that enable a user to interact with computer system/server 12; and/or any devices (e. g. , network card, modem, etc. ) that enable computer system/server 12 to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces 22. Still yet, computer system/server 12 can communicate with one or more networks such as a local area network (LAN) , a general wide area network (WAN) , and/or a public network (e. g. , the Internet) via network adapter 20. As depicted, network adapter 20 communicates with the other components of computer system/server 12 via bus 18. It should be understood that although not illustrated, other hardware and/or software components could be used in conjunction with computer system/server 12. Examples, include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.
Fig. 2 schematically shows a diagram 200 of the data structure of a genome obtained from sequencing an organism. In this figure, reference numeral 210 shows a schematic view of a chromosome, and reference numeral 220 shows a schematic view of a genome. In short, a genome of an organism may be described by the exact arrangement of base pairs of deoxyribonucleic acid (DNA). In other words, the genome may be represented by an ordered sequence constructed from the four bases A, G, T and C. Genomes of different organisms have different lengths. For example, human genomes consist of about 3 billion base pairs (i.e., 6 billion characters), while genomes of other organisms may have different lengths.
Fig. 3 schematically shows a schematic view 300 of a method for genome compression according to one embodiment. Methods have already been proposed for genome compression by looking for differences between a current genome and a reference genome. As shown in Fig. 3, a genome 310 is a to-be-compressed genome, while a reference genome 320 is a “standard genome” serving as the alignment basis. An alignment may be made between the to-be-compressed genome 310 and the reference genome 320, and only difference data 330 between genome 310 and reference genome 320 are saved in a compressed genome.
With the development of network technologies, there already exist many organizations that can provide reference genomes, and these reference genomes can be accessed conveniently via networks. According to the genome compression method as shown in Fig. 3, by transmitting only difference data (e. g. , difference data 330) between the genome and the reference genome during genome transmission, raw data of genome 310 can be obtained based on transmitted difference data 330 and reference genome 320 obtained from network access.
Although the above method can enhance the data compression efficiency to a given extent, there still exist the following drawbacks: on one hand, it is difficult to effectively select from multiple existing reference genomes a reference genome that best matches the to-be-compressed genome; on the other hand, difference data are compressed as a whole in order to achieve a higher compression ratio, whereas when it is desired to only access a base pair at a specific position in the genome, the raw genome must be decompressed before locating the specific base pair.
In view of these drawbacks in the above technical solution, the present invention proposes a method for genome compression. The method comprises: selecting from a reference database a reference genome that matches the genome; building an index based on positions of the reference genome’s multiple segments in the reference genome; aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and generating a compressed genome, the compressed genome at least comprising the index and the difference data.
Fig. 4 schematically shows a schematic view 400 of a method for genome compression according to one embodiment of the present invention. First of all, in step S402 a reference genome that matches the genome is selected from a reference database. Note that multiple reference genomes are stored in the reference database here, and these reference genomes may come from multiple samples of multiple species, such as multiple reference genomes from different races (the white race, the yellow race, the brown race and the black race), and multiple reference genomes from various refined categories of other creatures. Since genomes of the same species have a higher similarity (i.e., text similarity among base characters in genomes), providing a reference database comprising abundant reference genomes helps to find a reference genome that better matches a to-be-compressed genome, so as to further enhance the data compression ratio. In the context of the present invention, “to match” represents that two genomes have a higher similarity.
In addition, the reference database mentioned in the present invention may further be enriched as new to-be-compressed genomes are processed. Detailed description will be presented below in this regard.
In step S404, an index is built based on positions of the reference genome’s multiple segments in the reference genome. Since a genome usually consists of billions of characters, an index may further be built in order to locate specific positions in the genome much more quickly. An index may be built according to multiple segments in the reference genome. In the context of the present invention, a segment refers to the bases between a starting position and an ending position in the genome. For example, at1g33500: 1-10000 represents that the segment is named at1g33500, and that the starting and ending positions of bases in the segment are 1 and 10000 respectively.
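The segment notation above can be illustrated with a minimal Python sketch. It assumes the label format "name: start-end" with 1-based inclusive positions; the names `Segment`, `parse_segment` and `build_index` are hypothetical and are not taken from the specification.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    name: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive (assumed convention)


def parse_segment(label: str) -> Segment:
    """Parse a label such as 'at1g33500: 1-10000' into a Segment."""
    name, _, span = label.partition(":")
    start_text, _, end_text = span.strip().partition("-")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid segment range in {label!r}")
    return Segment(name.strip(), start, end)


def build_index(labels):
    """Map each segment name to its (start, end) position in the reference genome."""
    return {seg.name: (seg.start, seg.end) for seg in map(parse_segment, labels)}
```

A dictionary keyed by segment name is only one possible index layout; a sorted list of start positions would serve equally well for range lookups.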
In the context of the present invention, for the sake of convenience, segments are defined according to biological functions of various bases in the genome, or segments are defined in other manners. Detailed description will be presented below in this regard.
In step S406, the genome is aligned with the reference genome based on the multiple segments, so as to identify difference data between the genome and the reference genome. Since a genome consists of a huge amount of bases, by taking each segment among the multiple segments as a unit, the base sequence in each segment of the reference genome is aligned with the to-be-compressed genome; when a portion that matches the segment is found in the to-be-compressed genome, only differences between the portion and a character sequence in the segment are recorded.
Finally in step S408, a compressed genome is generated, the compressed genome comprising at least the index and the difference data. Since the compressed genome does not include a base sequence that is the same as the reference genome, the space occupied by the compressed genome can be reduced greatly. When the reference database consists of only one reference genome, the compressed genome does not have to include an identifier of the reference genome; when the reference database consists of multiple reference genomes, the compressed genome should include identifiers of these reference genomes, so that it can be found through the identifiers which reference genome is used in compression.
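As a rough illustration of what a compressed genome might hold, the following Python sketch stores the index, the difference data and an optional reference identifier. The class name, field names and the choice to store only the identifier of the reference genome actually used are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class CompressedGenome:
    # segment name -> (start, end) in the reference genome
    index: Dict[str, Tuple[int, int]]
    # segment name -> difference records associated with that segment
    differences: Dict[str, List[tuple]] = field(default_factory=dict)
    # identifier of the reference genome used; None when the database
    # holds a single reference genome and no identifier is needed
    reference_id: Optional[str] = None


def make_compressed(index, differences, reference_ids, chosen_id):
    """Build a CompressedGenome, storing an identifier only when it is needed."""
    ref = chosen_id if len(reference_ids) > 1 else None
    return CompressedGenome(index=index, differences=differences, reference_id=ref)
```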
In addition, new reference genomes may be added to the reference database gradually; for example, the reference database may be gradually updated during genome compression. Specifically, with respect to a newly inputted genome A, when no reference genome with a higher similarity can be found in the reference database, it may be considered that genome A may belong to a new species, and thus genome A may be added to a candidate list. When genomes in the candidate list amount to a certain number, a clustering method may be used and the most representative to-be-compressed genome obtained from clustering may be added to the reference database.
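The update procedure described above may be sketched as follows. The specification only says "a clustering method"; this sketch swaps in a simple medoid choice (the candidate with the smallest total distance to the others) and a position-wise mismatch count as the distance, both of which are assumptions. The function names and the parameters `max_distance` and `batch_size` are hypothetical.

```python
def hamming(a, b):
    """Position-wise mismatch count, plus the length difference (a simplification)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def medoid(genomes):
    """Return the genome with the smallest total distance to all the others."""
    return min(genomes, key=lambda g: sum(hamming(g, other) for other in genomes))


def update_database(database, candidates, genome, max_distance, batch_size):
    """Add `genome` to the candidate list when no reference is close enough.

    Both lists are modified in place. Returns True when a new reference
    genome was promoted from the candidate list to the database.
    """
    if database and min(hamming(genome, ref) for ref in database) <= max_distance:
        return False  # an adequate reference genome already exists
    candidates.append(genome)
    if len(candidates) < batch_size:
        return False
    database.append(medoid(candidates))
    candidates.clear()
    return True
```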
In addition, the purpose of including the index in the compressed genome is as follows: when there is a need to access only bases in a specific position range in the compressed genome, the portion corresponding to the specific position range can be quickly found among the difference data by means of the index, and partial decompression is then conducted based on the reference genome and the corresponding portion of the difference data, rather than decompressing the whole genome and then finding the specified position range therein.
In one embodiment of the present invention, the selecting from a reference database a reference genome that matches the genome comprises: selecting the reference genome based on at least one of at least one phenotypic trait characterizing reference genomes in the reference database and at least one predefined sequence in reference genomes in the reference database.
Various approaches may be used to find in a reference database a reference genome that matches the genome. Specifically, the reference database may further include additional information describing at least one phenotypic trait of each reference genome, and the phenotypic trait may include multiple aspects, such as skin color, hair color for humans. Therefore, the phenotypic trait characterizing each reference genome may be described by a multi-dimensional vector VPT= (pt1, pt2, ... ) . In addition, 10 levels from 1 to 10 may be set to describe colors from white to black. Therefore, the multi-dimensional vector may be represented as VPT= (2, 3, ... ) . Phenotypic traits in the reference database may be stored in a format as shown in Table 1 below.
Table 1 Phenotypic Trait
Reference Genome No.    Skin Color    Hair Color    ...
1                       2             3             ...
2                       3             9             ...
...                     ...           ...           ...
Since phenotypic traits of the to-be-compressed genome can be collected, a reference genome that is similar to the to-be-compressed genome can be selected by comparing phenotypic traits of the to-be-compressed genome and each reference genome. In one embodiment of the present invention, the selecting the reference genome comprises: calculating a first similarity between the at least one phenotypic trait characterizing the genome and at least one phenotypic trait characterizing a reference genome in the reference database; and selecting the reference genome with the first similarity larger than a first threshold.
Those skilled in the art may adopt various approaches to calculating the similarity. For example, a Euclidean distance between a vector V1 describing the phenotypic trait of the to-be-compressed genome and a vector V2 describing a phenotypic trait of a reference genome in the reference database is calculated, and the first similarity is derived from it (a smaller distance indicating a higher similarity). Alternatively, if the importance of a certain phenotypic trait is considered to be higher, a higher weight may be assigned to the phenotypic trait.
The reference genome with the first similarity larger than a first threshold may be selected; or when there exist multiple reference genomes each having a similarity larger than the first threshold, the reference genome with the highest similarity may be selected. Those skilled in the art may further adopt other approaches to selecting the reference genome.
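The trait-based selection can be sketched in Python as follows. The mapping from a (weighted) Euclidean distance to a similarity, 1 / (1 + distance), is an assumption chosen so that a larger value means a closer match; the specification does not prescribe it. The function names and the example trait vector in the usage note are hypothetical.

```python
import math


def trait_similarity(v1, v2, weights=None):
    """Similarity in (0, 1] derived from a weighted Euclidean distance."""
    if len(v1) != len(v2):
        raise ValueError("trait vectors must have the same dimension")
    if weights is None:
        weights = [1.0] * len(v1)
    distance = math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, v1, v2)))
    return 1.0 / (1.0 + distance)


def select_reference(target, references, threshold):
    """Return the id of the most similar reference above `threshold`, or None.

    `references` maps a reference genome number to its trait vector.
    """
    best_id, best_score = None, threshold
    for ref_id, traits in references.items():
        score = trait_similarity(target, traits)
        if score > best_score:
            best_id, best_score = ref_id, score
    return best_id
```

With the two rows of Table 1 as `{1: (2, 3), 2: (3, 9)}` and a to-be-compressed genome described by the made-up vector `(2, 4)`, reference genome 1 lies at distance 1 and is the closer match.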
In one embodiment of the present invention, the selecting the reference genome comprises: with respect to a current reference genome in the reference database, determining a first position set of the at least one predefined sequence in the genome, and determining a second position set of the at least one predefined sequence in the current reference genome; calculating a second similarity between the first position set and the second position set; and selecting the reference genome based on the second similarity.
If it is impossible to select the reference genome based on phenotypic traits, then the reference genome may be selected based on the similarity between positions of the predefined sequence in the to-be-compressed genome and the reference genome. In the context of the present invention, the predefined sequence may be a base sequence that only exerts little impact on the division of species. For example, since humans belong to mammals, human genomes include some conserved base sequence segments that are the same as lower mammals; although humans can further be categorized into the white race, the yellow race and other races, genomes of each race include these conserved base sequence segments.
Biologists have now successfully identified conserved base sequence segments that are correlated to each species. By comparing the similarity between positions of these conserved base sequence segments in the to-be-compressed genome and in the reference genome, the species to which the to-be-compressed genome belongs can be inferred approximately, which in turn helps to select a reference genome that is more similar to the to-be-compressed genome.
For humans, suppose multiple conserved base sequence segments have been identified; positions of these conserved base sequences in various reference genomes can then be stored in a structure as shown in Table 2 below.
Table 2 Conserved Base Sequence Segments
Figure PCTCN2014088400-appb-000001
Like the above concrete example shown with reference to phenotypic traits, positions of multiple conserved base sequence segments in one genome may be described by a vector, e.g., VSM= (position 1, position 2, ...). A first position set (e.g., represented as a vector VSM1) of the multiple conserved base sequence segments in the genome is determined, a second position set (e.g., represented as a vector VSM2) of the multiple conserved base sequence segments in the reference genome is also determined, and a similarity between the two vectors is calculated whereby the reference genome is selected.
Like the above approach to selecting the reference genome based on phenotypic traits, in one embodiment of the present invention, the reference genome may be selected based on an approach to calculating a Euclidean distance or other approaches.
The accuracy of selecting the reference genome based on positions of conserved base sequence segments is yet to be improved, so first multiple reference genomes each having a similarity larger than a specific threshold may be selected from the reference database, and then the most appropriate reference genome is selected from the multiple reference genomes.
In one embodiment of the present invention, the selecting the reference genome based on the second similarity comprises: adding to a candidate list a reference genome with the second similarity larger than a second threshold; and comparing the genome with each reference genome in the candidate list so as to select from the candidate list a reference genome with a minimal difference from the genome.
In one embodiment of the present invention, a Multiple Sequence Alignment (MSA) may be used to compare the to-be-compressed genome with multiple candidate reference genomes. The Multiple Sequence Alignment is an alignment of three or more biological sequences (protein, DNA, etc.). For more details of the Multiple Sequence Alignment, reference may be made to http://en.wikipedia.org/wiki/Multiple_sequence_alignment, which is not detailed in this specification.
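The two-stage selection (candidate list by position similarity, then the candidate with the minimal difference) may be sketched as below. A real implementation would use a Multiple Sequence Alignment for the second stage; this sketch deliberately substitutes a naive position-wise mismatch count so that it stays self-contained. The similarity formula 1 / (1 + distance) and all function names are assumptions.

```python
import math


def position_similarity(positions_a, positions_b):
    """Similarity in (0, 1] derived from the Euclidean distance of two position vectors."""
    return 1.0 / (1.0 + math.dist(positions_a, positions_b))


def mismatch_count(a, b):
    """Naive stand-in for an alignment-based difference measure."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def choose_reference(genome, genome_positions, references, second_threshold):
    """Pick a reference genome id, or None when no candidate passes the threshold.

    `references` maps an id to a (sequence, positions) pair, where `positions`
    holds the positions of the predefined conserved sequences in that genome.
    """
    candidates = [
        ref_id
        for ref_id, (_, positions) in references.items()
        if position_similarity(genome_positions, positions) > second_threshold
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda rid: mismatch_count(genome, references[rid][0]))
```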
In one embodiment of the present invention, multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.
In addition to the above phenotypic traits and conserved base sequence segments, the reference database may further include annotation information associated with each reference genome. The annotation information mentioned here may be, for example, annotation information that describes functions of a base sequence between certain starting and ending positions. For example, suppose a base sequence between a starting position of 1 and an ending position of 10000 is correlated to human skin color, then annotation may be added with respect to the base sequence between positions 1-10000, indicating this base portion is correlated to human skin color. In addition, other types of annotation may be added to base sequences at other positions in the genome.
So far biologists have determined the functions of some base sequences and added many annotations to genomes. Therefore, segments may be defined based on starting and ending positions of base sequences associated with these annotations. In addition, since annotations are added to only a part of the base sequences in genomes, the other parts without annotation information may be divided according to a predefined step-length. For example, division may be conducted in units of 1000 bases. Other predefined step-lengths may also be set.
Fig. 5 schematically shows a schematic view 500 of the process for building an index according to one embodiment of the present invention. As shown in Fig. 5, an annotation 1 520 and an annotation 2 522 represent two annotations of a reference genome 510 in a reference database. Starting and ending positions of a base sequence corresponding to annotation 1 520 in the entire genome are position 1 540 and position 2 542, respectively. Therefore, the portion between position 1 540 and position 2 542 may act as one segment (e.g., a segment N 530). In addition, starting and ending positions of a base sequence corresponding to annotation 2 522 in the entire genome are position 2 542 and a position 3 544, respectively. Therefore, the portion between position 2 542 and position 3 544 may act as another segment (e.g., a segment N+1 532). Similarly, other portions without annotation information in reference genome 510 may further be divided into segments according to a predefined step 524, whereby a segment N+2 534 is obtained. In this manner, entire reference genome 510 may be divided into multiple segments.
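The division into annotated segments and fixed-step segments can be sketched as follows. The sketch assumes 1-based inclusive positions and non-overlapping annotations; the function name `define_segments` and the generated names such as "step1" are hypothetical.

```python
def define_segments(genome_length, annotations, step):
    """Split positions 1..genome_length into segments.

    `annotations` is a list of (name, start, end) tuples. Annotated ranges
    become segments as they are; unannotated gaps are cut into pieces of at
    most `step` bases.
    """
    if step < 1:
        raise ValueError("step must be positive")
    segments = []
    counter = 0

    def fill(gap_start, gap_end):
        # Cut an unannotated gap into fixed-step segments.
        nonlocal counter
        pos = gap_start
        while pos <= gap_end:
            end = min(pos + step - 1, gap_end)
            counter += 1
            segments.append((f"step{counter}", pos, end))
            pos = end + 1

    cursor = 1
    for name, start, end in sorted(annotations, key=lambda a: a[1]):
        fill(cursor, start - 1)
        segments.append((name, start, end))
        cursor = end + 1
    fill(cursor, genome_length)
    return segments
```

For example, with a 25-base reference, annotations covering 1-10 and 11-15, and a step of 4, the remaining bases 16-25 are split into 16-19, 20-23 and 24-25.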
In one embodiment of the present invention, one of the multiple segments in the reference genome may act as a basic unit for alignment with the to-be-compressed genome; further, one sub-segment in a segment may act as a basic unit for alignment. Alignment with a sub-segment as the basic unit possibly helps to enhance the probability of matching but also might complicate the index.
In one embodiment of the present invention, the aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome comprises: with respect to a sub-segment of a current segment among the multiple segments, looking up in the genome a core area that is similar to text of the sub-segment; taking text difference between the core area and the sub-segment as at least one part of the difference data; and adding to the difference data other part than the core area in the genome.
Figs. 6A to 6C schematically show respective schematic views 600A to 600C for identifying difference data between the genome and the reference genome according to one embodiment of the present invention. Each of the multiple segments may be aligned with the to-be-compressed genome. Description is presented below to the process regarding how to align one segment with the to-be-compressed genome only.
Specifically, the comparison may be made based on an n-gram and using a sliding window. Since the genome is a character sequence consisting of the four bases A, G, T and C with a length on the order of billions, an analysis may be conducted by means of n-grams as used in probabilistic language models. For more details of the n-gram, reference may be made to http://en.wikipedia.org/wiki/N-gram, which is not detailed in this specification.
In one embodiment of the present invention, based on a sum of scores corresponding to multiple n-grams in a current segment, an area whose sum is larger than a predefined threshold is used as the core area. Suppose a 3-gram (i. e. , alignment is made with 3 bases as the basic unit) is used in one embodiment, and a score of each n-gram is calculated based on a BLOSUM matrix in this embodiment. Fig. 6A shows a score calculated with respect to each of 3-grams in a base sequence “ATGCGT....” Specifically, scores of these four basic units 3-gram 1 to 3-gram 4 are 13, 16, 14 and 18, respectively.
Fig. 6B shows how to calculate a score indicating whether the to-be-compressed genome is similar to a sub-segment in the current segment. Take 3-grams for example. When scores of 3-grams in the to-be-compressed genome and in the sub-segment are the same, the total score is +2; when the scores are different, the total score is -3. By comparing a to-be-compressed genome 610B with a sub-segment in a current segment 612B in Fig. 6B, the total score =2+2+2+2+2+2+2-3+2+2+2+2+2+2+2-3+2=24. When the total score exceeds a predefined threshold, it may be considered that the base sequence in the to-be-compressed genome is a core area that is similar to text of the sub-segment.
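The sliding-window scoring can be sketched in Python as follows. For simplicity the sketch compares overlapping n-grams directly instead of comparing per-n-gram scores taken from a scoring matrix, which is an assumption; the +2 / -3 values follow the example above, and the function names are hypothetical.

```python
def ngrams(seq, n=3):
    """Return the overlapping n-grams of `seq`, e.g. ATGCGT -> ATG, TGC, GCG, CGT."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def window_score(window, sub_segment, n=3, match=2, mismatch=-3):
    """Sum +2 for each matching n-gram pair and -3 for each differing pair."""
    return sum(
        match if a == b else mismatch
        for a, b in zip(ngrams(window, n), ngrams(sub_segment, n))
    )


def find_core_area(genome, sub_segment, threshold, n=3):
    """Slide a window over `genome` and return (start, score) of the best
    window whose score exceeds `threshold` (0-based start), or None."""
    width = len(sub_segment)
    best = None
    for start in range(len(genome) - width + 1):
        score = window_score(genome[start:start + width], sub_segment, n)
        if score > threshold and (best is None or score > best[1]):
            best = (start, score)
    return best
```

The total of 24 in the example of Fig. 6B corresponds to 15 matching pairs and 2 differing pairs: 15 × 2 + 2 × (-3) = 24.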
After finding the core area that is similar to text of the sub-segment of the current segment, text difference between the core area and the sub-segment of the current segment is looked for, and the found text difference acts as one part of the difference data. Specifically, Fig. 6C shows a concrete example of the difference data. As shown in block 620C in Fig. 6C, the base in a to-be-compressed genome 610C differs from the base in a current segment 620C (i. e. , there exists text difference) , and the difference may be recorded as (c, A, 15) , where “c” represents a change-type difference, “A” represents changing the base in the reference genome to a base “A” , and the difference appears in the 15th base.
Similarly, for a difference shown in block 622C in Fig. 6C, it may be represented as (d, T, 9) , where “d” represents a delete-type difference, “T” represents deleting a base “T, ” and the difference appears in the 9th base after the last difference. Similarly, those skilled in the art may further define an insert-type difference.
Note the difference data may be saved in a manner associated with the current segment. For example, the index may include an association between the difference data and the current segment, i. e. , the association may represent to which segment in the reference genome the difference data corresponds. Specifically, an identifier of a segment associated with difference data may be added to the header of the difference data. For example, suppose difference data (d, T, 9) shown in block 622C in Fig. 6C are associated with a segment “seg1” in the reference genome, then the difference data may be recorded as “seg1 (d, T, 9) ” . Note here given is only an example of the representation of difference data, and those skilled in the art may further use other data structure to record difference data, e. g. , recording in a 4-tuple form.
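The difference records can be applied to a reference segment as in the following sketch. It assumes that each record is a (type, base, offset) tuple, that the offset is counted in reference positions from the previous difference (or from the segment start for the first record), and that an insert-type record "i" places the base after the reference base at that position. These conventions and the function name are assumptions made for illustration.

```python
def apply_differences(reference, records):
    """Rebuild a genome segment from a reference segment and difference records.

    Record types: 'c' changes the reference base to `base`, 'd' deletes the
    reference base `base`, 'i' inserts `base` after the reference base.
    """
    out = []
    cursor = 0  # 0-based index of the next unread reference base
    last = 0    # 1-based reference position of the previous difference
    for kind, base, offset in records:
        target = last + offset  # 1-based reference position of this difference
        if kind == "c":
            out.append(reference[cursor:target - 1])
            out.append(base)
        elif kind == "d":
            if reference[target - 1] != base:
                raise ValueError(f"expected {base!r} at position {target}")
            out.append(reference[cursor:target - 1])
        elif kind == "i":
            out.append(reference[cursor:target])
            out.append(base)
        else:
            raise ValueError(f"unknown difference type {kind!r}")
        cursor = target
        last = target
    out.append(reference[cursor:])
    return "".join(out)
```

With the records (c, A, 15) and (d, T, 9), the second difference falls on reference position 15 + 9 = 24 under these conventions.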
In one embodiment of the present invention, when difference data correspond to a core area in a to-be-processed genome which is similar to text of a sub-segment of a current segment (i. e. , difference data inside a core area) , the current segment may be used as a segment associated with the difference data. In addition, when difference data are other data than various core areas in a to-be-processed genome (i. e. , difference data outside core areas) , a segment corresponding to a core area preceding (or following) the difference data may be used as a segment associated with the difference data. In this manner, a correspondence relationship among difference data and segments in the reference genome may be recorded explicitly. Based on the correspondence relationship and index, a specific portion in the compressed genome may be decompressed conveniently.
With the example described above, each segment (or each sub-segment of a segment) in the reference genome may be aligned with the to-be-compressed genome so as to find a corresponding core area and record text difference between each core area and a corresponding current segment (or sub-segment of a segment). For other portions than core areas in the to-be-compressed genome, it may be considered that no base sequence that is similar to these portions exists in the reference genome, so these portions may be added to the difference data directly.
In one embodiment of the present invention, with respect to the sub-segment of the current segment among the multiple segments, the core area is expanded forward and/or backward in the genome; and in response to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area being lower than a third threshold, the expanded core area is used as the final matching area.
Description has been presented above to how to find a core area. Alternatively, the core area may further be expanded forward and/or backward in the to-be-compressed genome. For example, the core area is expanded with one base as a step-length at a time. For example, a comparison is made as to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area; when the difference is lower than a predefined threshold, the core area is expanded. Note the expansion should not be implemented without limit but aims to enhance the compression ratio.
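The expansion step may be sketched as follows. The sketch assumes an ungapped comparison, expresses the text difference as the ratio of mismatching bases in the expanded area, and expands backward first and then forward, one base at a time; these choices, the function name and the parameter `max_diff_ratio` are assumptions.

```python
def expand_core(genome, reference, g_start, r_start, length, max_diff_ratio):
    """Expand a matching area one base at a time while the mismatch ratio
    of the expanded area stays below `max_diff_ratio`.

    The area covers genome[g_start:g_start+length] and
    reference[r_start:r_start+length] (0-based). Returns the expanded
    (g_start, r_start, length).
    """
    mismatches = sum(
        genome[g_start + i] != reference[r_start + i] for i in range(length)
    )
    # Expand backward.
    while g_start > 0 and r_start > 0:
        extra = genome[g_start - 1] != reference[r_start - 1]
        if (mismatches + extra) / (length + 1) >= max_diff_ratio:
            break
        mismatches += extra
        g_start, r_start, length = g_start - 1, r_start - 1, length + 1
    # Expand forward.
    while g_start + length < len(genome) and r_start + length < len(reference):
        extra = genome[g_start + length] != reference[r_start + length]
        if (mismatches + extra) / (length + 1) >= max_diff_ratio:
            break
        mismatches += extra
        length += 1
    return g_start, r_start, length
```

The threshold bounds the expansion: a looser `max_diff_ratio` lets the area absorb a few mismatching bases, which then have to be recorded as difference data.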
In one embodiment of the present invention, there is provided a method for genome decompression, comprising: in response to receiving a compressed genome that has been compressed according to a method of the present invention, obtaining from a reference database a reference genome that matches the compressed genome; and decompressing, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.
Fig. 7 schematically shows a flowchart 700 of a method for decompressing a compressed genome according to one embodiment of the present invention. Specifically, in step S702, in response to receiving a compressed genome that has been compressed according to a method of the present invention, a reference genome that matches the compressed genome is obtained from a reference database. Since an index of the compressed genome saves information of the reference genome similar to the genome, the reference genome may be obtained via the information from the reference database.
Next in step S704, the compressed genome is decompressed, according to the index in the compressed genome, based on difference data between the reference genome and the compressed genome. In addition, since the compressed genome saves the difference data between the genome and the reference genome, the difference data may be applied to the reference genome so as to restore a raw genome from the compressed genome.
In one embodiment of the present invention, there is further comprised: in response to a request for access to a specified portion in the compressed genome, searching for difference data corresponding to the specified portion in the difference data according to the index; and decompressing the specified portion based on the difference information and the reference genome.
As described in the above procedure for building the index, one skilled in the art may understand that the index indicates to which segment or segments in the reference genome respective portions of the difference data correspond. In this manner, a certain portion of the compressed genome may be decompressed conveniently.
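Partial decompression through the index can be sketched as follows. To stay short, the sketch assumes that the difference data contain only substitutions, stored per segment as (position within segment, new base) pairs with 1-based positions, so that genome and reference coordinates coincide; handling deletions and insertions would require additional offset bookkeeping. All names are hypothetical.

```python
def decompress_segment(reference, index, differences, name):
    """Restore one segment from the reference genome and its difference data."""
    start, end = index[name]
    bases = list(reference[start - 1:end])
    for pos, base in differences.get(name, []):
        bases[pos - 1] = base
    return "".join(bases)


def decompress_range(reference, index, differences, lo, hi):
    """Restore only positions lo..hi (1-based, inclusive), touching only the
    segments that overlap the requested range."""
    pieces = []
    for name, (start, end) in sorted(index.items(), key=lambda item: item[1]):
        if end < lo or start > hi:
            continue  # segment lies outside the requested range
        segment = decompress_segment(reference, index, differences, name)
        pieces.append(segment[max(lo, start) - start:min(hi, end) - start + 1])
    return "".join(pieces)
```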
Fig. 8A schematically shows a block diagram 800A of an apparatus for genome compression according to one embodiment of the present invention. Specifically, there is provided an apparatus for genome compression, comprising: a selecting module 810A configured to select from a reference database a reference genome that matches the genome; an indexing module 820A configured to build an index based on positions of the reference genome’s multiple segments in the reference genome; an aligning module 830A configured to align the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and a generating module 840A configured to generate a compressed genome, the compressed genome comprising at least the index and the difference data.
In one embodiment of the present invention, selecting module 810A comprises at least one of: a first selecting module configured to select the reference genome based on at least one phenotypic trait characterizing reference genomes in the reference database; and a second selecting module configured to select the reference genome based on at least one predefined sequence included in reference genomes in the reference database.
In one embodiment of the present invention, the first selecting module comprises: a calculating module configured to calculate a first similarity between the at least one phenotypic trait characterizing the genome and at least one phenotypic trait characterizing a reference genome in the reference database; and a first selecting unit configured to select the reference genome with the first similarity larger than a first threshold.
In one embodiment of the present invention, the second selecting module comprises: a position determining module configured to, with respect to a current reference genome in the reference database, determine a first position set of the at least one predefined sequence in the genome, and determine a second position set of the at least one predefined sequence in the current reference genome; a position similarity calculating module configured to calculate a second similarity between the first position set and the second position set; and a second selecting unit configured to select the reference genome based on the second similarity.
In one embodiment of the present invention, the second selecting module comprises: a candidate list generating module configured to add to a candidate list a reference genome with the second similarity larger than a second threshold; and a multiple sequence comparing module configured to compare the genome with each reference genome in the candidate list so as to select from the candidate list a reference genome with a minimal difference from the genome.
In one embodiment of the present invention, the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.
In one embodiment of the present invention, aligning module 830A comprises: a core area generating module configured to, with respect to a sub-segment of a current segment among the multiple segments, look up in the genome a core area that is similar to text of the sub-segment; a first difference data generating module configured to take text difference between the core area and the sub-segment as at least one part of the difference data; and a second difference data generating module configured to add to the difference data other part than the core area in the genome.
In one embodiment of the present invention, the core area generating module further comprises: a first expanding module configured to, with respect to the sub-segment of the current segment among the multiple segments, expand the core area forward and/or backward in the genome; and a second expanding module configured to, in response to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area being lower than a third threshold, use the expanded core area as an expanded core area.
Fig. 8B schematically shows a block diagram 800B of an apparatus for decompressing a compressed genome according to one embodiment of the present invention. Specifically, there is provided an apparatus for genome decompression, comprising: an obtaining module 810B configured to, in response to receiving a compressed genome that has been compressed according to a method of the present invention, obtain from a reference database a reference genome that matches the compressed genome; and a decompressing module 820B configured to decompress, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.
In one embodiment of the present invention, there is further comprised: a locating module configured to, in response to a request for access to a specified portion in the compressed genome, search for difference data corresponding to the specified portion in the difference data according to the index; and a partial decompressing module configured to decompress the specified portion based on the difference information and the reference genome.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function (s) . It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks illustrated in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (21)

  1. A method for genome compression, comprising:
    selecting from a reference database a reference genome that matches the genome;
    building an index based on positions of the reference genome’s multiple segments in the reference genome;
    aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and
    generating a compressed genome, the compressed genome comprising at least the index and the difference data.
  2. The method according to Claim 1, wherein the selecting from a reference database a reference genome that matches the genome comprises:
    selecting the reference genome based on at least one of at least one phenotypic trait characterizing reference genomes in the reference database and at least one predefined sequence included in reference genomes in the reference database.
  3. The method according to Claim 2, wherein the selecting the reference genome comprises:
    calculating a first similarity between the at least one phenotypic trait characterizing the genome and at least one phenotypic trait characterizing a reference genome in the reference database; and
    selecting the reference genome with the first similarity larger than a first threshold.
  4. The method according to Claim 2, wherein the selecting the reference genome comprises: with respect to a current reference genome in the reference database,
    determining a first position set of the at least one predefined sequence in the genome, and determining a second position set of the at least one predefined sequence in the current reference genome;
    calculating a second similarity between the first position set and the second position set; and
    selecting the reference genome based on the second similarity.
  5. The method according to Claim 4, wherein the selecting the reference genome based on the second similarity comprises:
    adding to a candidate list a reference genome with the second similarity larger than a second threshold; and
    comparing the genome with each reference genome in the candidate list so as to select from the candidate list a reference genome with a minimal difference from the genome.
  6. The method according to any of Claims 1-5, wherein the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.
  7. The method according to any of Claims 1-5, wherein the aligning the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome comprises:
    with respect to a sub-segment of a current segment among the multiple segments,
    looking up in the genome a core area that is similar to text of the sub-segment;
    taking text difference between the core area and the sub-segment as at least one part of the difference data; and
    adding to the difference data other part than the core area in the genome.
  8. The method according to Claim 7, further comprising: with respect to the sub-segment of the current segment among the multiple segments,
    expanding the core area forward and/or backward in the genome; and
    in response to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area being lower than a third threshold, using the expanded core area as an expanded core area.
  9. A method for genome decompression, comprising:
    in response to receiving a compressed genome that has been compressed according to a method according to any of Claims 1-8, obtaining from a reference database a reference genome that matches the compressed genome; and
    decompressing, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.
  10. The method according to Claim 9, further comprising:
    in response to a request for access to a specified portion in the compressed genome, searching for difference data corresponding to the specified portion in the difference data according to the index; and
    decompressing the specified portion based on the difference data and the reference genome.
  11. An apparatus for genome compression, comprising:
    a selecting module configured to select from a reference database a reference genome that matches the genome;
    an indexing module configured to build an index based on positions of the reference genome’s multiple segments in the reference genome;
    an aligning module configured to align the genome with the reference genome based on the multiple segments so as to identify difference data between the genome and the reference genome; and
    a generating module configured to generate a compressed genome, the compressed genome comprising at least the index and the difference data.
  12. The apparatus according to Claim 11, wherein the selecting module comprises at least one of:
    a first selecting module configured to select the reference genome based on at least one phenotypic trait characterizing reference genomes in the reference database; and
    a second selecting module configured to select the reference genome based on at least one predefined sequence included in reference genomes in the reference database.
  13. The apparatus according to Claim 12, wherein the first selecting module comprises:
    a calculating module configured to calculate a first similarity between the at least one phenotypic trait characterizing the genome and at least one phenotypic trait characterizing a reference genome in the reference database; and
    a first selecting unit configured to select the reference genome with the first similarity larger than a first threshold.
  14. The apparatus according to Claim 12, wherein the second selecting module comprises:
    a position determining module configured to, with respect to a current reference genome in the reference database, determine a first position set of the at least one predefined sequence in the genome, and determine a second position set of the at least one predefined sequence in the current reference genome;
    a position similarity calculating module configured to calculate a second similarity between the first position set and the second position set; and
    a second selecting unit configured to select the reference genome based on the second similarity.
  15. The apparatus according to Claim 14, wherein the second selecting unit comprises:
    a candidate list generating module configured to add to a candidate list a reference genome with the second similarity larger than a second threshold; and
    a multiple sequence comparing module configured to compare the genome with each reference genome in the candidate list so as to select from the candidate list a reference genome with a minimal difference from the genome.
  16. The apparatus according to any of Claims 11-15, wherein the multiple segments of the reference genome are defined based on at least one of annotation associated with the reference genome and a predefined step-length.
  17. The apparatus according to any of Claims 11-15, wherein the aligning module comprises:
    a core area generating module configured to, with respect to a sub-segment of a current segment among the multiple segments, look up in the genome a core area that is similar to text of the sub-segment;
    a first difference data generating module configured to take text difference between the core area and the sub-segment as at least one part of the difference data; and
    a second difference data generating module configured to add to the difference data other part than the core area in the genome.
  18. The apparatus according to Claim 17, wherein the core area generating module further comprises:
    a first expanding module configured to, with respect to the sub-segment of the current segment among the multiple segments, expand the core area forward and/or backward in the genome; and
    a second expanding module configured to, in response to text difference between the expanded core area and an area in the reference genome corresponding to the expanded core area being lower than a third threshold, use the expanded core area as an expanded core area.
  19. An apparatus for genome decompression, comprising:
    an obtaining module configured to, in response to receiving a compressed genome that has been compressed according to a method according to any of Claims 1-8, obtain from a reference database a reference genome that matches the compressed genome; and
    a decompressing module configured to decompress, according to an index in the compressed genome, the compressed genome based on difference data between the reference genome and the compressed genome.
  20. The apparatus according to Claim 19, further comprising:
    a locating module configured to, in response to a request for access to a specified portion in the compressed genome, search for difference data corresponding to the specified portion in the difference data according to the index; and
    a partial decompressing module configured to decompress the specified portion based on the difference data and the reference genome.
  21. A computer program comprising program code adapted to perform the method steps of any of claims 1 to 10 when said program is run on a computer.
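
The two-stage reference selection recited in Claims 4 and 5 can be illustrated with a short Python sketch. The claims do not specify how the similarity between position sets or the final difference is measured, so this example assumes a Jaccard similarity over `(sequence, position)` pairs and a simple position-wise mismatch count; both measures, and the names `position_similarity` and `select_reference`, are stand-ins chosen for the example rather than the claimed method.

```python
from typing import Dict, List, Optional, Set


def positions(genome: str, motif: str) -> Set[int]:
    """All start positions (overlaps allowed) of a predefined sequence."""
    found = set()
    start = genome.find(motif)
    while start != -1:
        found.add(start)
        start = genome.find(motif, start + 1)
    return found


def position_similarity(genome: str, reference: str, motifs: List[str]) -> float:
    """Assumed measure: Jaccard similarity of the two position sets."""
    a = {(m, p) for m in motifs for p in positions(genome, m)}
    b = {(m, p) for m in motifs for p in positions(reference, m)}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def difference(genome: str, reference: str) -> int:
    """Assumed measure: position-wise mismatches plus the length difference."""
    mismatches = sum(x != y for x, y in zip(genome, reference))
    return mismatches + abs(len(genome) - len(reference))


def select_reference(genome: str, references: Dict[str, str],
                     motifs: List[str], threshold: float) -> Optional[str]:
    # Stage 1: cheap filter on position-set similarity builds the candidate list.
    candidates = [name for name, ref in references.items()
                  if position_similarity(genome, ref, motifs) > threshold]
    if not candidates:
        return None
    # Stage 2: full comparison, restricted to the candidate list.
    return min(candidates, key=lambda name: difference(genome, references[name]))


genome = "ACGTTTACGA"
references = {"r1": "ACGTTTACGT", "r2": "ACGAATACGT", "r3": "TTTTTTTTTT"}
print(select_reference(genome, references, ["ACG"], 0.5))
```

The point of the two stages is that the inexpensive position-set filter removes unlikely references before the more costly full comparison is run on the remaining candidates.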
PCT/CN2014/088400 2013-12-06 2014-10-11 Genome compression and decompression WO2015081754A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
DE112014005580.8T DE112014005580T5 (en) 2013-12-06 2014-10-11 Genome compression and decompression
US15/101,946 US10679727B2 (en) 2013-12-06 2014-10-11 Genome compression and decompression

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310655168.1 2013-12-06
CN201310655168.1A CN104699998A (en) 2013-12-06 2013-12-06 Method and device for compressing and decompressing genome

Publications (1)

Publication Number Publication Date
WO2015081754A1 (en) 2015-06-11

Family

ID=53272848

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/088400 WO2015081754A1 (en) 2013-12-06 2014-10-11 Genome compression and decompression

Country Status (4)

Country Link
US (1) US10679727B2 (en)
CN (1) CN104699998A (en)
DE (1) DE112014005580T5 (en)
WO (1) WO2015081754A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407743A (en) * 2016-08-31 2017-02-15 上海美吉生物医药科技有限公司 Cluster-based high-throughput data analyzing method
WO2017158330A1 (en) * 2016-03-15 2017-09-21 Genomics Plc Compression/decompression method and apparatus for genomic variant call data
CN110168651A (en) * 2016-10-11 2019-08-23 基因组***公司 Method and system for selective access storage or transmission biological data
US10679727B2 (en) 2013-12-06 2020-06-09 International Business Machines Corporation Genome compression and decompression
US10854314B2 (en) 2014-05-15 2020-12-01 Codondex Llc Systems, methods, and devices for analysis of genetic material
US11017881B2 (en) 2014-05-15 2021-05-25 Codondex Llc Systems, methods, and devices for analysis of genetic material
US11515011B2 (en) 2019-08-09 2022-11-29 International Business Machines Corporation K-mer based genomic reference data compression

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10191929B2 (en) 2013-05-29 2019-01-29 Noblis, Inc. Systems and methods for SNP analysis and genome sequencing
US10560552B2 (en) * 2015-05-21 2020-02-11 Noblis, Inc. Compression and transmission of genomic information
CN105049055B (en) * 2015-06-30 2019-04-05 郑州宇通客车股份有限公司 A kind of data compression method and data decompressing method
CN107633158B (en) * 2016-07-18 2020-12-01 三星(中国)半导体有限公司 Method and apparatus for compressing and decompressing gene sequences
WO2018127821A1 (en) * 2017-01-06 2018-07-12 Codondex Llc Systems, methods, and devices for analysis of genetic material
US11222712B2 (en) 2017-05-12 2022-01-11 Noblis, Inc. Primer design using indexed genomic information
US11163726B2 (en) * 2017-08-31 2021-11-02 International Business Machines Corporation Context aware delta algorithm for genomic files
WO2019076177A1 (en) * 2017-10-20 2019-04-25 人和未来生物科技(长沙)有限公司 Gene sequencing data compression preprocessing, compression and decompression method, system, and computer-readable medium
CN109698703B (en) * 2017-10-20 2020-10-20 人和未来生物科技(长沙)有限公司 Gene sequencing data decompression method, system and computer readable medium
CN110958212B (en) * 2018-09-27 2022-04-12 阿里巴巴集团控股有限公司 Data compression method, data decompression method, device and equipment
CN111916155A (en) * 2019-05-08 2020-11-10 人和未来生物科技(长沙)有限公司 Method, system and medium for compressing and reducing gene data without reference gene sequence
CN110223732B (en) * 2019-05-17 2021-04-06 清华大学 Integration method of multi-class biological sequence annotation
US11922017B2 (en) 2021-04-27 2024-03-05 Apple Inc. Compact genome data storage with random access
CN113268461B (en) * 2021-07-19 2021-09-17 广州嘉检医学检测有限公司 Method and device for gene sequencing data recombination packaging
US20230229633A1 (en) * 2022-01-18 2023-07-20 Dell Products L.P. Adding content to compressed files using sequence alignment
US11977517B2 (en) 2022-04-12 2024-05-07 Dell Products L.P. Warm start file compression using sequence alignment
CN115270169B (en) * 2022-05-18 2023-06-13 蔓之研(上海)生物科技有限公司 Decompression method and system for gene data
WO2024077568A1 (en) * 2022-10-13 2024-04-18 深圳华大智造科技股份有限公司 Construction method for reference sequence, metagenome data compression method, and electronic device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100169107A1 (en) * 2008-12-30 2010-07-01 Samsung Electronics Co., Ltd. Method and apparatus for integrated personal genome management
CN103336916A (en) * 2013-07-05 2013-10-02 中国科学院数学与***科学研究院 Sequencing sequence mapping method and sequencing sequence mapping system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8340914B2 (en) 2004-11-08 2012-12-25 Gatewood Joe M Methods and systems for compressing and comparing genomic data
US8116988B2 (en) 2006-05-19 2012-02-14 The University Of Chicago Method for indexing nucleic acid sequences for computer based searching
US20120089338A1 (en) 2009-03-13 2012-04-12 Life Technologies Corporation Computer implemented method for indexing reference genome
CN101914628B (en) * 2010-09-02 2013-01-09 深圳华大基因科技有限公司 Method and system for detecting polymorphism locus of genome target region
AU2012272161B2 (en) 2011-06-21 2015-12-24 Illumina Cambridge Limited Methods and systems for data analysis
EP2595076B1 (en) 2011-11-18 2019-05-15 Tata Consultancy Services Limited Compression of genomic data
KR101922129B1 (en) 2011-12-05 2018-11-26 삼성전자주식회사 Method and apparatus for compressing and decompressing genetic information using next generation sequencing(NGS)
US10353869B2 (en) * 2012-05-18 2019-07-16 International Business Machines Corporation Minimization of surprisal data through application of hierarchy filter pattern
CN104699998A (en) 2013-12-06 2015-06-10 国际商业机器公司 Method and device for compressing and decompressing genome

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679727B2 (en) 2013-12-06 2020-06-09 International Business Machines Corporation Genome compression and decompression
US10854314B2 (en) 2014-05-15 2020-12-01 Codondex Llc Systems, methods, and devices for analysis of genetic material
US11017881B2 (en) 2014-05-15 2021-05-25 Codondex Llc Systems, methods, and devices for analysis of genetic material
WO2017158330A1 (en) * 2016-03-15 2017-09-21 Genomics Plc Compression/decompression method and apparatus for genomic variant call data
US11823774B2 (en) 2016-03-15 2023-11-21 Genomics, PLC Compression/decompression method and apparatus for genomic variant call data
CN106407743A (en) * 2016-08-31 2017-02-15 上海美吉生物医药科技有限公司 Cluster-based high-throughput data analyzing method
CN110168651A (en) * 2016-10-11 2019-08-23 基因组***公司 Method and system for selective access storage or transmission biological data
US11515011B2 (en) 2019-08-09 2022-11-29 International Business Machines Corporation K-mer based genomic reference data compression

Also Published As

Publication number Publication date
CN104699998A (en) 2015-06-10
DE112014005580T5 (en) 2016-08-11
US10679727B2 (en) 2020-06-09
US20160306919A1 (en) 2016-10-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14868389

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 15101946

Country of ref document: US

Ref document number: 112014005580

Country of ref document: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14868389

Country of ref document: EP

Kind code of ref document: A1