CN105224828A

CN105224828A - Key-value index data compression method for fast positioning of gene sequence fragments

Info

Publication number: CN105224828A
Application number: CN201510648867.2A
Authority: CN
Inventors: 宋卓; 李�根
Original assignee: Human And Future Biotechnology (Changsha) Co Ltd
Current assignee: Human And Future Biotechnology (Changsha) Co Ltd
Priority date: 2015-10-09
Filing date: 2015-10-09
Publication date: 2016-01-06
Anticipated expiration: 2035-10-09
Also published as: CN105224828B

Abstract

The invention discloses a key-value index data compression method for fast positioning of gene sequence fragments. The steps comprise: 1) initializing the compression result set Set_comp and setting the prefix length n used for data compression; 2) taking a current gene sequence fragment Key to be compressed out of the gene sequence data set Set_orig to be compressed; 3) cyclically shifting the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, ..., Key_r(n-1) having common prefixes, where n is the prefix length, and adding every gene sequence fragment sequence to the compression result set Set_comp in the form of its common prefix together with its cyclic shift count and suffix; 4) judging whether the data set Set_orig to be compressed is empty; if it is not empty, jumping back to step 2) to take out the next gene sequence fragment Key to be compressed; otherwise, outputting the compression result set Set_comp. The invention can improve query efficiency for large data volumes and has the advantages of strong compression capability and a small storage footprint.

Description

Key-value index data compression method for fast positioning of gene sequence fragments

Technical field

The present invention relates to bioinformatics analysis of gene sequencing data, and in particular to a key-value index data compression method for fast positioning of gene sequence fragments.

Background art

Positioning of sequenced reads is the basis of current high-throughput gene sequencing data analysis. Sequence fragments are usually matched with methods such as BWA, which perform best-effort string matching that tolerates some errors. Practical experiments show, however, that in most cases the majority of sequenced fragments can be broken up into shorter gene sequence fragments (36 bp) and matched exactly, accurately and quickly, by an exact Key-Value mapping method.

To let short gene sequences be matched against the reference chain quickly and accurately, a Key-Value index database must first be built from the reference chain data, as follows. Suppose the reference chain data are ACGTGCA and the database of key-value pairs (Key-Value pairs) for short-sequence matching is to be built from groups of 4 characters, as shown in Fig. 1. Referring to Fig. 1, a window of 4 characters is started at each character of the reference chain data in turn, which yields 4 Key-Value pairs as the data of the query database. If a sequenced short sequence is "GTGC", the Key-Value mapping quickly gives the Offset at which GTGC lies in the reference sequence, namely position 2.

This method has an important problem, however: the reference sequence chain used to build the database is usually long and may in practice exceed 2×10⁹ characters. If Key-Value pairs are built from fragments of 36 characters, the index data made up of the Keys alone amount to (2×10⁹ − 36) × 36 bytes ≈ 67.05 GB. Such a large index consumes a great deal of the computing system's memory and sharply lowers the cache hit rate of the Key-Value system; if memory is insufficient, page swapping also causes severe thrashing of system performance, so that exact matching, which ought to be very efficient, performs far worse in an engineering implementation.

The existing compressed prefix tree method can capture characters in the index data whose positions and values are both identical and merge them in an index tree, thereby reducing the size of the data index. Data compressed this way, however, must be queried through a tree structure, whose query efficiency is closely tied to the depth of the tree and the size of the data: as the data volume grows, the tree deepens and query efficiency drops markedly. In addition, the space taken by the many pointers needed to build the compressed prefix tree largely offsets the compression gained.
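The index construction described for the reference chain ACGTGCA can be sketched in a few lines of Python. This is only an illustration of the prior-art Key-Value index, not code from the patent: the function name `build_index` is hypothetical, offsets are assumed to be 0-based, and "GB" is assumed to mean 2^30 bytes, which is the reading that reproduces the 67.05 figure.

```python
def build_index(reference: str, k: int) -> dict:
    """Map every k-character window of `reference` to its 0-based offset."""
    index = {}
    for offset in range(len(reference) - k + 1):
        # setdefault keeps the first offset if the same window occurs twice
        index.setdefault(reference[offset:offset + k], offset)
    return index


index = build_index("ACGTGCA", 4)
print(index)          # {'ACGT': 0, 'CGTG': 1, 'GTGC': 2, 'TGCA': 3}
print(index["GTGC"])  # 2

# Size of the keys alone for a 2*10^9-character reference and 36-character keys
key_bytes = (2 * 10**9 - 36) * 36
key_gib = key_bytes / 2**30
print(key_gib)
```

The four windows correspond to the 4 Key-Value pairs mentioned above, and the last lines repeat the size estimate for the genome-scale case.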

Summary of the invention

The technical problem to be solved by the present invention: in view of the above problems of the prior art, to provide a key-value index data compression method for fast positioning of gene sequence fragments that improves query efficiency for large data volumes, compresses strongly and occupies little space.

To solve the above technical problem, the technical solution adopted by the present invention is as follows.

A key-value index data compression method for fast positioning of gene sequence fragments, the steps of which comprise:

1) initializing the compression result set Set_comp and setting the prefix length n used for data compression;

2) taking a current gene sequence fragment Key to be compressed out of the gene sequence data set Set_orig to be compressed;

3) cyclically shifting the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, ..., Key_r(n-1) having common prefixes, where n is the prefix length, and adding every gene sequence fragment sequence to the compression result set Set_comp in the form of its common prefix together with its cyclic shift count and suffix;

4) judging whether the data set Set_orig to be compressed is empty; if it is not empty, jumping back to step 2) to take out the next gene sequence fragment Key to be compressed; otherwise, outputting the compression result set Set_comp.

Preferably, the detailed steps of step 3) comprise:

3.1) cyclically shifting the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, ..., Key_r(n-1) having common prefixes, where n is the prefix length;

3.2) selecting a gene sequence fragment sequence Key_ri from the gene sequence fragment sequences Key_r0, Key_r1, ..., Key_r(n-1) as the current gene sequence fragment sequence;

3.3) splitting the current gene sequence fragment sequence Key_ri at the prefix length n into a prefix Prefix_ri and a suffix Postfix_ri, the lengths of the prefix Prefix_ri and the suffix Postfix_ri summing to the length SL of the current gene sequence fragment sequence Key_ri;

3.4) judging whether the mapping relation set corresponding to the prefix Prefix_ri already exists in the compression result set Set_comp; if it exists, jumping to step 3.5); otherwise, jumping to step 3.6);

3.5) judging whether the data <i, Postfix_ri> of the current gene sequence fragment sequence Key_ri exists in the mapping relation set corresponding to the prefix Prefix_ri; if it does not exist, adding the data <i, Postfix_ri> of the current gene sequence fragment sequence Key_ri to the mapping relation set corresponding to the prefix Prefix_ri, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is the suffix of the current gene sequence fragment sequence Key_ri, and jumping to step 3.7); otherwise, ignoring the current gene sequence fragment sequence Key_ri and the subsequent gene sequence fragment sequences having common prefixes, and jumping to step 4);

3.6) creating a new mapping relation Prefix_ri → {<i, Postfix_ri>} for the prefix Prefix_ri and adding it to the compression result set Set_comp, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is the suffix of the current gene sequence fragment sequence Key_ri, and jumping to step 3.7);

3.7) judging whether the gene sequence fragment sequences Key_r0, Key_r1, ..., Key_r(n-1) have all been processed; if not, selecting the next current gene sequence fragment sequence Key_ri and jumping to step 3.3); otherwise, jumping to step 4).
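Steps 3.1) to 3.7), together with the outer loop of steps 1), 2) and 4), can be read as the following loop. This is a minimal Python sketch of a literal reading of the steps, not the patented implementation: the names `rotate_left`, `compress_key` and `compress` are hypothetical, the compression result set is modeled as a dict that maps each prefix to a set of (i, postfix) pairs, and the cyclic shift is assumed to be the left shift that the text names as preferred.

```python
def rotate_left(seq: str, i: int) -> str:
    """Cyclically shift seq left by i positions."""
    i %= len(seq)
    return seq[i:] + seq[:i]


def compress_key(key: str, n: int, set_comp: dict) -> None:
    """Add one fragment to set_comp following steps 3.1) to 3.7)."""
    for i in range(n):                             # 3.1/3.2: shifts 0 .. n-1
        key_ri = rotate_left(key, i)
        prefix, postfix = key_ri[:n], key_ri[n:]   # 3.3: split at length n
        entry = (i, postfix)
        if prefix not in set_comp:                 # 3.4 -> 3.6: new mapping
            set_comp[prefix] = {entry}
        elif entry not in set_comp[prefix]:        # 3.5: add the missing entry
            set_comp[prefix].add(entry)
        else:                                      # 3.5 "otherwise": skip the rest
            break


def compress(set_orig, n: int) -> dict:
    set_comp = {}                                  # step 1
    for key in set_orig:                           # steps 2 and 4
        compress_key(key, n, set_comp)             # step 3
    return set_comp


set_comp = compress(["ACGT", "CGTG", "GTGC", "TGCA"], 2)
print(set_comp["TG"])
```

With the four 4-character keys of ACGTGCA and n = 2, the prefix TG ends up holding the pairs (0, "CA") and (1, "CG"), that is, the key TGCA unshifted and the key GTGC shifted left once.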

Preferably, the detailed steps of setting the prefix length n used for data compression in step 1) comprise:

1.1) constructing the compression ratio function f(n) of the prefix length n;

1.2) finding the prefix length n at which the value of the compression ratio function f(n) reaches its maximum.

Preferably, the compression ratio function constructed in step 1.1) is as shown in formula (1);

f(n) = \frac{TL \cdot SL \cdot b}{8 \cdot S(n)} \qquad (1)

In formula (1), f(n) is the compression ratio function, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in Set_orig, and S(n) is the byte estimation function giving the number of bytes occupied by the index data when the length of the prefix Prefix is n. The expression of the byte estimation function S(n) is as shown in formula (2);

S(n) = \frac{\left(\log_{2}(SL - n) + (SL - n) \cdot b\right) \cdot TL}{8} + \frac{n \cdot b \cdot TL}{8 \cdot (SL - n)} \qquad (2)

In formula (2), S(n), TL, SL and b have the same meanings as in formula (1), and n is the prefix length.

Preferably, the cyclic shift is a left cyclic shift.

Preferably, the prefix length n is 32.

The key-value index data compression method for fast positioning of gene sequence fragments of the present invention has the following advantages. The invention cyclically shifts the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, ..., Key_r(n-1) having common prefixes, where n is the prefix length, and adds every gene sequence fragment sequence to the compression result set Set_comp in the form of its common prefix together with its cyclic shift count and suffix. Each gene sequence fragment Key is cut into two parts, a prefix (Prefix) and a suffix (Postfix), and a certain number of cyclic shifts is applied to the fragment so that as many sequences with the same prefix as possible are captured among adjacent short fragment sequences. The prefix sequences of these gene sequence fragments Key are merged, and the merged prefix, together with the suffix sequence and the encoded cyclic shift count, jointly and uniquely represents one specific gene sequence fragment Key. This greatly reduces the storage space needed for the indexed short sequences. At the same time, because there are only two levels of sequence, prefix and suffix, the invention does not suffer from the defect of the traditional prefix compression tree, whose number of levels grows with the data scale. It can therefore improve query efficiency for large data volumes and has the advantages of strong compression capability and a small storage footprint.

Brief description of the drawings

Fig. 1 is a schematic diagram of how the prior art builds the key-value database of gene sequence fragments.

Fig. 2 is a flowchart of the method of the embodiment of the present invention.

Fig. 3 is a schematic diagram of how the embodiment of the present invention builds the key-value database of gene sequence fragments.

Fig. 4 is a flowchart of step 3) of the method of the embodiment of the present invention.

Detailed description of the embodiment

As shown in Fig. 2, the steps of the key-value index data compression method for fast positioning of gene sequence fragments of this embodiment comprise:

From the construction of the key-value database it can be seen that, because the gene sequence fragment Key data are obtained by starting at each character in turn and intercepting a sequence of a specific length, the short sequences (n characters) intercepted in adjacent passes actually share most of their characters. In this embodiment, each of the n gene sequence fragment sequences Key_r0, Key_r1, ..., Key_r(n-1) obtained by cyclic shifting is packaged as its common prefix together with its cyclic shift count and suffix and added to the compression result set Set_comp. The cyclic shift operator <<_rn is defined to denote cyclically shifting the sequence elements by n positions. As shown in Fig. 3, take as an example the short sequence strings around T, G, C, A intercepted in 3 adjacent passes from the data set Set_orig to be compressed: cyclically shifting the gene sequence fragments Key 0 to (n-1) times yields the gene sequence fragment sequences T, G, C, A; T, G, C, G; and T, G, C, G, so the gene sequence fragment sequences can be written as TG<<_r0CA, TG<<_r1CG and TG<<_r2CG respectively. The cyclic shift operator <<_rn carries the cyclic shift count, the TG in front of the operator is the common prefix, and what follows the operator is the suffix. Note that the gene sequence fragment Key of 4 bases is used here only as an illustration; gene sequence fragments with other numbers of bases may also be used as required, and since the principle is the same as in this embodiment it is not repeated here.

As can be seen from Fig. 3, when the cyclic shift count is 0, the suffix characters C and A both lay behind the common prefix TG before the shift; when the cyclic shift count is 1, the suffix character C lay behind the common prefix TG before the shift and the suffix character G lay in front of it; when the cyclic shift count is 2, the suffix characters C and G both lay in front of the common prefix TG before the shift. Based on this principle of cyclic shifting, the original gene sequence fragment can therefore be restored quickly from the compressed gene sequence fragment sequence. In this embodiment the cyclic shift is a left cyclic shift; the basic principle of a right cyclic shift is of course the same as that of a left cyclic shift, so its implementation details are not repeated here.
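The restoration described for Fig. 3 can be illustrated with a short Python sketch. It assumes, as in this embodiment, that compression used a left cyclic shift, so that restoring a fragment is a right cyclic shift by the same count; the names `rotate_right` and `restore` are hypothetical and do not come from the patent.

```python
def rotate_right(seq: str, i: int) -> str:
    """Cyclically shift seq right by i positions (inverse of a left shift by i)."""
    i %= len(seq)
    return seq[-i:] + seq[:-i] if i else seq


def restore(prefix: str, shift: int, postfix: str) -> str:
    """Rebuild the original fragment from prefix, shift count and suffix."""
    return rotate_right(prefix + postfix, shift)


# The three compressed forms TG<<_r0 CA, TG<<_r1 CG and TG<<_r2 CG
for shift, postfix in [(0, "CA"), (1, "CG"), (2, "CG")]:
    print(shift, restore("TG", shift, postfix))
# 0 TGCA
# 1 GTGC
# 2 CGTG
```

The restored strings agree with the positions described above: in GTGC the G precedes TG and the C follows it, and in CGTG both C and G precede TG.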

In this embodiment, the detailed steps of setting the prefix length n used for data compression in step 1) comprise:

1.1) constructing the compression ratio function f(n) of the prefix length n;

The compression ratio function constructed in step 1.1) is as shown in formula (1);

f(n) = \frac{TL \cdot SL \cdot b}{8 \cdot S(n)} \qquad (1)

S(n) = \frac{\left(\log_{2}(SL - n) + (SL - n) \cdot b\right) \cdot TL}{8} + \frac{n \cdot b \cdot TL}{8 \cdot (SL - n)} \qquad (2)

As shown in Fig. 3, the detailed steps of step 3) comprise:

In this embodiment, the data structure of the compression result set Set_comp is as follows:

{
prefix₁ → {<rotate₁, postfix₁>, <rotate₂, postfix₂>, …},
prefix₂ → {<rotate₃, postfix₃>, …},
…}

In the above data structure, prefix₁ is the common prefix of the first gene sequence fragment, and prefix₁ → {<rotate₁, postfix₁>, <rotate₂, postfix₂>, …} is the mapping relation set corresponding to the prefix prefix₁: rotate₁ is the cyclic shift count of the first gene sequence fragment sequence having the prefix prefix₁, postfix₁ is the suffix of the first gene sequence fragment sequence having the prefix prefix₁, rotate₂ is the cyclic shift count of the second gene sequence fragment sequence having the prefix prefix₁, and postfix₂ is the suffix of the second gene sequence fragment sequence having the prefix prefix₁. Likewise, prefix₂ is the common prefix of the second gene sequence fragment, and prefix₂ → {<rotate₃, postfix₃>, …} is the mapping relation set corresponding to the prefix prefix₂: rotate₃ is the cyclic shift count of the first gene sequence fragment sequence having the prefix prefix₂, and postfix₃ is the suffix of the first gene sequence fragment sequence having the prefix prefix₂.
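One possible in-memory form of this structure is a dictionary from prefix to a set of (rotate, postfix) pairs. The Python sketch below is an assumption made for illustration: the type alias `SetComp` and the function `contains` are hypothetical, the membership test is not described in the text, and the structure as described holds only the key part of each Key-Value pair. The sample content is the Fig. 3 example with the common prefix TG.

```python
from typing import Dict, Set, Tuple

# prefix -> {(rotate, postfix), ...}
SetComp = Dict[str, Set[Tuple[int, str]]]

set_comp: SetComp = {
    "TG": {(0, "CA"), (1, "CG"), (2, "CG")},
}


def contains(set_comp: SetComp, key: str, n: int) -> bool:
    """Hypothetical membership test: try each left cyclic shift of `key`
    and look for a matching (shift, suffix) pair under the shifted prefix."""
    for i in range(len(key)):
        shifted = key[i:] + key[:i]
        if (i, shifted[n:]) in set_comp.get(shifted[:n], ()):
            return True
    return False


print(contains(set_comp, "GTGC", 2))  # True: stored as TG<<_r1 CG
print(contains(set_comp, "ACGT", 2))  # False: no shift of ACGT is stored
```

Each test costs at most one dictionary access per shift, which reflects the two-level prefix and suffix layout rather than a tree walk.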

In this embodiment, the length of the data set Set_orig to be compressed is TL = 2×10⁹, the length of a gene sequence fragment Key is SL = 36, and each element of a gene sequence fragment Key occupies b = 2 bits of storage (because a valid reference sequence consists only of A, C, G and T). With the prefix length chosen as n = 32, the byte estimation function for the index data evaluates to S(n) = 6,500,000,000 bytes, i.e. 6.05 GB. Relative to the uncompressed data size (TL × SL × b / 8 = 2×10⁹ × 36 × 2 / 8 bytes = 16.76 GB), the method of this embodiment reaches a compression ratio of nearly 2.8. This embodiment can therefore improve query efficiency for large data volumes and has the advantages of strong compression capability and a small storage footprint.
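The figures quoted for this embodiment can be checked by evaluating formulas (1) and (2) directly. The Python sketch below is only a numerical check under the stated values TL = 2×10⁹, SL = 36, b = 2 and n = 32; the function names `index_bytes` and `compression_ratio` are hypothetical, and "GB" is assumed to mean 2^30 bytes, which is the reading that reproduces 6.05 and 16.76.

```python
import math


def index_bytes(n: int, TL: int, SL: int, b: int) -> float:
    """Formula (2): estimated bytes of index data for prefix length n."""
    m = SL - n
    return (math.log2(m) + m * b) * TL / 8 + n * b * TL / (8 * m)


def compression_ratio(n: int, TL: int, SL: int, b: int) -> float:
    """Formula (1): uncompressed bytes divided by S(n)."""
    return TL * SL * b / (8 * index_bytes(n, TL, SL, b))


TL, SL, b, n = 2 * 10**9, 36, 2, 32
s = index_bytes(n, TL, SL, b)
ratio = compression_ratio(n, TL, SL, b)
print(s)                         # 6500000000.0
print(s / 2**30)                 # about 6.05
print(TL * SL * b / 8 / 2**30)   # about 16.76
print(ratio)                     # about 2.77
```

The first term of S(n) contributes 2.5×10⁹ bytes for the shift counts and suffixes, and the second term contributes 4×10⁹ bytes for the shared prefixes.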

The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment; all technical solutions that fall under the idea of the present invention belong to its protection scope. It should be pointed out that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims

1. A key-value index data compression method for fast positioning of gene sequence fragments, characterized in that the steps comprise:

2. The key-value index data compression method for fast positioning of gene sequence fragments according to claim 1, characterized in that the detailed steps of step 3) comprise:

3. The key-value index data compression method for fast positioning of gene sequence fragments according to claim 2, characterized in that the detailed steps of setting the prefix length n used for data compression in step 1) comprise:

1.1) constructing the compression ratio function f(n) of the prefix length n;

4. The key-value index data compression method for fast positioning of gene sequence fragments according to claim 3, characterized in that the compression ratio function constructed in step 1.1) is as shown in formula (1);

f(n) = \frac{TL \cdot SL \cdot b}{8 \cdot S(n)} \qquad (1)

S(n) = \frac{\left(\log_{2}(SL - n) + (SL - n) \cdot b\right) \cdot TL}{8} + \frac{n \cdot b \cdot TL}{8 \cdot (SL - n)} \qquad (2)

5. The key-value index data compression method for fast positioning of gene sequence fragments according to any one of claims 1 to 4, characterized in that the cyclic shift is a left cyclic shift.

6. The key-value index data compression method for fast positioning of gene sequence fragments according to claim 5, characterized in that the prefix length n is 32.