CN105224828B - Key-value index data compression method for rapid positioning of gene sequence fragments - Google Patents
- Publication number
- CN105224828B (application CN201510648867.2A)
- Authority
- CN
- China
- Prior art keywords
- key
- prefix
- gene sequence
- sequence fragment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a key-value index data compression method for rapid positioning of gene sequence fragments. The steps are: 1) initialize the compression result set Set_comp and set the prefix length n used for data compression; 2) take one current gene sequence fragment Key to be compressed out of the gene sequence data set Set_orig; 3) cyclically shift the current fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) that share a common prefix, where n is the prefix length, and add every fragment sequence to Set_comp as its common prefix together with its cyclic shift count and suffix; 4) check whether Set_orig is empty: if it is not, jump to step 2) to take out the next fragment Key to be compressed; otherwise output Set_comp. The invention improves query efficiency for large data volumes and offers strong compression with a small storage footprint.
Description
Technical Field
The present invention relates to bioinformatic analysis of gene sequencing data, and in particular to a key-value index data compression method for rapid positioning of gene sequence fragments.
Background
Locating sequenced reads on a reference is the foundation of current high-throughput gene sequencing data analysis. Sequence fragments are usually aligned with methods such as BWA, which perform optimal string matching that tolerates some errors. Experiments show, however, that in most cases the majority of sequenced fragments can be broken into shorter gene sequence fragments (36 bp) and matched exactly, quickly, and completely through an exact Key-Value mapping.
For short gene sequences to be matched quickly and exactly against the reference strand, a Key-Value index database must first be built from the reference data. For example, suppose the reference data is ACGTGCA and a database of 4-character key-value pairs (Key-Value pairs) is needed for short-sequence matching, as shown in Fig. 1. Stepping through the reference one character at a time and taking 4 characters at each position yields 4 Key-Value pairs, which form the data of the query database. If sequencing produces the short sequence "GTGC", the Key-Value mapping immediately gives its Offset in the reference sequence as 2. This approach has one serious problem: the reference strand used as the database is usually long, in practice possibly more than 2*10^9 characters. If Key-Value pairs are built with 36-character fragments, the keys alone produce (2*10^9 - 36)*36 Bytes ≈ 67.05 GB of index data. An index this large consumes a great deal of memory and sharply lowers the cache hit rate of the Key-Value system; when memory is insufficient, page swapping also causes severe swings in system performance, so that exact matching, which should be very efficient, performs far worse in a real implementation. Existing compressed prefix tree methods capture characters that agree in both position and value across the index data and merge them in the index tree, which reduces the size of the index. However, data compressed this way must be queried through the tree structure, and query efficiency depends strongly on the depth of the tree and the volume of data: as the data grows the tree deepens and query efficiency drops markedly. In addition, the space taken by the many pointers needed to build the compressed prefix tree largely cancels the compression gain.
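The uncompressed index described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Fig. 1: the function name `build_index` is hypothetical, offsets are assumed to be 0-based (which matches the offset 2 given for "GTGC"), and the size estimate assumes one byte per character as in the text.

```python
def build_index(reference: str, k: int) -> dict[str, int]:
    """Map every k-character window of the reference to its 0-based offset."""
    index = {}
    for offset in range(len(reference) - k + 1):
        # setdefault keeps the first offset if a window occurs more than once
        index.setdefault(reference[offset:offset + k], offset)
    return index


index = build_index("ACGTGCA", 4)
print(index)          # four keys: ACGT, CGTG, GTGC, TGCA
print(index["GTGC"])  # 2

# Key storage estimate from the text: (2*10^9 - 36) keys of 36 one-byte characters
key_bytes = (2 * 10**9 - 36) * 36
print(key_bytes / 2**30)  # size expressed in units of 2**30 bytes
```

The last line expresses the key storage in units of 2^30 bytes, which is the unit in which the text's figure of about 67 GB holds.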
Summary of the Invention
Technical problem to be solved: in view of the above problems in the prior art, the invention provides a key-value index data compression method for rapid positioning of gene sequence fragments that improves query efficiency for large data volumes, compresses strongly, and occupies little space.
To solve this technical problem, the invention adopts the following technical solution:
A key-value index data compression method for rapid positioning of gene sequence fragments, comprising the steps of:
1) initializing the compression result set Set_comp and setting the prefix length n used for data compression;
2) taking one current gene sequence fragment Key to be compressed out of the gene sequence data set Set_orig;
3) cyclically shifting the current fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) with a common prefix, where n is the prefix length, and adding every fragment sequence to Set_comp as its common prefix together with its cyclic shift count and suffix;
4) checking whether Set_orig is empty: if it is not, jumping to step 2) to take out the next fragment Key to be compressed; otherwise outputting Set_comp.
Preferably, step 3) comprises the following detailed steps:
3.1) cyclically shift the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) with a common prefix, where n is the prefix length;
3.2) select one fragment sequence Key_ri from Key_r0, Key_r1, …, Key_r(n-1) as the current gene sequence fragment sequence;
3.3) split the current fragment sequence Key_ri at the prefix length n into a prefix Prefix_ri and a suffix Postfix_ri, the lengths of Prefix_ri and Postfix_ri summing to the length of Key_ri;
3.4) check whether a mapping relation set for the prefix Prefix_ri already exists in the compression result set Set_comp; if it does, jump to step 3.5); otherwise jump to step 3.6);
3.5) check whether the data <i, Postfix_ri> of the current fragment sequence Key_ri already exists in the mapping relation set of Prefix_ri; if it does not, add <i, Postfix_ri> to that set, where i is the number of cyclic shifts of Key_ri and Postfix_ri is the suffix of Key_ri, and jump to step 3.7); otherwise, skip the remaining fragment sequences that share the common prefix with Key_ri and jump to step 4);
3.6) create a new mapping relation Prefix_ri → {<i, Postfix_ri>} for the prefix Prefix_ri and add it to Set_comp, where i is the number of cyclic shifts of Key_ri and Postfix_ri is the suffix of Key_ri, and jump to step 3.7);
3.7) check whether all of Key_r0, Key_r1, …, Key_r(n-1) have been processed; if not, select the next current fragment sequence Key_ri and jump to step 3.3); otherwise jump to step 4).
Preferably, setting the prefix length n used for data compression in step 1) comprises the following detailed steps:
1.1) construct the compression ratio function f(n) of the prefix length n;
1.2) find the prefix length n at which the compression ratio function f(n) reaches its maximum.
Preferably, the compression ratio function constructed in step 1.1) is given by formula (1):
$$f(n) = \frac{TL * SL * b}{8 * S(n)} \tag{1}$$
In formula (1), f(n) is the compression ratio function, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a fragment Key, and S(n) is the estimated number of bytes occupied by the index data when the prefix Prefix has length n. S(n) is given by formula (2):
$$S(n) = \frac{\left(\log_2(SL - n) + (SL - n) * b\right) * TL}{8} + \frac{n * b * TL}{8 * (SL - n)} \tag{2}$$
In formula (2), S(n) is the estimated number of bytes occupied by the index data when the prefix Prefix has length n, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a fragment Key, and n is the prefix length.
Preferably, the cyclic shift is a cyclic left shift.
Preferably, the prefix length n is 32.
The key-value index data compression method for rapid positioning of gene sequence fragments of the invention has the following advantages. The invention cyclically shifts the current gene sequence fragment Key 0 to (n-1) times to form n fragment sequences Key_r0, Key_r1, …, Key_r(n-1) with a common prefix, where n is the prefix length, and adds every fragment sequence to the compression result set Set_comp as its common prefix together with its cyclic shift count and suffix. Each fragment Key is cut into two parts, a prefix (Prefix) and a suffix (Postfix). Applying a certain number of cyclic shifts to the fragments captures as many identical prefix sequences as possible among adjacent short fragment sequences; the prefix sequences of these fragments are merged, and the suffix sequence together with the encoded shift count uniquely identifies a specific fragment Key. This greatly reduces the storage needed to index the short sequences. At the same time, because there are only two levels, prefix and suffix, the invention avoids the defect of a traditional compressed prefix tree, whose number of levels grows with the data scale. It therefore improves query efficiency for large data volumes while offering strong compression and a small footprint.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of how the prior art builds a key-value pair database of gene sequence fragments.
Fig. 2 is a flow chart of the method of an embodiment of the invention.
Fig. 3 is a schematic diagram of how the embodiment of the invention builds a key-value pair database of gene sequence fragments.
Fig. 4 is a flow chart of step 3) of the method of the embodiment of the invention.
Detailed Description
As shown in Fig. 2, the key-value index data compression method for rapid positioning of gene sequence fragments of this embodiment comprises the steps of:
1) initializing the compression result set Set_comp and setting the prefix length n used for data compression;
2) taking one current gene sequence fragment Key to be compressed out of the gene sequence data set Set_orig;
3) cyclically shifting the current fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) with a common prefix, where n is the prefix length, and adding every fragment sequence to Set_comp as its common prefix together with its cyclic shift count and suffix;
4) checking whether Set_orig is empty: if it is not, jumping to step 2) to take out the next fragment Key to be compressed; otherwise outputting Set_comp.
The way the key-value pair database is built shows that, because the gene sequence fragment Key data are fixed-length sequences cut starting at successive characters, adjacent short sequences (of n characters) produced by repeated cutting in fact share most of their characters. In this embodiment, each of the n fragment sequences Key_r0, Key_r1, …, Key_r(n-1) obtained by cyclic shifting is packaged as a common prefix plus its cyclic shift count and suffix and added to the compression result set Set_comp. Define the cyclic shift operator <<Rn as cyclically shifting the sequence elements n times. As shown in Fig. 3, take the short strings cut at 3 adjacent positions of the data set Set_orig around T, G, C, A as an example: cyclically shifting the fragments 0 to (n-1) times gives the fragment sequences T, G, C, A; T, G, C, G; and T, G, C, G, which can therefore be written as TG<<R0CA, TG<<R1CG, and TG<<R2CG respectively. The operator <<Rn carries the cyclic shift count; the TG in front of <<Rn is the common prefix, and what follows <<Rn is the suffix. Note that the 4-base fragment Key is used here only as an illustration; fragments with other numbers of bases can be used as needed on the same principle, which is not repeated here.
As Fig. 3 shows, when the cyclic shift count is 0, the suffix characters C and A both lie behind the common prefix TG before shifting; when the count is 1, the suffix character C lies behind TG before shifting and the suffix character G lies in front of TG; when the count is 2, the suffix characters C and G both lie in front of TG before shifting. Based on this principle, the original gene sequence fragment can be restored quickly from the compressed fragment sequence. In this embodiment the cyclic shift is a cyclic left shift; a cyclic right shift works on the same basic principle, so its implementation details are not repeated here.
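The restoration described above can be illustrated with a short Python sketch. The helper names `rotate_left`, `rotate_right`, and `restore` are hypothetical, and the original fragments TGCA, GTGC, and CGTG are not listed in the text: they are inferred here from the stated positions of the suffix characters relative to TG, so treat them as one reading of Fig. 3.

```python
def rotate_left(seq: str, count: int) -> str:
    """Cyclic left shift of seq by count positions."""
    count %= len(seq)
    return seq[count:] + seq[:count]


def rotate_right(seq: str, count: int) -> str:
    """Cyclic right shift, the inverse of rotate_left."""
    count %= len(seq)
    split = len(seq) - count
    return seq[split:] + seq[:split]


def restore(prefix: str, shift: int, postfix: str) -> str:
    """Undo the left shift to recover the original fragment."""
    return rotate_right(prefix + postfix, shift)


# The three compressed entries TG<<R0 CA, TG<<R1 CG, TG<<R2 CG
entries = [("TG", 0, "CA"), ("TG", 1, "CG"), ("TG", 2, "CG")]
originals = [restore(prefix, shift, postfix) for prefix, shift, postfix in entries]
print(originals)  # ['TGCA', 'GTGC', 'CGTG']
```

Note that the second and third entries have the same prefix and suffix and differ only in the shift count, which is why the count has to be stored alongside the suffix.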
In this embodiment, setting the prefix length n used for data compression in step 1) comprises the following detailed steps:
1.1) construct the compression ratio function f(n) of the prefix length n;
1.2) find the prefix length n at which the compression ratio function f(n) reaches its maximum.
The compression ratio function constructed in step 1.1) is given by formula (1):
$$f(n) = \frac{TL * SL * b}{8 * S(n)} \tag{1}$$
In formula (1), f(n) is the compression ratio function, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a fragment Key, and S(n) is the estimated number of bytes occupied by the index data when the prefix Prefix has length n. S(n) is given by formula (2):
$$S(n) = \frac{\left(\log_2(SL - n) + (SL - n) * b\right) * TL}{8} + \frac{n * b * TL}{8 * (SL - n)} \tag{2}$$
In formula (2), S(n) is the estimated number of bytes occupied by the index data when the prefix Prefix has length n, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a fragment Key, and n is the prefix length.
As shown in Fig. 4, step 3) comprises the following detailed steps:
3.1) cyclically shift the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) with a common prefix, where n is the prefix length;
3.2) select one fragment sequence Key_ri from Key_r0, Key_r1, …, Key_r(n-1) as the current gene sequence fragment sequence;
3.3) split the current fragment sequence Key_ri at the prefix length n into a prefix Prefix_ri and a suffix Postfix_ri, the lengths of Prefix_ri and Postfix_ri summing to the length of Key_ri;
3.4) check whether a mapping relation set for the prefix Prefix_ri already exists in the compression result set Set_comp; if it does, jump to step 3.5); otherwise jump to step 3.6);
3.5) check whether the data <i, Postfix_ri> of the current fragment sequence Key_ri already exists in the mapping relation set of Prefix_ri; if it does not, add <i, Postfix_ri> to that set, where i is the number of cyclic shifts of Key_ri and Postfix_ri is the suffix of Key_ri, and jump to step 3.7); otherwise, skip the remaining fragment sequences that share the common prefix with Key_ri and jump to step 4);
3.6) create a new mapping relation Prefix_ri → {<i, Postfix_ri>} for the prefix Prefix_ri and add it to Set_comp, where i is the number of cyclic shifts of Key_ri and Postfix_ri is the suffix of Key_ri, and jump to step 3.7);
3.7) check whether all of Key_r0, Key_r1, …, Key_r(n-1) have been processed; if not, select the next current fragment sequence Key_ri and jump to step 3.3); otherwise jump to step 4).
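The control flow of steps 3.1) to 3.7), wrapped in the outer loop of steps 2) and 4), can be sketched in Python as follows. This is a literal reading of the steps under stated assumptions rather than a reference implementation: the names `compress`, `rotate_left`, and `set_comp` are hypothetical, Set_comp is modelled as a dictionary from prefix to a set of (shift count, suffix) pairs, and, as in the text, n serves both as the prefix length and as the number of shifts tried.

```python
def rotate_left(seq: str, count: int) -> str:
    """Cyclic left shift of seq by count positions."""
    count %= len(seq)
    return seq[count:] + seq[:count]


def compress(keys, n):
    """Build prefix -> {(shift count, suffix)} following steps 3.1) to 3.7)."""
    set_comp = {}                                  # step 1): empty result set
    for key in keys:                               # steps 2) and 4): next Key
        for i in range(n):                         # steps 3.1), 3.2), 3.7)
            rotated = rotate_left(key, i)
            prefix, postfix = rotated[:n], rotated[n:]   # step 3.3)
            entries = set_comp.get(prefix)         # step 3.4)
            if entries is None:
                set_comp[prefix] = {(i, postfix)}  # step 3.6): new mapping
            elif (i, postfix) in entries:
                break                              # step 3.5): skip the rest
            else:
                entries.add((i, postfix))          # step 3.5): add the pair
    return set_comp


# Three adjacent 4-character windows of the string CGTGCA, with n = 2
result = compress(["CGTG", "GTGC", "TGCA"], 2)
for prefix, entries in sorted(result.items()):
    print(prefix, sorted(entries))
```

With this input the prefix TG collects the pairs (0, CA) and (1, CG), which shows two adjacent windows sharing one stored prefix.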
In this embodiment, the data structure of the compression result set Set_comp is as follows:
{
prefix1→{<rotate1,postfix1>,<rotate2,postfix2>,...},
prefix2→{<rotate3,postfix3>,...},
…}
In the above data structure, prefix1 is the common prefix of the first gene sequence fragment, and prefix1→{<rotate1,postfix1>,<rotate2,postfix2>,...} is the mapping relation set corresponding to the prefix prefix1; rotate1 is the cyclic shift count of the first gene sequence fragment sequence with prefix prefix1, and postfix1 is the suffix of that sequence; rotate2 is the cyclic shift count of the second gene sequence fragment sequence with prefix prefix1, and postfix2 is the suffix of that sequence. prefix2 is the common prefix of the second gene sequence fragment, and prefix2→{<rotate3,postfix3>,...} is the mapping relation set corresponding to the prefix prefix2; rotate3 is the cyclic shift count of the first gene sequence fragment sequence with prefix prefix2, and postfix3 is the suffix of that sequence.
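The text does not spell out how this structure is queried, so the following Python sketch is only an assumption about one possible membership check. The function name `contains` and the parameter `shifts` (how many shift counts to try) are hypothetical, and the sample data reuses the three Fig. 3 entries with prefix TG.

```python
def rotate_left(seq: str, count: int) -> str:
    """Cyclic left shift of seq by count positions."""
    count %= len(seq)
    return seq[count:] + seq[:count]


def contains(set_comp, key, n, shifts):
    """Return True if key is stored under any shift count 0..shifts-1."""
    for i in range(shifts):
        rotated = rotate_left(key, i)
        prefix, postfix = rotated[:n], rotated[n:]
        # Two hash lookups per shift: one on the prefix, one on the pair
        if (i, postfix) in set_comp.get(prefix, set()):
            return True
    return False


# prefix -> {(shift count, suffix)} for TG<<R0 CA, TG<<R1 CG, TG<<R2 CG
set_comp = {"TG": {(0, "CA"), (1, "CG"), (2, "CG")}}

found = [contains(set_comp, key, 2, 3) for key in ("TGCA", "GTGC", "CGTG")]
missing = [contains(set_comp, key, 2, 3) for key in ("ACGT", "TGCG")]
print(found, missing)
```

The query TGCG is rejected even though its prefix and suffix both appear in the structure, because no entry pairs them with shift count 0.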
In this embodiment, the length of the data set Set_orig to be compressed is TL = 2*10^9, the gene sequence fragment Key length is SL = 36, and each element of a fragment Key occupies b = 2 bits of storage (because a reference sequence is composed only of A, C, G, and T). Choosing the prefix length n = 32 for data compression, the estimated size of the index data for a prefix of length 32 is S(32) = 6500000000 Bytes, i.e. 6.05 GB. Compared with the uncompressed data size (TL*SL*b/8 = 2*10^9 × 36 × 2/8 Bytes = 16.76 GB), the method of this embodiment reaches a compression ratio of nearly 2.8, so the embodiment improves query efficiency for large data volumes and offers strong compression with a small footprint.
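The arithmetic above can be checked by evaluating formulas (1) and (2) as given in claim 4. The following Python sketch uses hypothetical function names and assumes that candidate prefix lengths are the integers from 1 to SL-1.

```python
import math


def index_bytes(n, tl, sl, b):
    """Formula (2): estimated bytes of index data for prefix length n."""
    rest = sl - n  # suffix length
    return (math.log2(rest) + rest * b) * tl / 8 + n * b * tl / (8 * rest)


def compression_ratio(n, tl, sl, b):
    """Formula (1): uncompressed size divided by the estimate S(n)."""
    return tl * sl * b / (8 * index_bytes(n, tl, sl, b))


TL, SL, B = 2 * 10**9, 36, 2

s32 = index_bytes(32, TL, SL, B)
f32 = compression_ratio(32, TL, SL, B)
print(s32)   # 6500000000.0 bytes
print(f32)   # about 2.77

# Step 1.2): scan integer prefix lengths for the largest f(n)
best_n = max(range(1, SL), key=lambda n: compression_ratio(n, TL, SL, B))
print(best_n)
```

Taken literally over integers, the scan peaks slightly below 32 (at n = 30 for these parameters), while n = 32 still gives the ratio of about 2.8 quoted above. The text does not say why 32 is preferred; one possible reading, which is an assumption here, is that 32 two-bit bases fill exactly one 64-bit word.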
The above is only a preferred embodiment of the invention, and the scope of protection of the invention is not limited to this embodiment; all technical solutions within the idea of the invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the invention should also be regarded as falling within the scope of protection of the invention.
Claims (6)
1. A key-value index data compression method for rapid positioning of gene sequence fragments, characterized in that the steps comprise:
1) initializing the compression result set Set_comp and setting the prefix length n used for data compression;
2) taking one current gene sequence fragment Key to be compressed out of the gene sequence data set Set_orig;
3) cyclically shifting the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) with a common prefix, where n is the prefix length, and adding every gene sequence fragment sequence to the compression result set Set_comp as its common prefix together with its cyclic shift count and suffix;
4) checking whether the data set Set_orig to be compressed is empty: if it is not, jumping to step 2) to take out the next gene sequence fragment Key to be compressed; otherwise outputting the compression result set Set_comp.
2. The key-value index data compression method for rapid positioning of gene sequence fragments according to claim 1, characterized in that step 3) comprises the following detailed steps:
3.1) cyclically shifting the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) with a common prefix, where n is the prefix length;
3.2) selecting one gene sequence fragment sequence Key_ri from Key_r0, Key_r1, …, Key_r(n-1) as the current gene sequence fragment sequence;
3.3) splitting the current gene sequence fragment sequence Key_ri at the prefix length n into a prefix Prefix_ri and a suffix Postfix_ri, the lengths of Prefix_ri and Postfix_ri summing to the length of Key_ri;
3.4) checking whether a mapping relation set for the prefix Prefix_ri already exists in the compression result set Set_comp; if it does, jumping to step 3.5); otherwise jumping to step 3.6);
3.5) checking whether the data <i, Postfix_ri> of the current gene sequence fragment sequence Key_ri already exists in the mapping relation set of the prefix Prefix_ri; if it does not, adding <i, Postfix_ri> to that set, where i is the number of cyclic shifts of Key_ri and Postfix_ri is the suffix of Key_ri, and jumping to step 3.7); otherwise, skipping the remaining gene sequence fragment sequences that share the common prefix with Key_ri and jumping to step 4);
3.6) creating a new mapping relation Prefix_ri → {<i, Postfix_ri>} for the prefix Prefix_ri and adding it to the compression result set Set_comp, where i is the number of cyclic shifts of Key_ri and Postfix_ri is the suffix of Key_ri, and jumping to step 3.7);
3.7) checking whether all of Key_r0, Key_r1, …, Key_r(n-1) have been processed; if not, selecting the next current gene sequence fragment sequence Key_ri and jumping to step 3.3); otherwise jumping to step 4).
3. The key-value index data compression method for rapid positioning of gene sequence fragments according to claim 2, characterized in that setting the prefix length n used for data compression in step 1) comprises the following detailed steps:
1.1) constructing the compression ratio function f(n) of the prefix length n;
1.2) finding the prefix length n at which the compression ratio function f(n) reaches its maximum.
4. The key-value index data compression method for rapid positioning of gene sequence fragments according to claim 3, characterized in that the compression ratio function constructed in step 1.1) is given by formula (1):
$$f(n) = \frac{TL * SL * b}{8 * S(n)} \tag{1}$$
In formula (1), f(n) is the compression ratio function, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in Set_orig, and S(n) is the estimated number of bytes occupied by the index data when the prefix Prefix has length n; S(n) is given by formula (2):
$$S(n) = \frac{\left(\log_2(SL - n) + (SL - n) * b\right) * TL}{8} + \frac{n * b * TL}{8 * (SL - n)} \tag{2}$$
In formula (2), S(n) is the estimated number of bytes occupied by the index data when the prefix Prefix has length n, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in Set_orig, and n is the prefix length.
5. The key-value index data compression method for rapid positioning of gene sequence fragments according to any one of claims 1 to 4, characterized in that the cyclic shift is a cyclic left shift.
6. The key-value index data compression method for rapid positioning of gene sequence fragments according to claim 5, characterized in that the prefix length n is 32.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510648867.2A CN105224828B (en) | 2015-10-09 | 2015-10-09 | Key-value index data compression method for rapid positioning of gene sequence fragments |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510648867.2A CN105224828B (en) | 2015-10-09 | 2015-10-09 | Key-value index data compression method for rapid positioning of gene sequence fragments |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105224828A CN105224828A (en) | 2016-01-06 |
CN105224828B (en) | 2017-10-27 |
Family
ID=54993793
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510648867.2A Active CN105224828B (en) | 2015-10-09 | 2015-10-09 | Key-value index data compression method for rapid positioning of gene sequence fragments |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105224828B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105930104B (en) * | 2016-05-17 | 2019-01-18 | 百度在线网络技术(北京)有限公司 | Date storage method and device |
CN106484865A (en) * | 2016-10-10 | 2017-03-08 | 哈尔滨工程大学 | One kind is based on four word chained list dictionary tree searching algorithm of DNA k mer index problem |
CN106897582B (en) * | 2017-01-25 | 2018-03-09 | 人和未来生物科技(长沙)有限公司 | A kind of heterogeneous platform understood towards gene data |
CN110428868B (en) * | 2018-04-27 | 2021-11-26 | 人和未来生物科技(长沙)有限公司 | Method and system for compressing, preprocessing and decompressing and reducing gene sequencing mass data |
CN110060731B (en) * | 2019-04-12 | 2022-10-21 | 福建师范大学 | Method for determining number of overlapped gene pairs among genes based on distributed calculation |
CN110782946A (en) * | 2019-10-17 | 2020-02-11 | 南京医基云医疗数据研究院有限公司 | Method and device for identifying repeated sequence, storage medium and electronic equipment |
CN112765113B (en) * | 2021-01-31 | 2024-04-09 | 云知声智能科技股份有限公司 | Index compression method, index compression device, computer readable storage medium and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101036141A (en) * | 2004-03-26 | 2007-09-12 | 甲骨文国际有限公司 | A database management system with persistent, user- accessible bitmap values |
CN101499094A (en) * | 2009-03-10 | 2009-08-05 | 焦点科技股份有限公司 | Data compression storing and retrieving method and system |
CN102831224A (en) * | 2012-08-24 | 2012-12-19 | 北京百度网讯科技有限公司 | Creating method for data index base and searching suggest generation method and device |
CN103870492A (en) * | 2012-12-14 | 2014-06-18 | 腾讯科技(深圳)有限公司 | Data storing method and device based on key sorting |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9715525B2 (en) * | 2013-06-28 | 2017-07-25 | Khalifa University Of Science, Technology And Research | Method and system for searching and storing data |
Also Published As
Publication number | Publication date |
---|---|
CN105224828A (en) | 2016-01-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105224828B (en) | Key-value index data compression method for rapid positioning of gene sequence fragments | |
CN110413611B (en) | Data storage and query method and device | |
US20200285634A1 (en) | System for data sharing platform based on distributed data sharing environment based on block chain, method of searching for data in the system, and method of providing search index in the system | |
CN103703467B (en) | Method and apparatus for storing data | |
US10938961B1 (en) | Systems and methods for data deduplication by generating similarity metrics using sketch computation | |
EP3072076B1 (en) | A method of generating a reference index data structure and method for finding a position of a data pattern in a reference data structure | |
CN106326475B (en) | Efficient static hash table implementation method and system | |
CN106528647B (en) | One kind carrying out the matched method of term based on cedar even numbers group dictionary tree algorithm | |
CN103189867A (en) | Duplicated data search method and equipment | |
US9953058B1 (en) | Systems and methods for searching large data sets | |
CN104268157A (en) | Device and method for error correction in data search | |
CN105677683A (en) | Batch data query method and device | |
CN111801665A (en) | Hierarchical Locality Sensitive Hash (LSH) partition indexing for big data applications | |
US11275731B2 (en) | Accelerated filtering, grouping and aggregation in a database system | |
US8271500B2 (en) | Minimal perfect hash functions using double hashing | |
CN104618361A (en) | Network stream data reordering method | |
Hon et al. | Towards an optimal space-and-query-time index for top-k document retrieval | |
CN113468571B (en) | Source tracing method based on block chain | |
KR20230170891A (en) | In-memory efficient multistep search | |
CN112800067A (en) | Range query method and device, computer readable storage medium and electronic equipment | |
CN104794129A (en) | Data processing method and system based on query logs | |
CN116521733A (en) | Data query method and device | |
CN102456073A (en) | Partial extremum inquiry method | |
CN104750846A (en) | Method and device for finding substring | |
CN107203550B (en) | Data processing method and database server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |