CN104284189B - An improved BWT data compression method and hardware implementation system thereof - Google Patents
- Publication number
- CN104284189B, CN201410571262.3A
- Authority
- CN
- China
- Prior art keywords
- lyndon
- word
- character string
- module
- length
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The present invention provides a hardware implementation system for the BWTS method, comprising: an input buffer module for temporarily storing the string to be processed and synchronizing data input with data processing; a Lyndon word search module for finding the Lyndon words of a data block; a Lyndon word cache module for temporarily storing the Lyndon words; a Lyndon word length cache module for temporarily storing the lengths and number of all Lyndon words; a transposition module for transposing all Lyndon words; a transposition cache module for temporarily storing the transposition result; a sorting module for sorting all transposed strings in lexicographic order and taking the last column as the output of the BWTS method; and an output buffer module for temporarily storing the output string for use by subsequent modules. The improved BWT data compression method and hardware implementation system disclosed by the invention remove the requirement in existing BWT methods that string recovery depend on a constant generated by the forward transform, thereby improving the operating efficiency of the data compression method.
Description
Technical field
The present invention relates to the technical field of data compression, and more particularly to an improved BWT data compression method and a hardware implementation system thereof.
Background art
Data compression has long been a research focus in information science and is widely applied in data storage and transmission. Although the capacity of storage devices keeps expanding and network transfer speeds keep increasing, the diversity and explosive growth of data make efficient compression an important means of reducing storage and transmission costs. Data compression is divided into lossless compression and lossy compression. Lossy compression tolerates a certain degree of information loss and is widely used in multimedia interactive systems, video transmission services, home entertainment and similar fields. Lossless compression is a reversible coding based on the principle of information entropy: it removes redundancy from the source without affecting the information entropy, so that the compressed information can be restored. It is widely used in remote sensing image processing, medical image processing, historical archive preservation and analysis, and many hybrid image compression methods. Removing as much redundancy as possible is the goal of lossless compression. The main performance metrics for compression methods are currently compression ratio and compression speed. BWT is a transform that Mike Burrows developed from an idea proposed by David Wheeler, refined, and successfully applied to practical data compression; it is a current research focus in lossless compression. BWT is a reversible data transform that operates on data blocks. Its core idea is to sort and transform the character matrix obtained by rotating the input string. The transform itself does not reduce the amount of data, but the transformed data is easier to compress, so BWT serves as a preprocessing step before compression.
Fig. 1 shows a prior-art data compression system, the efficient open-source Bzip2, which is based on the BWT method. As shown in Fig. 1, runs of identical characters appear in the string S after the BWT method is applied; after subsequent processing by the MTF (move-to-front) method, the result consists of runs of 0 and a series of small integers, which further reduces the overall entropy; finally, Huffman coding compresses the data in the form of a binary tree with minimum weighted path length, yielding a higher compression ratio. In addition, because of the similarity between BWT and the suffix array, BWT is used for string matching in the FM-index method. The BWT method gives the data within a character block stronger clustering, i.e., identical characters gather together, which allows subsequent compression methods to achieve a better compression ratio. The method removed the restriction that compression methods must process data in a streaming model and made block-wise processing possible, which was a revolutionary advance in the field of lossless compression. BWT methods are also applied in bioinformatics, for whole-genome alignment, genome annotation, and distance measurement between two genome sequences, and are often used for channel coding in communication systems.
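As a minimal illustrative sketch of the MTF step mentioned above (not taken from the patent or from the Bzip2 sources), the following Python function encodes a string in which identical characters are already clustered. The function name `mtf_encode` and the two-symbol alphabet are assumptions chosen only for illustration.

```python
def mtf_encode(text, alphabet):
    """Move-to-front: emit each symbol's current index, then move it to the front."""
    table = list(alphabet)  # hypothetical initial symbol table
    codes = []
    for ch in text:
        i = table.index(ch)
        codes.append(i)
        table.insert(0, table.pop(i))  # move the symbol just seen to the front
    return codes


# Clustered input, as typically produced by a BWT stage
print(mtf_encode("aaabbbaaa", "ab"))  # [0, 0, 0, 1, 0, 0, 1, 0, 0]
```

The output is dominated by zeros, which is the property the subsequent Huffman stage exploits.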
Fig. 2 shows a schematic diagram of prior-art data compression based on the BWT method. As shown in Fig. 2, the figure illustrates the basic principle of the BWT method, which processes data block by block. Suppose the input is a string (block) of length n, S = 'ABRACA'. Cyclically shifting S forms an n*n matrix M, and sorting the rows of M in lexicographic order gives the matrix Q. The last column of Q is the output sequence L = 'CARAAB', and the position (row number, counting from 0) of the source string in Q is the output constant index = 1. In many applications, however, the BWT method is used as an early processing stage: for a string of length n, BWT produces a string of length n plus one constant. The presence of this constant is inconvenient for many subsequent processing steps. For example, when BWT is used for channel coding and the constant is lost because of noise, the string cannot be recovered. When BWT is used for lossless compression, a string of length n becomes a string of length n+1 after processing, which changes the entropy of the string. At present, no research on BWT methods without the suffix constant has been found at home or abroad.
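The prior-art transform described above can be sketched in a few lines of Python. This is a naive illustration that materializes all rotations, not an efficient implementation; the function name `bwt` and the 0-based row index are assumptions made for the example.

```python
def bwt(s):
    """Classic BWT: sort all rotations of s, return (last column, row index of s)."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))  # matrix Q
    last_column = "".join(row[-1] for row in rotations)
    return last_column, rotations.index(s)


print(bwt("ABRACA"))  # ('CARAAB', 1)
```

The second element of the result is the extra constant that the present method aims to eliminate.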
In view of this, and given the problems of current BWT methods, it is necessary to propose an improved BWT data compression method that removes the requirement in existing BWT methods that string recovery depend on a constant generated by the forward transform, so as to improve the operating efficiency of the data compression method.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention proposes an improved BWT data compression method that removes the requirement in existing BWT methods that string recovery depend on a constant generated by the forward transform, so as to improve the operating efficiency of the data compression method.
To achieve the above object, the present invention provides a BWTS method and a hardware implementation system thereof, comprising: an input buffer module for temporarily storing the string to be processed, synchronizing data input with data processing, and outputting the string to the Lyndon word search module after processing; a Lyndon word search module for finding the longest Lyndon words in the string from the input buffer module, outputting each longest Lyndon word found to the Lyndon word cache module, and outputting the length of each longest Lyndon word to the Lyndon word length cache module; a Lyndon word cache module for temporarily storing the Lyndon words output by the Lyndon word search module for use by the transposition module; a Lyndon word length cache module for temporarily storing the lengths and number of all Lyndon words found by the Lyndon word search module for use by the sorting module; a transposition module for transposing all Lyndon words found by the Lyndon word search module and storing the transposition result in the transposition cache module; a transposition cache module for temporarily storing the transposition result output by the transposition module for use by the sorting module; a sorting module for sorting all strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module; and an output buffer module for temporarily storing the output string for use by subsequent modules.
The Lyndon word search module further comprises: a string fetch submodule for fetching characters from the input buffer module and recording the length of the string fetched so far, wherein characters are read in one at a time starting from the first character of the string, each time a character is added the string is passed to the subsequent modules for Lyndon word judgment, and when a Lyndon word is found the string length is input to the Lyndon word length cache module, the length is reset to zero, and the next fetch starts from the last character fetched this time; a shift submodule for inputting the string from the string fetch submodule to the Lyndon word judging submodule, shifting the string successively, and inputting all shifted strings to the N*N register; an N*N register for storing all shifted strings of the string under test from the shift submodule, for processing by the Lyndon word judging submodule; and a Lyndon word judging submodule for successively taking the shifted strings out of the N*N register and comparing them with the original string, wherein the number of comparisons equals the length of the string under test, and if the comparison shows that the original string is lexicographically smallest, the string is a Lyndon word and is output to the Lyndon word cache module.
The transposition module further comprises: a longest Lyndon word length determination submodule for determining, from the contents of the Lyndon word length cache module, the length of the longest Lyndon word of the processed string and passing this value to the string extension submodule; a string extension submodule for extending all Lyndon words in the Lyndon word cache module to the length of the longest Lyndon word for use by the cyclic shift submodule; and a cyclic shift submodule for cyclically shifting, one after another, the Lyndon words from the string extension submodule and storing them in the transposition cache module.
The sorting module comprises: a sorting submodule for sorting the strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module; and a BWTS result acquisition module for reading out the last column of the sorting submodule's result as the output of the BWTS method and storing it in the output buffer module.
To achieve the above object, the present invention also provides an improved BWT data compression method, comprising: inputting a string, temporarily storing the string to be processed in the input buffer module, synchronizing data input with data processing, and outputting the string to the Lyndon word search module after processing; finding, by the Lyndon word search module, the longest Lyndon words in the string from the input buffer module, outputting each longest Lyndon word found to the Lyndon word cache module, and outputting the length of each longest Lyndon word to the Lyndon word length cache module; temporarily storing, by the Lyndon word cache module, the Lyndon words output by the Lyndon word search module for use by the transposition module; temporarily storing, by the Lyndon word length cache module, the lengths and number of all Lyndon words found by the Lyndon word search module for use by the sorting module; transposing, by the transposition module, all Lyndon words found by the Lyndon word search module and storing the transposition result in the transposition cache module; temporarily storing, by the transposition cache module, the transposition result output by the transposition module for use by the sorting module; sorting, by the sorting module, all strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module; and temporarily storing, by the output buffer module, the output string for use by subsequent modules.
Finding the Lyndon words of the data block further comprises: fetching characters from the input buffer module and recording the length of the string fetched so far, wherein characters are read in one at a time starting from the first character of the string, each time a character is added the string is passed to the subsequent modules for Lyndon word judgment, and when a Lyndon word is found the string length is input to the Lyndon word length cache module, the length is reset to zero, and the next fetch starts from the last character fetched this time; inputting the string from the string fetch submodule to the Lyndon word judging submodule, shifting the string successively, and inputting all shifted strings to the N*N register; storing, in the N*N register, all shifted strings of the string under test from the shift submodule, for processing by the Lyndon word judging submodule; and successively taking the shifted strings out of the N*N register and comparing them with the original string, wherein the number of comparisons equals the length of the string under test, and if the comparison shows that the original string is lexicographically smallest, the string is a Lyndon word and is output to the Lyndon word cache module.
Transposing all Lyndon words further comprises: determining, from the contents of the Lyndon word length cache module, the length of the longest Lyndon word of the processed string and passing this value to the string extension submodule; extending all Lyndon words in the Lyndon word cache module to the length of the longest Lyndon word for use by the cyclic shift submodule; and cyclically shifting, one after another, the Lyndon words from the string extension submodule and storing them in the transposition cache module.
Sorting all transposed strings in lexicographic order further comprises: sorting the strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module; and reading out the last column of the sorting submodule's result as the output of the BWTS method and storing it in the output buffer module.
The improved BWT data compression method and hardware implementation system disclosed by the invention remove the requirement in existing BWT methods that string recovery depend on a constant generated by the forward transform, thereby improving the operating efficiency of the data compression method.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
Fig. 1 shows a prior-art data compression system, the efficient open-source Bzip2, based on the BWT method;
Fig. 2 shows a schematic diagram of prior-art data compression based on the BWT method;
Fig. 3 shows a schematic diagram of the Lyndon word partition;
Fig. 4 shows the hardware implementation system of the improved BWT data compression method provided by the present invention;
Fig. 5 shows a structural diagram of the Lyndon word search module of the hardware implementation system provided by the present invention;
Fig. 6 shows a structural diagram of the transposition module of the hardware implementation system provided by the present invention;
Fig. 7 shows a structural diagram of the sorting module of the hardware implementation system provided by the present invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting the claims.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in this specification refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" as used herein includes any unit and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art and, unless defined as herein, are not to be interpreted in an idealized or overly formal sense.
The present invention proposes an improved BWT data compression method (referred to as the "BWTS method") comprising two parts, Lyndon word partition and transposition. The specific method is as follows:
1. Longest Lyndon word partition
Lyndon words were introduced in 1954 by the mathematician Roger Lyndon, who called them standard lexicographic sequences. A Lyndon word is a string that is lexicographically smallest among all of its cyclic shifts. (Lexicographic order, also called dictionary order, arranges sequences in ascending order: alphabetically for letters, or from small to large for numbers.)
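The definition can be checked directly by comparing a string with each of its rotations. The sketch below is a naive test written for clarity; the name `is_lyndon` is hypothetical, and the use of a strict comparison (so that a repeated string such as 'aa' is not a Lyndon word) follows the standard definition and is an assumption, since the text does not say how ties are treated.

```python
def is_lyndon(w):
    """True if w is strictly smaller than every nontrivial rotation of itself."""
    return len(w) > 0 and all(w < w[i:] + w[:i] for i in range(1, len(w)))


print(is_lyndon("an"))   # True:  'an' < 'na'
print(is_lyndon("ana"))  # False: the rotation 'aan' is smaller
print(is_lyndon("aa"))   # False under the strict comparison assumed here
```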
Fig. 3 shows a schematic diagram of the Lyndon word partition. As shown in Fig. 3, the longest Lyndon word in the sense of the present invention is found by starting from the first character of the string and extending toward the end of the string until the longest Lyndon word is found; the search for the next longest Lyndon word then starts from the character following it, and so on until the end of the string. The longest Lyndon word partition is described in detail using the string 'banana' as an example. For S = 'banana', the character 'b' is read first. A single character is clearly a Lyndon word, so the search continues for the longest Lyndon word starting with that character. The character 'a' is read next, and 'ba' is clearly not a Lyndon word. Once a string that is not a Lyndon word is detected, strings starting with 'b' need not be examined further, so 'b' is the longest Lyndon word starting with 'b'. The next step detects the longest Lyndon word starting from 'a', the character after 'b'. 'a' is clearly a Lyndon word; 'n' is read and 'an' is also a Lyndon word; 'a' is read again and 'ana' is not a Lyndon word, so 'an' is the longest Lyndon word starting with this 'a'. The next step starts from the character 'a' that follows the longest Lyndon word 'an'; as in the previous detection, 'an' is the longest Lyndon word starting from this 'a'. The final step starts from the character 'a' that follows this Lyndon word; since the string ends there, 'a' is the longest Lyndon word. Thus, for the input string S = 'banana', the Lyndon word output is 'b', 'an', 'an', 'a'.
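The scan described above can be sketched as follows. This is a minimal illustration of the stopping rule exactly as the text states it (grow the candidate one character at a time and stop at the first prefix that is not a Lyndon word); the function names are hypothetical, and the strict rotation comparison in `is_lyndon` is an assumption.

```python
def is_lyndon(w):
    # Strictly smaller than every nontrivial rotation (assumed tie handling)
    return all(w < w[i:] + w[:i] for i in range(1, len(w)))


def greedy_lyndon_partition(s):
    """Split s into words by extending each word while it remains a Lyndon word."""
    words = []
    start = 0
    while start < len(s):
        end = start + 1  # a single character is always a Lyndon word
        while end < len(s) and is_lyndon(s[start:end + 1]):
            end += 1
        words.append(s[start:end])
        start = end  # resume at the character that ended the previous word
    return words


print(greedy_lyndon_partition("banana"))  # ['b', 'an', 'an', 'a']
```

Because the scan stops at the first failing prefix, it is not guaranteed to coincide with the standard Lyndon factorization on every input: for 'aab' this sketch yields 'a', 'ab', although 'aab' is itself a Lyndon word.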
2. Shift transposition and sorting
Shift transposition brings every longest Lyndon word produced by the partition to the same length. Take S = 'banana' as an example: its longest Lyndon word partition is 'b', 'an', 'an', 'a', in which the longest string is 'an', of length 2. Strings shorter than 2 are extended to length 2 by cyclic repetition, i.e., 'b' is extended to 'bb' and 'a' to 'aa', while the longest string 'an' is cyclically shifted to form a 2*2 matrix, i.e., 'an', 'na'. The output of shift transposition is therefore 'bb', 'an', 'na', 'an', 'na', 'aa'.
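A possible sketch of the shift transposition step is given below. The names `extend` and `shift_transpose` are hypothetical. Two details are assumptions inferred from the examples rather than stated rules: each word contributes as many rotations as its original length, and a word whose length does not divide the maximum length is simply repeated cyclically and truncated (the text does not cover that case).

```python
def extend(word, length):
    """Repeat word cyclically until it reaches the given length."""
    return "".join(word[i % len(word)] for i in range(length))


def shift_transpose(words):
    """Extend every word to the longest length, then list its cyclic shifts."""
    width = max(len(w) for w in words)
    rows = []
    for w in words:
        extended = extend(w, width)
        # One row per character of the original word
        rows.extend(extended[i:] + extended[:i] for i in range(len(w)))
    return rows


print(shift_transpose(["b", "an", "an", "a"]))
# ['bb', 'an', 'na', 'an', 'na', 'aa']
```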
Sorting arranges the strings generated by shift transposition in lexicographic order. For the above example, the sorted result is 'aa', 'an', 'an', 'bb', 'na', 'na'.
Taking the last column of the sorted result gives the output of the BWT method without a suffix constant. For this example, the output is L = 'annbaa', with no suffix constant.
In essence, the forward transform of this method uses the Lyndon word partition to hide the starting-sequence information needed by the inverse transform inside the output sequence itself, instead of exposing it as a suffix constant.
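The partition, shift transposition, sorting and last-column steps described above can be combined into one forward transform. The following is a hedged end-to-end sketch, not the patented hardware: all function names are hypothetical, the rotation test is strict, extension is by cyclic repetition, and a non-empty input is assumed. The intermediate values returned mirror what the cache modules of the hardware system would hold.

```python
def is_lyndon(w):
    # Strictly smaller than every nontrivial rotation (assumed tie handling)
    return all(w < w[i:] + w[:i] for i in range(1, len(w)))


def search_lyndon_words(s):
    # Partition stage: grow each word until the next prefix is not a Lyndon word
    words, start = [], 0
    while start < len(s):
        end = start + 1
        while end < len(s) and is_lyndon(s[start:end + 1]):
            end += 1
        words.append(s[start:end])
        start = end
    return words


def transpose(words):
    # Transposition stage: extend to the maximum length, then rotate
    width = max(len(w) for w in words)
    rows = []
    for w in words:
        extended = "".join(w[i % len(w)] for i in range(width))
        rows.extend(extended[i:] + extended[:i] for i in range(len(w)))
    return rows


def bwts_forward(s):
    words = search_lyndon_words(s)         # contents of the word cache
    lengths = [len(w) for w in words]      # contents of the length cache
    rows = sorted(transpose(words))        # sorted transposition result
    output = "".join(r[-1] for r in rows)  # last column, no index constant
    return words, lengths, rows, output


words, lengths, rows, output = bwts_forward("banana")
print(rows)    # ['aa', 'an', 'an', 'bb', 'na', 'na']
print(output)  # annbaa
```

Applied to the string "icanucan", the same sketch returns the words 'i', 'c', 'anuc', 'an', the lengths 1, 1, 4, 2 and the output 'ncuciaan'.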
Fig. 4 shows a BWTS method and a hardware implementation system thereof provided by the present invention, comprising: an input buffer module for temporarily storing the string to be processed, synchronizing data input with data processing, and outputting the string to the Lyndon word search module after processing; a Lyndon word search module for finding the longest Lyndon words in the string from the input buffer module, outputting each longest Lyndon word found to the Lyndon word cache module, and outputting the length of each longest Lyndon word to the Lyndon word length cache module; a Lyndon word cache module for temporarily storing the Lyndon words output by the Lyndon word search module for use by the transposition module; a Lyndon word length cache module for temporarily storing the lengths and number of all Lyndon words found by the Lyndon word search module for use by the sorting module; a transposition module for transposing all Lyndon words found by the Lyndon word search module and storing the transposition result in the transposition cache module; a transposition cache module for temporarily storing the transposition result output by the transposition module for use by the sorting module; a sorting module for sorting all strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module; and an output buffer module for temporarily storing the output string for use by subsequent modules.
Fig. 5 shows a structural diagram of the Lyndon word search module of the hardware implementation system provided by the present invention. As shown in Fig. 5, the Lyndon word search module further comprises: a string fetch submodule for fetching characters from the input buffer module and recording the length of the string fetched so far, wherein characters are read in one at a time starting from the first character of the string, each time a character is added the string is passed to the subsequent modules for Lyndon word judgment, and when a Lyndon word is found the string length is input to the Lyndon word length cache module, the length is reset to zero, and the next fetch starts from the last character fetched this time; a shift submodule for inputting the string from the string fetch submodule to the Lyndon word judging submodule, shifting the string successively, and inputting all shifted strings to the N*N register; an N*N register for storing all shifted strings of the string under test from the shift submodule, for processing by the Lyndon word judging submodule; and a Lyndon word judging submodule for successively taking the shifted strings out of the N*N register and comparing them with the original string, wherein the number of comparisons equals the length of the string under test, and if the comparison shows that the original string is lexicographically smallest, the string is a Lyndon word and is output to the Lyndon word cache module.
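As a rough software model of the interplay between the shift submodule, the N*N register and the judging submodule, the sketch below fills a list with the shifted strings and then compares each one with the original. The function names are hypothetical. The model stores only the nontrivial shifts and therefore performs one comparison fewer than the string length given in the text; that simplification and the strict comparison are assumptions.

```python
def fill_shift_register(candidate):
    """Shift submodule: produce every left rotation except the unshifted string."""
    return [candidate[i:] + candidate[:i] for i in range(1, len(candidate))]


def judge(candidate, register):
    """Judging submodule: the candidate must be smaller than every stored shift."""
    return all(candidate < shifted for shifted in register)


register = fill_shift_register("anuc")
print(register)                                      # ['nuca', 'ucan', 'canu']
print(judge("anuc", register))                       # True
print(judge("anuca", fill_shift_register("anuca")))  # False: 'aanuc' is smaller
```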
Fig. 6 shows a structural diagram of the transposition module of the hardware implementation system provided by the present invention. As shown in Fig. 6, the transposition module further comprises: a longest Lyndon word length determination submodule for determining, from the contents of the Lyndon word length cache module, the length of the longest Lyndon word of the processed string and passing this value to the string extension submodule; a string extension submodule for extending all Lyndon words in the Lyndon word cache module to the length of the longest Lyndon word for use by the cyclic shift submodule; and a cyclic shift submodule for cyclically shifting, one after another, the Lyndon words from the string extension submodule and storing them in the transposition cache module.
Fig. 7 shows a structural diagram of the sorting module of the hardware implementation system provided by the present invention. As shown in Fig. 7, the sorting module comprises: a sorting submodule for sorting the strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module; and a BWTS result acquisition module for reading out the last column of the sorting submodule's result as the output of the BWTS method and storing it in the output buffer module.
Specific embodiment: take the string "icanucan" as an example.
"icanucan" is first stored in the input buffer module 102. The string fetch submodule 202 fetches the character "i" and inputs it to the shift submodule 204 and the Lyndon word judging submodule 208. Since "i" has length 1, no shift is needed. The string fetch submodule 202 then fetches the character "c" and inputs "ic" to the shift submodule 204 and the Lyndon word judging submodule 208. The shift submodule 204 shifts "ic" to "ci" and stores it in the N*N register. The Lyndon word judging submodule compares the two: in lexicographic order ic > ci, so "ic" is clearly not a Lyndon word. The subsequent extensions "ica", "ican", ... are not Lyndon words either. The longest Lyndon word at this point is therefore "i", which is stored in the Lyndon word cache module 108, and its length 1 is stored in the Lyndon word length cache module 106. Characters are then fetched starting from "c": the string fetch submodule 202 fetches the character "c" and inputs it to the shift submodule 204 and the Lyndon word judging submodule 208. Since "c" has length 1, no shift is needed. The string fetch submodule 202 fetches the character "a" and inputs "ca" to the shift submodule 204 and the Lyndon word judging submodule 208. The shift submodule 204 shifts "ca" to "ac" and stores it in the N*N register.
The Lyndon word judging submodule compares the two: in lexicographic order ca > ac, so "ca" is clearly not a Lyndon word. The subsequent extensions "can", "canu", ... are not Lyndon words either. The longest Lyndon word at this point is therefore "c", which is stored in the Lyndon word cache module 108, and its length 1 is stored in the Lyndon word length cache module 106. Characters are then fetched starting from "a": the string fetch submodule 202 fetches the character "a" and inputs it to the shift submodule 204 and the Lyndon word judging submodule 208. Since "a" has length 1, no shift is needed. The string fetch submodule 202 fetches the character "n" and inputs "an" to the shift submodule 204 and the Lyndon word judging submodule 208. The shift submodule 204 shifts "an" to "na" and stores it in the N*N register. The Lyndon word judging submodule compares the two: in lexicographic order an < na, so "an" is a Lyndon word, but whether it is the longest Lyndon word requires further judgment. The string fetch submodule 202 fetches the character "u" and inputs "anu" to the shift submodule 204 and the Lyndon word judging submodule 208. The shift submodule 204 shifts "anu" to "nua" and "uan" and stores them in the N*N register. The Lyndon word judging submodule compares them: in lexicographic order anu < nua < uan, so "anu" is a Lyndon word, but whether it is the longest Lyndon word requires further judgment. The string fetch submodule 202 fetches the character "c" and inputs "anuc" to the shift submodule 204 and the Lyndon word judging submodule 208. The shift submodule 204 shifts "anuc" to "nuca", "ucan" and "canu" and stores them in the N*N register. The Lyndon word judging submodule compares them: in lexicographic order anuc < canu < nuca < ucan, so "anuc" is a Lyndon word, but whether it is the longest Lyndon word requires further judgment. The string fetch submodule 202 then fetches the character "a" and inputs "anuca" to the shift submodule 204 and the Lyndon word judging submodule 208. The shift submodule 204 shifts "anuca" to "nucaa", "ucaan", "caanu" and "aanuc" and stores them in the N*N register. The Lyndon word judging submodule compares them: in lexicographic order anuca > aanuc, so "anuca" is clearly not a Lyndon word. The subsequent extension "anucan" is not a Lyndon word either. The longest Lyndon word at this point is therefore "anuc", which is stored in the Lyndon word cache module 108, and its length 4 is stored in the Lyndon word length cache module 106.
Reading then resumes from "a": the string fetch submodule 202 fetches the character "a" and inputs it to the shift submodule 204 and the Lyndon word judging submodule 208. Since "a" has length 1, no shift is needed. The string fetch submodule 202 fetches the character "n" and inputs "an" to the shift submodule 204 and the Lyndon word judging submodule 208. The shift submodule 204 shifts "an" to "na" and stores it in the N*N register. The Lyndon word judging submodule compares the two: in lexicographic order an < na, so "an" is a Lyndon word. Since the end of the string has been reached, "an" is stored in the Lyndon word cache module and its length 2 is stored in the Lyndon word length cache module. The detection of the whole sequence is now complete. The longest Lyndon words held in the Lyndon word cache module are "i", "c", "anuc", "an", and the contents of the Lyndon word length cache module are 1, 1, 4, 2.
The transposition stage follows. The longest Lyndon Word length discrimination submodule 302 first reads the contents 1, 1, 4, 2 of the Lyndon Word length cache module 106 and finds the maximum, 4. Using the lengths 1, 1, 4, 2 supplied by submodule 302, the string extension submodule 304 reads the Lyndon Words "i", "c", "anuc", "an" one by one and extends each to the maximum length 4: "i" becomes "iiii", "c" becomes "cccc", "anuc" already has length 4 and needs no extension, and "an" becomes "anan". The cyclic shift submodule 306 reads the extended strings "iiii", "cccc", "anuc", "anan", cyclically shifts them, and stores every sequence produced during shifting in the transposition cache module 112. Shifting "iiii" reproduces the original sequence, so no shift is needed; the same holds for "cccc". "anuc" is shifted, and "nuca", "ucan", "canu" are stored. "anan" is shifted, and "nana" is stored. The sorting module 114 reads the sequences held in the transposition cache module: "iiii", "cccc", "anuc", "nuca", "ucan", "canu", "anan", "nana". The sorting submodule 402 arranges all the sequences in lexicographic order, giving "anan", "anuc", "canu", "cccc", "iiii", "nana", "nuca", "ucan". The BWTS result acquisition module 404 takes the last character of each of these sequences to form the BWTS output "ncuciaan", which is sent to the output buffer module 116.
The process can be summarized as follows:
icanucan => [i][c][anuc][an] => [anan] => ncuciaan
                                [anuc]
                                [canu]
                                [cccc]
                                [iiii]
                                [nana]
                                [nuca]
                                [ucan]
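The transposition and sorting stages summarized above can be sketched in a few lines. This is an illustrative model under stated assumptions, not the hardware design: `bwts_from_factors` is a hypothetical name, and padding every rotation to the longest Lyndon Word length is taken directly from the worked example; whether that padding orders every possible input correctly is not verified here.

```python
def bwts_from_factors(factors: list[str]) -> str:
    """Pad every rotation of every factor to the longest length, sort, read the last column."""
    width = max(len(f) for f in factors)
    rows = []
    for factor in factors:
        for k in range(len(factor)):
            rotation = factor[k:] + factor[:k]
            repeats = -(-width // len(rotation))  # ceiling division
            padded = (rotation * repeats)[:width]
            # Keep the final character of the unpadded rotation; it equals the
            # padded row's last character whenever the length divides `width`.
            rows.append((padded, rotation[-1]))
    rows.sort(key=lambda row: row[0])  # lexicographic order; ties keep input order
    return "".join(last for _, last in rows)


output = bwts_from_factors(["i", "c", "anuc", "an"])
```

With the four Lyndon Words of the example, `output` is "ncuciaan". A word of length k contributes exactly k rows, which is why "iiii" and "cccc" appear once each and "anan" contributes only "anan" and "nana".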
To achieve the above objects, the present invention also provides an improved BWT data compression method, comprising: inputting a character string; temporarily storing the string to be processed in an input buffer module, which synchronizes data input with data processing and, once processing is complete, outputs the string to a Lyndon Word search module; searching, by the Lyndon Word search module, for the longest Lyndon Words in the string from the input buffer module, outputting the longest Lyndon Words found to a Lyndon Word cache module, and outputting the length of each longest Lyndon Word to a Lyndon Word length cache module; temporarily storing, by the Lyndon Word cache module, the Lyndon Words output by the Lyndon Word search module for use by a transposition module; temporarily storing, by the Lyndon Word length cache module, the lengths and the number of all Lyndon Words found by the Lyndon Word search module for use by a sorting module; completing, by the transposition module, the transposition of all Lyndon Words found by the Lyndon Word search module and storing the transposition result in a transposition cache module; temporarily storing, by the transposition cache module, the transposition result output by the transposition module for use by the sorting module; sorting, by the sorting module, all strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in an output buffer module; and temporarily storing the output string in the output buffer module for use by subsequent modules.
Searching for the Lyndon Words of the data block further comprises: taking characters from the input buffer module and recording the length of the string taken so far, reading in one character at a time from the first character of the string, and, each time a character is added, passing the string to the subsequent modules for Lyndon Word judgment; if a Lyndon Word is found, writing the string length to the Lyndon Word length cache module and resetting the length to zero, the next string then being taken starting from the last character of the string taken this time; inputting the string from the take-string submodule to the Lyndon Word judging submodule, shifting the string step by step, and writing all shifted strings into the N*N register; storing, in the N*N register, all shifted strings of the string under judgment from the shift submodule for processing by the Lyndon Word judging submodule; taking the shifted strings one by one from the N*N register and comparing each with the original string, wherein the number of comparisons equals the length of the string under judgment, and, if the comparisons show that the original string is lexicographically smallest, the string is a Lyndon Word and is output to the Lyndon Word cache module.
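The rotation-by-rotation test described above needs a number of comparisons equal to the string length for every candidate prefix. When checking a software model of this search, it can help to have an independent reference. The sketch below uses Duval's algorithm, a well-known linear-time factorization that is not part of the patent and is named here only as a cross-check; `duval_factors` is a hypothetical name.

```python
def duval_factors(text: str) -> list[str]:
    """Factor `text` into non-increasing Lyndon words with Duval's algorithm."""
    factors = []
    n = len(text)
    i = 0
    while i < n:
        j, k = i + 1, i
        # Extend while text[i:j] remains a prefix of a power of a Lyndon word.
        while j < n and text[k] <= text[j]:
            k = i if text[k] < text[j] else k + 1
            j += 1
        # Emit the completed copies of the Lyndon word of length j - k.
        while i <= k:
            factors.append(text[i:i + j - k])
            i += j - k
    return factors


reference = duval_factors("icanucan")
```

For "icanucan", `reference` is the same list of words, "i", "c", "anuc", "an", that the longest-prefix search in the worked example produces.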
Completing the transposition of all Lyndon Words further comprises: determining, from the contents of the Lyndon Word length cache module, the length of the longest Lyndon Word of the string being processed, and sending that value to the string extension submodule; extending all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule; and cyclically shifting, in turn, the Lyndon Words from the string extension submodule and storing them in the transposition cache module.
Sorting all transposed strings in lexicographic order further comprises: sorting the strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module; and reading out the last column of the sorting submodule's result as the output of the BWTS method and storing it in the output buffer module.
The improved BWT data compression method and its hardware implementation system disclosed by the present invention remove the requirement of the existing BWT method that a constant generated by the forward transform be available before the string can be recovered. This change can improve the compression ratio in data compression applications and, in hardware implementations in particular, facilitates block-wise processing in subsequent steps. In addition, the BWT algorithm is often used for channel coding, but if the constant it generates is corrupted or lost during transmission over a noisy channel, the whole string cannot be recovered. The present algorithm solves this problem: the inverse transform of the improved BWT algorithm always starts from character zero, so an error in one character affects only one or two characters of the string rather than the string as a whole. For example, in the example above, 'icanucan' is output as 'ncuciaan' by the improved BWT algorithm; if a transmission error turns this into 'ncuchaan', the inverse transform yields 'hcanucan', in which only one character is wrong, greatly reducing the error rate.
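The error-locality example can be tried out with a small software inverse. This section does not spell out the inverse transform, so the sketch below is an assumption: it builds the usual last-to-first mapping with a stable sort and follows each cycle starting from row zero, then reverses the collected characters. The names `inverse_bwts` and `mismatches` are hypothetical.

```python
def inverse_bwts(last_column: str) -> str:
    """Invert the transform without any stored index by walking cycles from row zero."""
    n = len(last_column)
    # Stable sort: order[row] is the position in last_column that supplies
    # the first character of sorted row `row`.
    order = sorted(range(n), key=lambda i: last_column[i])
    next_row = [0] * n
    for row, position in enumerate(order):
        next_row[position] = row
    visited = [False] * n
    reversed_text = []
    for start in range(n):
        row = start
        while not visited[row]:
            visited[row] = True
            reversed_text.append(last_column[row])
            row = next_row[row]
    return "".join(reversed(reversed_text))


def mismatches(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


original = inverse_bwts("ncuciaan")
damaged = inverse_bwts("ncuchaan")
```

Here `original` is "icanucan", and `damaged` differs from it in exactly one position. A substitution that changes the relative order of the characters would alter the mapping and could disturb more positions, so this run illustrates the single example rather than a general bound.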
Those skilled in the art will appreciate that the present invention may involve one or more devices for performing the operations described herein. Such a device may be specially designed and manufactured for the required purpose, or may comprise known devices in a general-purpose computer that is selectively activated or reconfigured by a program stored in it. Such a computer program may be stored in a device-readable (for example, computer-readable) medium or in any type of medium suitable for storing electronic instructions and coupled to a bus, including but not limited to any type of disk (including floppy disks, hard disks, optical discs, CD-ROMs and magneto-optical disks), random access memory (RAM), read-only memory (ROM), electrically programmable ROM, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, magnetic cards or optical cards. A readable medium includes any mechanism for storing or transmitting information in a form readable by a device (for example, a computer). For example, a readable medium includes random access memory (RAM), read-only memory (ROM), magnetic disk storage media, optical storage media, flash memory devices, and signals propagated in electrical, optical, acoustic or other forms (such as carrier waves, infrared signals and digital signals).
Those skilled in the art will appreciate that computer program instructions can be used to implement each block of these structure diagrams and/or block diagrams and/or flow diagrams, as well as combinations of blocks in them. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer or other programmable data processing apparatus to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing apparatus create means for implementing what is specified in the block or blocks of the structure diagrams and/or block diagrams and/or flow diagrams.
Those skilled in the art will appreciate that the steps, measures and schemes in the various operations, methods and flows discussed in the present invention may be alternated, changed, combined or deleted. Further, other steps, measures and schemes in the various operations, methods and flows discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted. Further, steps, measures and schemes in the prior art that correspond to the various operations, methods and flows disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A hardware implementation system for an improved BWT data compression method, characterized by comprising:
an input buffer module, for temporarily storing the character string to be processed, synchronizing data input with data processing, and outputting the string to the Lyndon Word search module once processing is complete;
a Lyndon Word search module, for searching for the longest Lyndon Words in the string from the input buffer module, outputting the longest Lyndon Words found to the Lyndon Word cache module, and outputting the length of each longest Lyndon Word to the Lyndon Word length cache module;
a Lyndon Word cache module, for temporarily storing the Lyndon Words output by the Lyndon Word search module for use by the transposition module;
a Lyndon Word length cache module, for temporarily storing the lengths and the number of all Lyndon Words found by the Lyndon Word search module for use by the sorting module;
a transposition module, for completing the transposition of all Lyndon Words found by the Lyndon Word search module and storing the transposition result in the transposition cache module;
a transposition cache module, for temporarily storing the transposition result output by the transposition module for use by the sorting module;
a sorting module, for sorting all strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module;
an output buffer module, for temporarily storing the output string for use by subsequent modules.
2. The system according to claim 1, characterized in that the Lyndon Word search module further comprises:
a take-string submodule, for taking characters from the input buffer module and recording the length of the string taken so far, reading in one character at a time from the first character of the string, and, each time a character is added, passing the string to the subsequent modules for Lyndon Word judgment; if a Lyndon Word is found, the string length is written to the Lyndon Word length cache module and the length is reset to zero, the next string then being taken starting from the last character of the string taken this time;
a shift submodule, for inputting the string from the take-string submodule to the Lyndon Word judging submodule, shifting the string step by step, and writing all shifted strings into the N*N register;
an N*N register, for storing all shifted strings of the string under judgment from the shift submodule for processing by the Lyndon Word judging submodule;
a Lyndon Word judging submodule, for taking the shifted strings one by one from the N*N register and comparing each with the original string, wherein the number of comparisons equals the length of the string under judgment; if the comparisons show that the original string is lexicographically smallest, the string is a Lyndon Word, and the length of each longest Lyndon Word is output to the Lyndon Word length cache module.
3. The system according to claim 1, characterized in that the transposition module further comprises:
a longest Lyndon Word length discrimination submodule, for determining, from the contents of the Lyndon Word length cache module, the length of the longest Lyndon Word of the string being processed, and sending that length to the string extension submodule;
a string extension submodule, for extending all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule;
a cyclic shift submodule, for cyclically shifting, in turn, the Lyndon Words from the string extension submodule and storing them in the transposition cache module.
4. The system according to claim 1, characterized in that the sorting module comprises:
a sorting submodule, for sorting the strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module;
a BWTS result acquisition module, for reading out the last column of the sorting submodule's result as the output of the BWTS method and storing it in the output buffer module.
5. An improved BWT data compression method, characterized by comprising:
inputting a character string, temporarily storing the string to be processed in an input buffer module, synchronizing data input with data processing, and outputting the string to a Lyndon Word search module once processing is complete;
searching, by the Lyndon Word search module, for the longest Lyndon Words in the string from the input buffer module, outputting the longest Lyndon Words found to a Lyndon Word cache module, and outputting the length of each longest Lyndon Word to a Lyndon Word length cache module;
temporarily storing, by the Lyndon Word cache module, the Lyndon Words output by the Lyndon Word search module for use by a transposition module;
temporarily storing, by the Lyndon Word length cache module, the lengths and the number of all Lyndon Words found by the Lyndon Word search module for use by a sorting module;
completing, by the transposition module, the transposition of all Lyndon Words found by the Lyndon Word search module, and storing the transposition result in a transposition cache module;
temporarily storing, by the transposition cache module, the transposition result output by the transposition module for use by the sorting module;
sorting, by the sorting module, all strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in an output buffer module;
temporarily storing the output string in the output buffer module for use by subsequent modules.
6. The method according to claim 5, characterized in that the search performed by the Lyndon Word search module further comprises:
taking characters from the input buffer module and recording the length of the string taken so far, reading in one character at a time from the first character of the string, and, each time a character is added, passing the string to the subsequent modules for Lyndon Word judgment; if a Lyndon Word is found, writing the string length to the Lyndon Word length cache module and resetting the length to zero, the next string then being taken starting from the last character of the string taken this time;
inputting the string from the take-string submodule to the Lyndon Word judging submodule, shifting the string step by step, and writing all shifted strings into the N*N register;
storing, in the N*N register, all shifted strings of the string under judgment from the shift submodule for processing by the Lyndon Word judging submodule;
taking the shifted strings one by one from the N*N register and comparing each with the original string, wherein the number of comparisons equals the length of the string under judgment; if the comparisons show that the original string is lexicographically smallest, the string is a Lyndon Word, and the length of each longest Lyndon Word is output to the Lyndon Word length cache module.
7. The method according to claim 5, characterized in that completing the transposition of all Lyndon Words further comprises:
determining, from the contents of the Lyndon Word length cache module, the length of the longest Lyndon Word of the string being processed, and sending that length to the string extension submodule;
extending all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule;
cyclically shifting, in turn, the Lyndon Words from the string extension submodule and storing them in the transposition cache module.
8. The method according to claim 5, characterized in that sorting all transposed strings in lexicographic order further comprises:
sorting the strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module;
reading out the last column of the sorting submodule's result as the output of the BWTS method and storing it in the output buffer module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410571262.3A CN104284189B (en) | 2014-10-23 | 2014-10-23 | A kind of improved BWT data compression methods and its system for implementing hardware |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410571262.3A CN104284189B (en) | 2014-10-23 | 2014-10-23 | A kind of improved BWT data compression methods and its system for implementing hardware |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104284189A CN104284189A (en) | 2015-01-14 |
CN104284189B true CN104284189B (en) | 2017-06-16 |
Family
ID=52258598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410571262.3A Active CN104284189B (en) | 2014-10-23 | 2014-10-23 | A kind of improved BWT data compression methods and its system for implementing hardware |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104284189B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105005464B (en) * | 2015-07-02 | 2017-10-10 | 东南大学 | A kind of Burrows Wheeler mapping hardware processing units |
CN107342102B (en) * | 2016-04-29 | 2021-04-27 | 上海磁宇信息科技有限公司 | MRAM chip with search function and search method |
CN116821967B (en) * | 2023-08-30 | 2023-11-21 | 山东远联信息科技有限公司 | Intersection computing method and system for privacy protection |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6674908B1 (en) * | 2002-05-04 | 2004-01-06 | Edward Lasar Aronov | Method of compression of binary data with a random number generator |
CN103117748A (en) * | 2013-01-29 | 2013-05-22 | 中国科学院计算技术研究所 | Method and system for sequencing suffixes in BWT (burrows-wheeler transform) implementation method |
CN103810228A (en) * | 2012-11-01 | 2014-05-21 | 辉达公司 | System, method, and computer program product for parallel reconstruction of a sampled suffix array |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130019029A1 (en) * | 2011-07-13 | 2013-01-17 | International Business Machines Corporation | Lossless compression of a predictive data stream having mixed data types |
-
2014
- 2014-10-23 CN CN201410571262.3A patent/CN104284189B/en active Active
Non-Patent Citations (2)
Title |
---|
On Two-Dimensional Lyndon Words; S. Marcus; International Symposium on String Processing and Information Retrieval; 20131009; full text * |
FPGA Implementation of High-Speed Data Compression and Caching; Wang Ning; Microcomputer Information; 20080603; Vol. 24, No. 8; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN104284189A (en) | 2015-01-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |