CN104284189B - An improved BWT data compression method and its hardware implementation system - Google Patents


Publication number
CN104284189B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410571262.3A
Other languages
Chinese (zh)
Other versions
CN104284189A (en)
Inventor
李冰
陈帅
董乾
刘勇
赵霞
王刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention provides a hardware implementation system for the BWTS method, comprising: an input buffer module, which temporarily stores the character string to be processed and synchronizes data input with data processing; a Lyndon Word search module, which finds the Lyndon Words of a data block; a Lyndon Word cache module, which temporarily stores the Lyndon Words; a Lyndon Word length cache module, which temporarily stores the lengths and the number of all Lyndon Words; a transposition module, which performs the transposition of all Lyndon Words; a transposition cache module, which temporarily stores the transposition result; a sorting module, which sorts all transposed character strings in lexicographic order and takes the last column as the output of the BWTS method; and an output buffer module, which temporarily stores the output character string for use by subsequent modules. The improved BWT data compression method and its hardware implementation system disclosed by the invention remove the requirement of existing BWT methods that a constant generated by the forward transform be available before the character string can be recovered, thereby improving the operating efficiency of the data compression method.

Description

An improved BWT data compression method and its hardware implementation system
Technical field
The present invention relates to the technical field of data compression, and in particular to an improved BWT data compression method and its hardware implementation system.
Background
Data compression has long been a research focus in information science and is widely used in data storage and transmission. Although storage capacity keeps expanding and network transfer speeds keep rising, the diversity and explosive growth of data make efficient compression an important means of reducing storage and transmission costs. Data compression is divided into lossless and lossy compression. Lossy compression tolerates a certain degree of information loss and is widely used in multimedia interactive systems, video transmission services, home entertainment and similar fields. Lossless compression is a reversible coding based on the principle of information entropy: it removes redundancy from the source without affecting its entropy, so the compressed information can be restored exactly. It is widely used in remote sensing image processing, medical image processing, the preservation and analysis of historical archives, and many hybrid image compression methods. Removing as much redundancy as possible is the goal of lossless compression. At present, the main performance metrics for compression methods are compression ratio and compression speed. The BWT is a transform that Mike Burrows refined from an idea of David Wheeler and successfully applied to practical data compression; it is a current research focus in lossless compression. BWT is a reversible data transform that operates on data blocks; its core idea is to sort and transform the character matrix obtained by rotating the input string. The transform itself does not reduce the amount of data, but the transformed data is easier to compress, so BWT serves as a preprocessing step before compression.
Fig. 1 shows Bzip2, an efficient open-source data compression system of the prior art based on the BWT method. As shown in Fig. 1, after the BWT method the character string S contains runs of identical characters; after the subsequent MTF (move-to-front) step, the result consists of runs of 0 and a series of small integers, which further reduces the overall entropy. Finally, Huffman coding compresses the data in the form of a binary tree with minimum weighted path length, giving a higher compression ratio. In addition, because of the similarity between BWT and suffix arrays, BWT is used for string matching in the FM-index method. The BWT method gives the data within a character block stronger cohesion, that is, identical characters are gathered together, and this property lets the subsequent compression method achieve a better compression ratio. The method removes the restriction that compression must process data as a stream and makes block-wise processing possible, which was a revolutionary advance in lossless compression. BWT is also applied in bioinformatics, for whole-genome alignment, genome annotation and measuring the distance between two genome sequences. In communication systems it is often used for channel coding.
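The effect of the MTF step on runs of identical characters can be illustrated with a minimal Python sketch. The function name `mtf_encode` and the explicit `alphabet` argument are assumptions made for illustration; they are not taken from the patent or from the Bzip2 sources.

```python
def mtf_encode(text, alphabet):
    """Move-to-front coding: emit each symbol's current index in a table,
    then move that symbol to the front of the table."""
    table = list(alphabet)          # hypothetical initial symbol table
    out = []
    for ch in text:
        idx = table.index(ch)       # position of the symbol right now
        out.append(idx)
        table.insert(0, table.pop(idx))  # move the symbol to the front
    return out


# A run of identical characters turns into a run of zeros.
print(mtf_encode("aaabbb", "ab"))
```

Repeated characters always map to 0 after their first occurrence, which is why the clustered output of BWT suits this step.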
Fig. 2 is a schematic diagram of data compression based on the prior-art BWT method. As shown in Fig. 2, it illustrates the basic principle of the BWT method, which processes data in blocks. Suppose the input character string (block) of length n is S='ABRACA'. Cyclically shifting S forms an n*n matrix M; sorting the rows of M in lexicographic order gives the matrix Q. The last column of Q is the output sequence L='CARAAB', and the position (row number) of the source string in Q is the output constant index=1. In many applications, however, BWT is used as an early processing stage: a character string of length n becomes, after the BWT method, a character string of length n plus a constant. The existence of this constant is inconvenient for many subsequent processing steps. For example, when BWT is used for channel coding, the string cannot be recovered if the constant is lost to noise. When it is used for lossless compression, a string of length n becomes a string of length n+1 after processing, which changes the entropy of the string. At present, no research on a suffix-free BWT method (one that outputs no trailing constant) has been found at home or abroad.
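The prior-art transform described above can be sketched in a few lines of Python. This is a naive illustration that builds the full n*n matrix; the name `bwt_forward` is hypothetical, and the row number is counted from 0, which matches index=1 in the example.

```python
def bwt_forward(s):
    """Naive BWT: build all cyclic shifts (matrix M), sort them (matrix Q),
    and return the last column of Q plus the row of the source string."""
    n = len(s)
    m = [s[i:] + s[:i] for i in range(n)]      # rows of M
    q = sorted(m)                              # rows of Q, lexicographic order
    last_column = "".join(row[-1] for row in q)
    index = q.index(s)                         # 0-based row number of S in Q
    return last_column, index


print(bwt_forward("ABRACA"))
```

The returned pair makes the drawback visible: the string alone is not enough for recovery, the index must travel with it.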
In view of this, to address the problem of current BWT methods, it is necessary to propose an improved BWT data compression method that removes the requirement of existing BWT methods that a constant generated by the forward transform be available before the character string can be recovered, thereby improving the operating efficiency of the data compression method.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention aims to propose an improved BWT data compression method that removes the requirement of existing BWT methods that a constant generated by the forward transform be available before the character string can be recovered, thereby improving the operating efficiency of the data compression method.
To achieve the above object, the present invention provides a BWTS method and its hardware implementation system, comprising: an input buffer module for temporarily storing the character string to be processed, synchronizing data input with data processing, and outputting the string to the Lyndon Word search module once processing is done; a Lyndon Word search module for finding the longest Lyndon Words in the string from the input buffer module, outputting each longest Lyndon Word found to the Lyndon Word cache module, and outputting the length of each longest Lyndon Word to the Lyndon Word length cache module; a Lyndon Word cache module for temporarily storing the Lyndon Words output by the Lyndon Word search module for use by the transposition module; a Lyndon Word length cache module for temporarily storing the lengths and the number of all Lyndon Words found by the Lyndon Word search module for use by the sorting module; a transposition module for transposing all Lyndon Words found by the Lyndon Word search module and storing the result in the transposition cache module; a transposition cache module for temporarily storing the transposition result output by the transposition module for use by the sorting module; a sorting module for sorting all strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module; and an output buffer module for temporarily storing the output string for use by subsequent modules.
The Lyndon Word search module further comprises: a string-fetch submodule for taking characters from the input buffer module and recording the length of the string taken so far; it reads in character by character from the first character of the string and, each time a character is added, passes the string to the subsequent modules for Lyndon Word judgment; if a Lyndon Word is found, its length is input to the Lyndon Word length cache module, the length counter is reset to zero, and the next fetch starts from the last character of the string taken this time; a shift submodule for inputting the string from the string-fetch submodule to the Lyndon Word judgment submodule, shifting the string step by step, and inputting all shifted strings to the N*N register; an N*N register for storing all shifted strings of the string under judgment from the shift submodule, for processing by the Lyndon Word judgment submodule; and a Lyndon Word judgment submodule for taking the shifted strings one by one from the N*N register and comparing each with the original string, where the number of comparisons equals the length of the string under judgment; if the comparisons show that the original string is lexicographically smallest, the string is a Lyndon Word and is output to the Lyndon Word cache module.
The transposition module further comprises: a longest Lyndon Word length determination submodule for determining, from the contents of the Lyndon Word length cache module, the length of the longest Lyndon Word of the processed string and passing that value to the string extension submodule; a string extension submodule for extending all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule; and a cyclic shift submodule for cyclically shifting, one after another, the Lyndon Words from the string extension submodule and storing them in the transposition cache module.
The sorting module comprises: a sorting submodule for sorting the strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module; and a BWTS result acquisition module for reading out the last column of the sorting submodule's result as the output of the BWTS method and storing it in the output buffer module.
To achieve the above object, the present invention also provides an improved BWT data compression method, comprising: inputting a character string, temporarily storing the string to be processed in the input buffer module, synchronizing data input with data processing, and outputting the string to the Lyndon Word search module once processing is done; finding, by the Lyndon Word search module, the longest Lyndon Words in the string from the input buffer module, outputting each longest Lyndon Word found to the Lyndon Word cache module, and outputting the length of each longest Lyndon Word to the Lyndon Word length cache module; temporarily storing, in the Lyndon Word cache module, the Lyndon Words output by the Lyndon Word search module for use by the transposition module; temporarily storing, in the Lyndon Word length cache module, the lengths and the number of all Lyndon Words found by the Lyndon Word search module for use by the sorting module; transposing, by the transposition module, all Lyndon Words found by the Lyndon Word search module and storing the result in the transposition cache module; temporarily storing, in the transposition cache module, the transposition result output by the transposition module for use by the sorting module; sorting, by the sorting module, all strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module; and temporarily storing the output string in the output buffer module for use by subsequent modules.
Finding the Lyndon Words of the data block further comprises: taking characters from the input buffer module and recording the length of the string taken so far, reading in character by character from the first character of the string, and passing the string to the subsequent modules for Lyndon Word judgment each time a character is added; if a Lyndon Word is found, its length is input to the Lyndon Word length cache module, the length counter is reset to zero, and the next fetch starts from the last character of the string taken this time; inputting the string from the string-fetch submodule to the Lyndon Word judgment submodule, shifting the string step by step, and inputting all shifted strings to the N*N register; storing in the N*N register all shifted strings of the string under judgment from the shift submodule, for processing by the Lyndon Word judgment submodule; and taking the shifted strings one by one from the N*N register and comparing each with the original string, where the number of comparisons equals the length of the string under judgment; if the comparisons show that the original string is lexicographically smallest, the string is a Lyndon Word and is output to the Lyndon Word cache module.
Completing the transposition of all Lyndon Words further comprises: determining, from the contents of the Lyndon Word length cache module, the length of the longest Lyndon Word of the processed string and passing that value to the string extension submodule; extending all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule; and cyclically shifting, one after another, the Lyndon Words from the string extension submodule and storing them in the transposition cache module.
Sorting all transposed strings in lexicographic order further comprises: sorting the strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module; and reading out the last column of the sorting submodule's result as the output of the BWTS method and storing it in the output buffer module.
The improved BWT data compression method and its hardware implementation system disclosed by the invention remove the requirement of existing BWT methods that a constant generated by the forward transform be available before the character string can be recovered, thereby improving the operating efficiency of the data compression method.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
Fig. 1 shows Bzip2, an efficient open-source data compression system of the prior art based on the BWT method;
Fig. 2 is a schematic diagram of data compression based on the prior-art BWT method;
Fig. 3 is a typical schematic diagram of Lyndon Word division;
Fig. 4 shows the hardware implementation system of the improved BWT data compression method provided by the present invention;
Fig. 5 is a structural diagram of the Lyndon Word search module of the hardware implementation system provided by the present invention;
Fig. 6 is a structural diagram of the transposition module of the hardware implementation system provided by the present invention;
Fig. 7 is a structural diagram of the sorting module of the hardware implementation system provided by the present invention.
Detailed description
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting the claims.
Those skilled in the art will appreciate that, unless expressly stated otherwise, the singular forms "a", "an", "the" and "said" used herein may also include the plural. It should further be understood that the word "comprising" used in this specification refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Those skilled in the art will appreciate that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the prior art and, unless defined as here, should not be interpreted in an idealized or overly formal sense.
The present invention proposes an improved BWT data compression method (referred to as the "BWTS method") comprising two parts, Lyndon Word division and transposition, as follows:
1. Longest Lyndon word division
Lyndon words were introduced in 1954 by the mathematician Roger Lyndon, who called them standard lexicographic sequences. A Lyndon Word is a string that is lexicographically smallest among all of its cyclic shifts. (Lexicographic order, or dictionary order, arranges sequences in alphabetical order, or for digits from small to large, in ascending order.)
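The definition above translates directly into a brute-force check. The sketch below assumes the usual strict reading, in which a Lyndon Word must be strictly smaller than every one of its proper cyclic shifts (so 'aa' does not qualify); the function name `is_lyndon` is chosen for illustration.

```python
def is_lyndon(word):
    """Return True if `word` is strictly smaller, in lexicographic order,
    than each of its proper cyclic shifts."""
    if not word:
        return False
    # Compare the original string with every shift by 1 .. len(word)-1.
    return all(word < word[i:] + word[:i] for i in range(1, len(word)))


print(is_lyndon("an"), is_lyndon("ba"), is_lyndon("anuc"))
```

A single character has no proper shifts, so it is always a Lyndon Word, consistent with the examples in this description.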
Fig. 3 is a typical schematic diagram of Lyndon Word division. As shown in Fig. 3, the longest Lyndon word in the sense of the present invention is found by extending backward from the first character of the string until the longest Lyndon word is found; the search for the next longest Lyndon word then starts from the character following it, and so on until the end of the string. The division is described in detail using the string 'banana' as an example. With S='banana', the character 'b' is read first; a single character is obviously a Lyndon word, so the search continues for the longest Lyndon word starting with that character. The character 'a' is read next; 'ba' is obviously not a Lyndon word. Once a non-Lyndon string is detected, strings starting with 'b' need not be examined any further, so 'b' is the longest Lyndon word starting with 'b'. The next step detects the longest Lyndon word starting from 'a', the character after 'b': 'a' is obviously a Lyndon word; 'n' is read and 'an' is also a Lyndon word; 'a' is read again and 'ana' is not a Lyndon word, so 'an' is the longest Lyndon word starting with that 'a'. The next step starts from the character 'a' following the longest Lyndon word 'an'; as in the previous detection, 'an' is the longest Lyndon word starting from that 'a'. The next step starts from the character 'a' following this longest Lyndon word; since the string ends there, 'a' is the longest Lyndon word. Thus for the input S='banana', the Lyndon word output is 'b', 'an', 'an', 'a'.
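The division procedure can be sketched as follows. This is a brute-force illustration, not the hardware design: the names `is_lyndon` and `longest_lyndon_division` are hypothetical, and the sketch deliberately examines every prefix instead of stopping at the first non-Lyndon prefix, which is a conservative choice of this example.

```python
def is_lyndon(word):
    """Strictly smaller than all proper cyclic shifts (assumed definition)."""
    return bool(word) and all(
        word < word[i:] + word[:i] for i in range(1, len(word))
    )


def longest_lyndon_division(s):
    """Cut `s` into pieces by repeatedly taking the longest prefix of the
    remaining text that is a Lyndon Word."""
    words = []
    start = 0
    while start < len(s):
        longest = 1                      # a single character always qualifies
        for end in range(start + 2, len(s) + 1):
            if is_lyndon(s[start:end]):
                longest = end - start    # remember the longest hit so far
        words.append(s[start:start + longest])
        start += longest                 # continue after the word just found
    return words


print(longest_lyndon_division("banana"))
```

Checking every prefix matters for inputs such as 'aab', where the prefix 'aa' is not a Lyndon Word but the longer prefix 'aab' is.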
2. Shift transposition and sorting
Shift transposition brings every longest Lyndon word produced by the division to the same length. Taking S='banana' as an example, the result of its longest Lyndon word division is 'b', 'an', 'an', 'a'. The longest string is 'an', with length 2, so strings shorter than 2 are extended to length 2 by cyclic shifting: 'b' is extended to 'bb' and 'a' is extended to 'aa'. The longest string 'an' is cyclically shifted to form a 2*2 matrix, that is, 'an', 'na'. The shift transposition output is therefore 'bb', 'an', 'na', 'an', 'na', 'aa'.
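A minimal sketch of the shift transposition step follows. It rests on one reading of the text, stated here as an assumption: each Lyndon word contributes one row per cyclic shift of the original word, and each row is padded to the maximum length by repeating itself cyclically. The name `shift_transpose` is hypothetical.

```python
def shift_transpose(words):
    """Produce equal-length rows: every cyclic shift of every Lyndon word,
    extended by cyclic repetition to the length of the longest word."""
    max_len = max(len(w) for w in words)
    rows = []
    for w in words:
        repeats = -(-max_len // len(w))          # ceiling division
        for i in range(len(w)):
            rotation = w[i:] + w[:i]             # i-th cyclic shift
            rows.append((rotation * repeats)[:max_len])
    return rows


print(shift_transpose(["b", "an", "an", "a"]))
```

When every word length divides the maximum length, as in both examples of this description, extending first and shifting afterwards gives the same rows.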
Sorting arranges the strings produced by the shift transposition in lexicographic order. For the example above, the sorted result is 'aa', 'an', 'an', 'bb', 'na', 'na'.
Taking the last column of the sorted result gives the output of the suffix-free BWT method. For this example the output is L='annbaa', with no suffix constant.
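The sorting and last-column steps can be checked with a short sketch that starts from the shift transposition output listed above. The function name `sort_and_take_last_column` is an assumption made for illustration.

```python
def sort_and_take_last_column(rows):
    """Sort equal-length rows lexicographically and read off the last column."""
    ordered = sorted(rows)
    output = "".join(row[-1] for row in ordered)
    return ordered, output


rows = ["bb", "an", "na", "an", "na", "aa"]   # shift transposition of 'banana'
ordered, output = sort_and_take_last_column(rows)
print(ordered)
print(output)
```

Unlike the prior-art transform, nothing but the output string is returned.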
The essence of the forward transform is that, through Lyndon word division, the starting sequence needed by the inverse transform is hidden in the output sequence itself instead of being indicated by a suffix constant.
Fig. 4 shows the BWTS method and its hardware implementation system provided by the present invention, comprising: an input buffer module for temporarily storing the character string to be processed, synchronizing data input with data processing, and outputting the string to the Lyndon Word search module once processing is done; a Lyndon Word search module for finding the longest Lyndon Words in the string from the input buffer module, outputting each longest Lyndon Word found to the Lyndon Word cache module, and outputting the length of each longest Lyndon Word to the Lyndon Word length cache module; a Lyndon Word cache module for temporarily storing the Lyndon Words output by the Lyndon Word search module for use by the transposition module; a Lyndon Word length cache module for temporarily storing the lengths and the number of all Lyndon Words found by the Lyndon Word search module for use by the sorting module; a transposition module for transposing all Lyndon Words found by the Lyndon Word search module and storing the result in the transposition cache module; a transposition cache module for temporarily storing the transposition result output by the transposition module for use by the sorting module; a sorting module for sorting all strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module; and an output buffer module for temporarily storing the output string for use by subsequent modules.
Fig. 5 is a structural diagram of the Lyndon Word search module of the hardware implementation system provided by the present invention. As shown in Fig. 5, the Lyndon Word search module further comprises: a string-fetch submodule for taking characters from the input buffer module and recording the length of the string taken so far; it reads in character by character from the first character of the string and, each time a character is added, passes the string to the subsequent modules for Lyndon Word judgment; if a Lyndon Word is found, its length is input to the Lyndon Word length cache module, the length counter is reset to zero, and the next fetch starts from the last character of the string taken this time; a shift submodule for inputting the string from the string-fetch submodule to the Lyndon Word judgment submodule, shifting the string step by step, and inputting all shifted strings to the N*N register; an N*N register for storing all shifted strings of the string under judgment from the shift submodule, for processing by the Lyndon Word judgment submodule; and a Lyndon Word judgment submodule for taking the shifted strings one by one from the N*N register and comparing each with the original string, where the number of comparisons equals the length of the string under judgment; if the comparisons show that the original string is lexicographically smallest, the string is a Lyndon Word and is output to the Lyndon Word cache module.
Fig. 6 is a structural diagram of the transposition module of the hardware implementation system provided by the present invention. As shown in Fig. 6, the transposition module further comprises: a longest Lyndon Word length determination submodule for determining, from the contents of the Lyndon Word length cache module, the length of the longest Lyndon Word of the processed string and passing that value to the string extension submodule; a string extension submodule for extending all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule; and a cyclic shift submodule for cyclically shifting, one after another, the Lyndon Words from the string extension submodule and storing them in the transposition cache module.
Fig. 7 is a structural diagram of the sorting module of the hardware implementation system provided by the present invention. As shown in Fig. 7, the sorting module comprises: a sorting submodule for sorting the strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module; and a BWTS result acquisition module for reading out the last column of the sorting submodule's result as the output of the BWTS method and storing it in the output buffer module.
Specific embodiment: take the character string "icanucan" as an example.
"icanucan" is first stored in the input buffer 102. The string-fetch submodule 202 takes the character "i" and inputs it to the shift submodule and the Lyndon Word judgment submodule 208. Since "i" has length 1, no shift is needed. The string-fetch submodule 202 takes the character "c" and inputs "ic" to the shift submodule and the Lyndon Word judgment submodule 208. The shift submodule 204 shifts "ic" to "ci" and stores it in the N*N register. The Lyndon Word judgment submodule compares them: in lexicographic order ic>ci, so "ic" is not a Lyndon Word. The further extensions "ica", "ican", ... are not Lyndon Words either. The longest Lyndon Word at this point is therefore "i", which is stored in the Lyndon Word cache module 108; its length 1 is stored in the Lyndon Word length cache module 106.
Fetching then restarts from "c". The string-fetch submodule 202 takes the character "c" and inputs it to the shift submodule and the Lyndon Word judgment submodule 208. Since "c" has length 1, no shift is needed. The string-fetch submodule 202 takes the character "a" and inputs "ca" to the shift submodule and the Lyndon Word judgment submodule 208. The shift submodule 204 shifts "ca" to "ac" and stores it in the N*N register. The Lyndon Word judgment submodule compares them: in lexicographic order ca>ac, so "ca" is not a Lyndon Word. The further extensions "can", "canu", ... are not Lyndon Words either. The longest Lyndon Word at this point is therefore "c", which is stored in the Lyndon Word cache module 108; its length 1 is stored in the Lyndon Word length cache module 106.
Fetching then restarts from "a". The string-fetch submodule 202 takes the character "a" and inputs it to the shift submodule and the Lyndon Word judgment submodule 208. Since "a" has length 1, no shift is needed. The string-fetch submodule 202 takes the character "n" and inputs "an" to the shift submodule and the Lyndon Word judgment submodule 208. The shift submodule 204 shifts "an" to "na" and stores it in the N*N register. The Lyndon Word judgment submodule compares them: in lexicographic order an<na, so "an" is a Lyndon Word, but whether it is the longest Lyndon Word still has to be determined. The string-fetch submodule 202 takes the character "u" and inputs "anu" to the shift submodule and the Lyndon Word judgment submodule 208. The shift submodule 204 shifts "anu" to "nua", "uan" and stores them in the N*N register. The comparison gives anu<nua<uan, so "anu" is a Lyndon Word, but whether it is the longest still has to be determined. The string-fetch submodule 202 takes the character "c" and inputs "anuc" to the shift submodule and the Lyndon Word judgment submodule 208. The shift submodule 204 shifts "anuc" to "nuca", "ucan", "canu" and stores them in the N*N register. The comparison gives anuc<canu<nuca<ucan, so "anuc" is a Lyndon Word, but whether it is the longest still has to be determined. The string-fetch submodule 202 then takes the character "a" and inputs "anuca" to the shift submodule and the Lyndon Word judgment submodule 208. The shift submodule 204 shifts "anuca" to "nucaa", "ucaan", "caanu", "aanuc" and stores them in the N*N register. The comparison gives anuca>aanuc, so "anuca" is not a Lyndon Word. The further extension "anucan" is not a Lyndon Word either. The longest Lyndon Word at this point is therefore "anuc", which is stored in the Lyndon Word cache module 108; its length 4 is stored in the Lyndon Word length cache module 106.
Reading then resumes from "a". String-fetching submodule 202 takes the character "a" and inputs it to shift submodule 204 and Lyndon Word judging submodule 208. Since "a" has length 1, no shift is needed. String-fetching submodule 202 takes the character "n" and inputs "an" to shift submodule 204 and Lyndon Word judging submodule 208. Shift submodule 204 shifts "an" to "na" and stores it in the N*N register. The Lyndon Word judging submodule compares them: in lexicographic order an < na, so "an" is a Lyndon Word. Because no input characters remain, "an" is stored in the Lyndon Word cache module and its length 2 is stored in the Lyndon Word length cache module. Detection over the whole sequence is now complete. The longest Lyndon Words held in the Lyndon Word cache module are "i", "c", "anuc", "an", and the Lyndon Word length cache module holds 1, 1, 4, 2.
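The search stage walked through above can be sketched in software as follows. This is a minimal Python sketch under stated assumptions, not the hardware described here: the names `is_lyndon` and `lyndon_factors` are hypothetical, and the list of rotations merely stands in for the contents of the N*N register. As a further assumption that goes beyond the text, the sketch tests every prefix and keeps the longest one that passes, instead of stopping at the first failing prefix; this also covers inputs such as "aab", where "aa" fails the test but "aab" passes.

```python
def is_lyndon(word):
    """Return True if `word` is strictly smaller than each of its proper rotations."""
    # The rotations play the role of the shifted strings held in the N*N register.
    rotations = [word[k:] + word[:k] for k in range(1, len(word))]
    return all(word < rotation for rotation in rotations)


def lyndon_factors(text):
    """Split `text` into its longest Lyndon Words, scanning left to right."""
    factors = []
    start = 0
    while start < len(text):
        longest = 1  # a single character is always a Lyndon Word
        # Assumption: try every longer prefix and remember the longest that passes.
        for end in range(start + 2, len(text) + 1):
            if is_lyndon(text[start:end]):
                longest = end - start
        factors.append(text[start:start + longest])
        start += longest
    return factors


words = lyndon_factors("icanucan")
lengths = [len(w) for w in words]
print(words, lengths)
```

For the input "icanucan" the two lists correspond to the contents of the Lyndon Word cache module and the Lyndon Word length cache module at the end of the search stage.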
The transposition stage follows. Longest Lyndon Word length discrimination submodule 302 first reads the contents 1, 1, 4, 2 of Lyndon Word length cache module 106 and finds the maximum, 4. String extension submodule 304 uses the Lyndon Word lengths 1, 1, 4, 2 from submodule 302 to read the longest Lyndon Words one by one: "i", "c", "anuc", "an". Each is extended to the maximum length 4: "i" becomes "iiii", "c" becomes "cccc", "anuc" already has length 4 and needs no extension, and "an" becomes "anan". Cyclic shift submodule 306 reads the extended strings "iiii", "cccc", "anuc", "anan", cyclically shifts each of them, and stores every sequence produced during shifting in transposition cache module 112. Shifting "iiii" reproduces the original sequence, so no shift is needed, and the same holds for "cccc". Shifting "anuc" yields "nuca", "ucan", "canu", which are stored, and shifting "anan" yields "nana", which is stored. Sorting module 114 reads the sequences held in the transposition cache module: "iiii", "cccc", "anuc", "nuca", "ucan", "canu", "anan", "nana". Sorting submodule 402 arranges all sequences in lexicographic order, giving: "anan", "anuc", "canu", "cccc", "iiii", "nana", "nuca", "ucan".
BWTS result acquisition module 404 takes the last character of each sorted sequence to form the BWTS output "ncuciaan", which is sent to output buffer module 116.
The process can be summarized as follows: icanucan => [i][c][anuc][an] => [anan][anuc][canu][cccc][iiii][nana][nuca][ucan] => ncuciaan
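The transposition and sorting stages of the worked example can be sketched as follows. This is an illustrative Python sketch only: the function name `bwts_from_factors` is hypothetical, and the Lyndon Words are passed in directly rather than read from a cache. Each rotation is extended by cyclic repetition to the length of the longest word, and the output character is taken from the end of the un-extended rotation. That coincides with the last character of the extended string whenever the word length divides the maximum length, as it does in this example; other length combinations are not covered by the text and are outside the scope of the sketch.

```python
def bwts_from_factors(factors):
    """Rotate, extend, sort, and read the last column, following the worked example."""
    width = max(len(word) for word in factors)
    rows = []
    for word in factors:
        # One row per rotation; a one-character word contributes a single row.
        for k in range(len(word)):
            rotation = word[k:] + word[:k]
            # Repeat the rotation cyclically until it has `width` characters.
            extended = (rotation * (width // len(rotation) + 1))[:width]
            rows.append((extended, rotation[-1]))
    # Sort by the extended string only; Python's sort is stable for equal keys.
    rows.sort(key=lambda row: row[0])
    return "".join(last for _, last in rows)


print(bwts_from_factors(["i", "c", "anuc", "an"]))
```

With the four Lyndon Words of "icanucan" as input, the sorted rows are the eight sequences listed in the summary, and the returned string is their last column.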
To achieve the above objects, the present invention also provides an improved BWT data compression method, comprising: inputting a character string, temporarily storing the character string to be processed in the input buffer module, synchronizing data input with data processing, and outputting the character string to the Lyndon Word search module once processing is complete; searching, by the Lyndon Word search module, for the longest Lyndon Words in the character string from the input buffer module, outputting each longest Lyndon Word found to the Lyndon Word cache module, and outputting the length of each longest Lyndon Word to the Lyndon Word length cache module; temporarily storing, in the Lyndon Word cache module, the Lyndon Words output by the Lyndon Word search module for use by the transposition module; temporarily storing, in the Lyndon Word length cache module, the lengths and number of all Lyndon Words found by the Lyndon Word search module for use by the sorting module; completing, by the transposition module, the transposition of all Lyndon Words from the Lyndon Word search module and temporarily storing the transposition result in the transposition cache module; temporarily storing, in the transposition cache module, the transposition result output by the transposition module for use by the sorting module; sorting, by the sorting module, all character strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and temporarily storing it in the output buffer module; and temporarily storing, in the output buffer module, the output character string for use by subsequent modules.
Searching for the Lyndon Words of the data block further comprises: taking characters from the input buffer module and recording the length of the character string taken so far, reading in character by character from the first character of the string, and passing the string to the subsequent module for Lyndon Word judgment each time one character is added; if a Lyndon Word is found, inputting the string length to the Lyndon Word length cache module and resetting the length to zero, the next fetch starting from the last character of the string fetched this time; inputting the character string from the string-fetching submodule to the Lyndon Word judging submodule, shifting the character string step by step, and inputting all shifted strings to the N*N register; storing, in the N*N register, all shifted strings of the character string under test received from the shift submodule, for processing by the Lyndon Word judging submodule; and taking the shifted strings out of the N*N register one by one and comparing each with the original character string, wherein the number of comparisons equals the length of the character string under test, and if the comparison shows that the original character string is the smallest in lexicographic order, the character string is a Lyndon Word and is output to the Lyndon Word cache module.
Completing the transposition of all Lyndon Words further comprises: determining, from the contents of the Lyndon Word length cache module, the length of the longest Lyndon Word of the processed character string, and sending that value to the string extension submodule; extending all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule; and cyclically shifting, one after another, the Lyndon Words from the string extension submodule and storing the results in the transposition cache module.
Sorting all transposed character strings in lexicographic order further comprises: sorting the character strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module; and reading out the last column of the sorting submodule's result as the output of the BWTS method and temporarily storing it in the output buffer module.
The improved BWT data compression method and hardware implementation system disclosed by the present invention remove the requirement of the existing BWT method that a constant generated by the forward transform is needed to recover the character string. On the one hand, this change can improve the compression ratio in data compression applications and, in hardware implementations of data compression in particular, makes block-wise processing in subsequent steps more convenient. In addition, the BWT algorithm is often used in channel coding; if the constant it generates is corrupted or lost during transmission over a noisy channel, the whole character string cannot be recovered. The present algorithm solves this problem: the inverse transform of the improved BWT algorithm always starts from character zero, so an error in one character affects only one or two characters of the string and does not affect the string as a whole. In the example above, 'icanucan' is transformed by the improved BWT algorithm into 'ncuciaan'. If a transmission error turns this output into 'ncuchaan', the inverse transform produces 'hcanucan', in which only one character is wrong, greatly reducing the error rate.
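The inverse transform is not spelled out in this section, so the following is only a sketch of one standard way to invert a BWTS output without any stored constant; the function name `inverse_bwts` is hypothetical, and the cycle-following procedure is an assumption rather than the procedure of the invention. It pairs each character of the sorted first column with the matching occurrence in the received last column, follows the resulting cycles starting from position zero, and concatenates the recovered words in reverse order.

```python
def inverse_bwts(last):
    """Recover the original string from a BWTS last column (assumed standard inverse)."""
    n = len(last)
    first = sorted(last)
    # Stable sort of positions: successor[i] is the position in `last` that
    # supplies the i-th character of the sorted first column.
    successor = sorted(range(n), key=lambda j: last[j])
    seen = [False] * n
    words = []
    for start in range(n):  # always begins at position zero
        if seen[start]:
            continue
        word = []
        i = start
        while not seen[i]:
            seen[i] = True
            word.append(first[i])
            i = successor[i]
        words.append("".join(word))
    # Cycles are found from smallest to largest word, so reverse to restore order.
    return "".join(reversed(words))


print(inverse_bwts("ncuciaan"))
print(inverse_bwts("ncuchaan"))
```

Under this sketch, 'ncuciaan' decodes back to 'icanucan', and the corrupted string 'ncuchaan' decodes to 'hcanucan', which differs from the original in a single character.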
Those skilled in the art will appreciate that the present invention may involve one or more devices for performing the operations described herein. Such a device may be specially designed and manufactured for the required purpose, or may comprise a known device in a general-purpose computer that is selectively activated or reconfigured by a program stored in it. Such a computer program may be stored in a device-readable (for example, computer-readable) medium or in any type of medium suitable for storing electronic instructions and coupled to a bus, including but not limited to any type of disk (including floppy disks, hard disks, optical discs, CD-ROMs, and magneto-optical disks), random access memory (RAM), read-only memory (ROM), electrically programmable ROM, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, magnetic cards, or optical cards. A readable medium includes any mechanism for storing or transmitting information in a form readable by a device (for example, a computer). For example, readable media include random access memory (RAM), read-only memory (ROM), magnetic disk storage media, optical storage media, flash memory devices, and signals propagated in electrical, optical, acoustic, or other form (such as carrier waves, infrared signals, and digital signals).
Those skilled in the art will appreciate that each block in these structure diagrams and/or block diagrams and/or flow diagrams, and any combination of blocks in them, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing apparatus create means for implementing what is specified in the block or blocks of the structure diagrams and/or block diagrams and/or flow diagrams.
Those skilled in the art will appreciate that the steps, measures, and schemes in the various operations, methods, and flows discussed in the present invention may be alternated, changed, combined, or deleted. Further, other steps, measures, and schemes in the various operations, methods, and flows discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted. Further, steps, measures, and schemes in the prior art that correspond to the various operations, methods, and flows disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted.
The above describes only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A hardware implementation system for an improved BWT data compression method, characterized by comprising:
an input buffer module, for temporarily storing the character string to be processed, synchronizing data input with data processing, and outputting the character string to the Lyndon Word search module once processing is complete;
a Lyndon Word search module, for searching for the longest Lyndon Words in the character string from the input buffer module, outputting each longest Lyndon Word found to the Lyndon Word cache module, and outputting the length of each longest Lyndon Word to the Lyndon Word length cache module;
a Lyndon Word cache module, for temporarily storing the Lyndon Words output by the Lyndon Word search module for use by the transposition module;
a Lyndon Word length cache module, for temporarily storing the lengths and number of all Lyndon Words found by the Lyndon Word search module for use by the sorting module;
a transposition module, for completing the transposition of all Lyndon Words from the Lyndon Word search module and temporarily storing the transposition result in the transposition cache module;
a transposition cache module, for temporarily storing the transposition result output by the transposition module for use by the sorting module;
a sorting module, for sorting all character strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and temporarily storing it in the output buffer module;
an output buffer module, for temporarily storing the output character string for use by subsequent modules.
2. The system according to claim 1, characterized in that the Lyndon Word search module further comprises:
a string-fetching submodule, for taking characters from the input buffer module and recording the length of the character string taken so far, reading in character by character from the first character of the string, and passing the string to the subsequent module for Lyndon Word judgment each time one character is added; if a Lyndon Word is found, the string length is input to the Lyndon Word length cache module and the length is reset to zero, the next fetch starting from the last character of the string fetched this time;
a shift submodule, for inputting the character string from the string-fetching submodule to the Lyndon Word judging submodule, shifting the character string step by step, and inputting all shifted strings to the N*N register;
an N*N register, for storing all shifted strings of the character string under test received from the shift submodule, for processing by the Lyndon Word judging submodule;
a Lyndon Word judging submodule, for taking the shifted strings out of the N*N register one by one and comparing each with the original character string, wherein the number of comparisons equals the length of the character string under test; if the comparison shows that the original character string is the smallest in lexicographic order, the character string is a Lyndon Word, and the length of each longest Lyndon Word is output to the Lyndon Word length cache module.
3. The system according to claim 1, characterized in that the transposition module further comprises:
a longest Lyndon Word length discrimination submodule, for determining, from the contents of the Lyndon Word length cache module, the length of the longest Lyndon Word of the processed character string, and sending that length to the string extension submodule;
a string extension submodule, for extending all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule;
a cyclic shift submodule, for cyclically shifting, one after another, the Lyndon Words from the string extension submodule and storing the results in the transposition cache module.
4. The system according to claim 1, characterized in that the sorting module comprises:
a sorting submodule, for sorting the character strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module;
a BWTS result acquisition module, for reading out the last column of the sorting submodule's result as the output of the BWTS method and temporarily storing it in the output buffer module.
5. An improved BWT data compression method, characterized by comprising:
inputting a character string, temporarily storing the character string to be processed in the input buffer module, synchronizing data input with data processing, and outputting the character string to the Lyndon Word search module once processing is complete;
searching, by the Lyndon Word search module, for the longest Lyndon Words in the character string from the input buffer module, outputting each longest Lyndon Word found to the Lyndon Word cache module, and outputting the length of each longest Lyndon Word to the Lyndon Word length cache module;
temporarily storing, in the Lyndon Word cache module, the Lyndon Words output by the Lyndon Word search module for use by the transposition module;
temporarily storing, in the Lyndon Word length cache module, the lengths and number of all Lyndon Words found by the Lyndon Word search module for use by the sorting module;
completing, by the transposition module, the transposition of all Lyndon Words from the Lyndon Word search module and temporarily storing the transposition result in the transposition cache module;
temporarily storing, in the transposition cache module, the transposition result output by the transposition module for use by the sorting module;
sorting, by the sorting module, all character strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and temporarily storing it in the output buffer module;
temporarily storing, in the output buffer module, the output character string for use by subsequent modules.
6. The method according to claim 5, characterized in that searching for the longest Lyndon Words by the Lyndon Word search module further comprises:
taking characters from the input buffer module and recording the length of the character string taken so far, reading in character by character from the first character of the string, and passing the string to the subsequent module for Lyndon Word judgment each time one character is added; if a Lyndon Word is found, inputting the string length to the Lyndon Word length cache module and resetting the length to zero, the next fetch starting from the last character of the string fetched this time;
inputting the character string from the string-fetching submodule to the Lyndon Word judging submodule, shifting the character string step by step, and inputting all shifted strings to the N*N register;
storing, in the N*N register, all shifted strings of the character string under test received from the shift submodule, for processing by the Lyndon Word judging submodule;
taking the shifted strings out of the N*N register one by one and comparing each with the original character string, wherein the number of comparisons equals the length of the character string under test; if the comparison shows that the original character string is the smallest in lexicographic order, the character string is a Lyndon Word, and the length of each longest Lyndon Word is output to the Lyndon Word length cache module.
7. The method according to claim 5, characterized in that completing the transposition of all Lyndon Words further comprises:
determining, from the contents of the Lyndon Word length cache module, the length of the longest Lyndon Word of the processed character string, and sending that length to the string extension submodule;
extending all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule;
cyclically shifting, one after another, the Lyndon Words from the string extension submodule and storing the results in the transposition cache module.
8. The method according to claim 5, characterized in that sorting all transposed character strings in lexicographic order further comprises:
sorting the character strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module;
reading out the last column of the sorting submodule's result as the output of the BWTS method and temporarily storing it in the output buffer module.
CN201410571262.3A 2014-10-23 2014-10-23 A kind of improved BWT data compression methods and its system for implementing hardware Active CN104284189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410571262.3A CN104284189B (en) 2014-10-23 2014-10-23 A kind of improved BWT data compression methods and its system for implementing hardware

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410571262.3A CN104284189B (en) 2014-10-23 2014-10-23 A kind of improved BWT data compression methods and its system for implementing hardware

Publications (2)

Publication Number Publication Date
CN104284189A CN104284189A (en) 2015-01-14
CN104284189B true CN104284189B (en) 2017-06-16

Family

ID=52258598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410571262.3A Active CN104284189B (en) 2014-10-23 2014-10-23 A kind of improved BWT data compression methods and its system for implementing hardware

Country Status (1)

Country Link
CN (1) CN104284189B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant