CN102831222A

CN102831222A - Differential compression method based on data de-duplication

Info

Publication number: CN102831222A
Application number: CN2012103036504A
Authority: CN
Inventors: 冯丹; 夏文; 江泓; 田磊; 付忞
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2012-08-24
Filing date: 2012-08-24
Publication date: 2012-12-19
Anticipated expiration: 2032-08-24
Also published as: CN102831222B

Abstract

The invention discloses a differential (delta) compression method based on data de-duplication. The method comprises: partitioning the files in a data stream into a plurality of data blocks; computing the fingerprint of each data block for duplicate-data lookup; grouping all the data blocks to establish data block groups and their doubly linked lists; looking up the fingerprint of each data block in each group to perform data de-duplication and determine whether the block is a duplicate; performing a locality-based similar-data search on each de-duplicated group using the duplicate-data information in its doubly linked list, that is, treating the non-duplicate data blocks adjacent to duplicate data blocks as potentially similar data blocks; verifying the similarity of these blocks by delta compression; and finally performing a supplementary similarity-based search for similar data within the group. The method offers fast similar-data lookup, low computation and indexing overhead, and high data compression efficiency.

Description

A delta (differential) compression method based on data de-duplication

Technical field

The invention belongs to the field of data compression in computer storage and, more specifically, relates to a delta compression method based on data de-duplication.

Background technology

In recent years, with the development and spread of computer and network technology, the amount of data stored worldwide has grown explosively. Although the price of storage devices keeps falling, it falls far more slowly than data volume expands. Data de-duplication (Data Deduplication), a technique that effectively eliminates redundant data at large scale, has therefore become a focus of storage-system research. Put simply, data de-duplication is an important technique that reduces data storage cost by eliminating redundant data at large scale. For example, suppose a core department must back up 200 GB of data every day, which amounts to 73 TB a year, while the data actually modified each day is less than 1 GB; a great deal of redundant data is thus backed up and stored repeatedly. Traditional backup storage techniques cannot identify redundant data in the backup stream, so they back up large amounts of duplicate data, needlessly wasting network bandwidth and storage space and lowering the storage efficiency of backup and archiving. As the number of backups and the volume of backup data grow rapidly, the storage system holds more and more redundant data, and the storage and management resources consumed by redundant data multiply. Data de-duplication meets exactly this need: by effectively identifying and eliminating duplicate data, it reduces the overhead of data storage management and improves the utilization of storage resources.

However, as data de-duplication has developed, it has also encountered many challenges. Because traditional data de-duplication decides whether data is duplicated by comparing data block fingerprints, it can only identify data blocks that are completely identical and cannot identify data blocks that are merely very similar. For example, if two data blocks A1 and A2 differ in only a few bytes, they are almost identical yet produce completely different fingerprints, so data de-duplication leaves such similar data unprocessed. Delta compression has therefore been proposed and applied to this situation. Delta compression is an efficient data compression technique that compresses a data block A_i to a high degree against a similar reference data block A_r; the more similar the two blocks, the higher the compression efficiency. As shown in the formula, A_r and A_i are input to the delta encoder, which outputs delta data Δ_{r,i} representing the compressed version of A_i. To decompress A_i, it suffices to read the delta data Δ_{r,i} and the reference data block A_r, from which A_i can be computed.
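The round trip between A_r, A_i and Δ_{r,i} can be illustrated with a minimal Python sketch. It is not the encoder used by the invention: it assumes the standard-library `difflib.SequenceMatcher` as a stand-in matcher, and the names `delta_encode` and `delta_decode` are hypothetical.

```python
from difflib import SequenceMatcher

def delta_encode(reference: bytes, target: bytes):
    """Return a list of operations that rebuild `target` from `reference`."""
    ops = []
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse bytes of the reference
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))  # literal bytes absent from it
    return ops

def delta_decode(reference: bytes, ops) -> bytes:
    """Rebuild the target block from the reference block and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += reference[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Only the literal bytes of the "insert" operations need to be stored with the delta, so the more of the target that can be copied from the reference, the smaller the delta becomes.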

However, existing delta compression techniques have the following problems: computation is slow and indexing overhead is large; data compression efficiency is low; and scalability is poor. For example, supporting similar-data retrieval over PB-scale data would produce a similar-data index on the order of 10 TB. This metadata is too large to fit in memory, and placing it on disk introduces the bottleneck of slow index lookups. Metadata management and indexing thus severely limit the adoption and development of delta compression algorithms.

Summary of the invention

In view of the defects of the prior art, the object of the present invention is to provide a delta compression method based on data de-duplication that searches for similar data for delta compression by combining the locality and the similarity of the data stream, and that offers fast lookup, low indexing overhead and high data compression efficiency.

To achieve the above object, the invention provides a delta compression method based on data de-duplication, comprising the following steps:

(1) partition the files in the data stream into blocks to obtain a plurality of data blocks;

(2) compute the fingerprint of each data block, to be used for duplicate-data lookup;

(3) group all the data blocks to establish data block groups and their doubly linked lists, and look up the fingerprint of each data block in a data block group for data de-duplication, to determine whether a fingerprint record exists; if a fingerprint record exists, mark the data block as a duplicate data block; otherwise, mark it as a non-duplicate data block;

(4) for a data block group that has undergone data de-duplication, use the duplicate-data information in the doubly linked list of the group to perform a locality-based similar-data search, which comprises the following sub-steps:

(4-1) find the doubly linked list to which the data block group belongs, and set a counter i=1;

(4-2) judge whether the i-th data block in the doubly linked list is a duplicate data block; if so, fetch the doubly linked list of the data block group that contains the duplicate data block C_d corresponding to this block, set j=i, and go to step (4-3); otherwise, set i=i+1 and go to step (4-7); here, the duplicate data block C_d is the reference data block of the duplicate data block;

(4-3) judge whether the (i-1)-th data block is null, or is a duplicate or similar data block; if it is none of these, go to step (4-4); otherwise, go to step (4-5);

(4-4) for the (i-1)-th data block, read its corresponding data block from the doubly linked list of the data block group that contains C_d, and use that block as the reference data block for delta compression; delta-compress the (i-1)-th data block against the reference data block and judge whether the compression efficiency is less than 1/2; if it is less than 1/2, the (i-1)-th data block is not considered a similar data block, and the method goes to step (4-5); if it is greater than or equal to 1/2, mark the (i-1)-th data block as similar to the reference data block, set i=i-1, and return to step (4-3);

(4-5) judge whether the (j+1)-th data block is null, or is a duplicate or similar data block; if it is none of these, go to step (4-6); otherwise, set i=j+1 and go to step (4-7);

(4-6) for the (j+1)-th data block, read its corresponding data block from the doubly linked list of the data block group that contains C_d, and use that block as the reference data block for delta compression; delta-compress the (j+1)-th data block against the reference data block and judge whether the compression efficiency is less than 1/2; if it is less than 1/2, the (j+1)-th data block is not considered a similar data block, and the method sets i=j+1 and goes to step (4-7); if it is greater than or equal to 1/2, mark the (j+1)-th data block as similar to the reference data block, set j=j+1, and return to step (4-5);

(4-7) judge whether the i-th data block is the last data block in the doubly linked list; if so, the process ends; otherwise, return to step (4-2);

(5) perform a supplementary similarity judgment on this data block group;

(6) repeat steps (4) and (5) until all the data block groups divided in step (3) have been processed.
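Sub-steps (4-1) to (4-7) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a Python list stands in for the doubly linked list, compression efficiency is assumed to mean the fraction of bytes saved, and `zlib` with the reference block as a preset dictionary stands in for a real delta compressor. The names `Block`, `saved_fraction` and `locality_search` are hypothetical.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Block:
    data: bytes
    status: str = "new"                      # "new", "dup" or "similar"
    ref: Optional[Tuple[list, int]] = None   # (reference group, index in it)

def saved_fraction(block: bytes, reference: bytes) -> float:
    """Stand-in delta compressor: deflate `block` with `reference` as dictionary."""
    comp = zlib.compressobj(zdict=reference)
    delta = comp.compress(block) + comp.flush()
    return 1.0 - len(delta) / len(block)

def locality_search(group: List[Block], threshold: float = 0.5) -> None:
    i = 0
    while i < len(group):
        if group[i].status != "dup":         # (4-2) not a duplicate: move on
            i += 1
            continue
        ref_group, p = group[i].ref          # C_d sits at index p of its group
        anchor = i
        for step in (-1, +1):                # left: (4-3)/(4-4), right: (4-5)/(4-6)
            k = anchor + step
            while 0 <= k < len(group) and group[k].status == "new":
                q = p + (k - anchor)         # corresponding block near C_d
                if not 0 <= q < len(ref_group):
                    break
                if saved_fraction(group[k].data, ref_group[q].data) < threshold:
                    break                    # not similar: stop extending
                group[k].status, group[k].ref = "similar", (ref_group, q)
                k += step
        i = k                                # first block right of the run
```

After the call, every block marked "similar" carries the position of the reference block it should be delta-compressed against; blocks still marked "new" are left for step (5).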

In step (2), the fingerprint of each data block is computed with the SHA-1, SHA-256 or SHA-512 algorithm.

The size of a data block group in step (3) is 2 MB.

Specifically, in step (4), for a data block group that has undergone data de-duplication, the duplicate-data information in the doubly linked list of the group is used to perform a locality-based similar-data search: the non-duplicate data blocks adjacent to duplicate data blocks are found and regarded as potentially similar data blocks, these blocks are delta-compressed against their corresponding reference data blocks, and the actual similarity between these blocks and the reference data blocks is thereby verified.

Specifically, in step (5), the doubly linked list of the data block group is traversed; for each data block that is neither duplicate nor similar, a low-overhead super-fingerprint of the block is computed, and the super-fingerprint index table is searched for a matching super-fingerprint; if one exists, the reference data block indicated by the matching super-fingerprint is read, the current data block is marked as similar to the reference data block, and the two similar data blocks are delta-compressed; if none exists, traversal of the doubly linked list continues.

Through the above technical scheme, compared with the prior art, the present invention has the following beneficial effects:

1. Through step (4), the invention exploits the locality of the data stream, avoids the cumbersome super-fingerprint computation and lookup matching of traditional methods, and needs only the doubly linked list information of data blocks that an existing data de-duplication system already maintains. This simplifies the similar-data search for delta compression and avoids the computation and indexing overhead of traditional similar-data detection.

2. Through step (5), the invention computes a low-overhead super-fingerprint for the remaining data blocks that are neither duplicate nor similar and uses it for similar-data detection. This supplements the similarity search when locality is poor, maximizes the range of the similar-data search, and improves data storage compression efficiency at small cost.

Description of drawings

Fig. 1 is a flowchart of the delta compression method based on data de-duplication according to the invention.

Fig. 2 is a schematic block diagram of the delta compression system based on data de-duplication according to the invention.

Fig. 3 is a schematic diagram of the duplicate-data lookup in step (3) and the similar-data search in step (4) of the method of the invention.

Fig. 4 is a schematic diagram of the working principle of the system of the invention.

Embodiment

To make the object, technical scheme and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

The delta compression method based on data de-duplication partitions the data stream to be backed up into blocks and groups them, performs data de-duplication, and then uses the de-duplication information to identify similar data blocks. For cases of poor locality, the invention additionally uses a low-overhead super-fingerprint as a supplement. By combining locality and similarity, it finds as much similar data as possible, improves delta compression efficiency, and reduces the similar-data lookup overhead.

The invention calls a plurality of consecutive data blocks a locality unit and stores them contiguously on disk. Locality of a data stream in a storage system means that if data blocks once appeared in the sequence A, B, C, then the next time data block A appears, data blocks B and C are very likely to follow immediately. The invention exploits this locality to find similar data. As shown in Fig. 3, consider the block sequences of two successive backups, B₁, B₂, B₃, B₄, B₅ and E₁, E₂, E₃, E₄, E₅. Data de-duplication determines that B₃ and E₃ are duplicates and that B₄ and E₄ are duplicates, so the data blocks next to E₃ and E₄ are very likely similar data; that is, B₁ and E₁, B₂ and E₂, and B₅ and E₅ are corresponding data blocks with a high probability of being similar. This is because, if B₃ and E₃, and B₄ and E₄, are exact duplicates, file E has a locality relation with file B, and the two files differ in some block fingerprints only because of partial modification and insertion operations. By the locality principle above, the data blocks adjacent to the duplicate data have merely had some bytes modified or deleted, which produced completely different fingerprints. Whether these potentially similar data blocks are in fact similar can be determined by a further delta computation. In our tests and observations, 90% of the non-duplicate data blocks adjacent to duplicate data that are identified in this locality-based way have a similarity greater than 1/2.

For cases of poor locality, or where no duplicate-block information is available, the invention uses a low-overhead super-fingerprint method to supplement the search for similar data blocks, so that nearly all potentially similar data can be found. Traditional super-fingerprint methods are probabilistic and may miss data of low similarity. By combining locality and similarity, that is, by jointly using data de-duplication information and low-overhead super-fingerprints, our method finds more similar data than the traditional super-fingerprint method, with less computation and indexing overhead.

As shown in Figure 1, the present invention is based on the residual quantity compression method of data de-duplication, may further comprise the steps:

In the file block process, the present invention is applicable to content-based elongated piecemeal, and block algorithm is not required, and minute block size is not done requirement yet, 2KB ~ 256KB can, present embodiment adopts average mark block size 8KB.The present invention also is suitable for for the fixed length piecemeal, but better effects if under the elongated piecemeal.
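Since the text does not prescribe a chunking algorithm, the following Python sketch shows only one possible content-defined chunker. It assumes a simple polynomial rolling hash over a 48-byte window and a cut condition on the hash value; the function name `chunk` and all constants other than the 2 KB, 8 KB and 256 KB sizes are hypothetical.

```python
def chunk(data: bytes, min_size=2048, avg_size=8192, max_size=262144, window=48):
    """Split `data` where a rolling hash of the last `window` bytes hits a pattern."""
    BASE, MOD = 257, (1 << 31) - 1
    top = pow(BASE, window - 1, MOD)        # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        pos = i - start                     # offset inside the current chunk
        if pos >= window:
            h = (h - data[i - window] * top) % MOD   # drop the oldest byte
        h = (h * BASE + b) % MOD                     # add the newest byte
        size = pos + 1
        if (size >= min_size and h % avg_size == avg_size - 1) or size >= max_size:
            chunks.append(data[start:i + 1])         # cut after byte i
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                  # trailing partial chunk
    return chunks
```

Because a cut depends only on the bytes inside the window, an insertion early in a file shifts later boundaries along with the content instead of changing every subsequent block, which is why variable-size chunking tends to preserve more duplicates than fixed-size chunking.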

The data block fingerprint can be computed with any secure hash digest algorithm. This embodiment uses SHA-1; other hash digest algorithms with stronger collision resistance, such as SHA-256 or SHA-512, can also be used;
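A fingerprint of this kind can be computed with Python's standard `hashlib` module, as in the short sketch below. The helper name `fingerprint` is hypothetical; the algorithm names are those accepted by `hashlib.new`.

```python
import hashlib

def fingerprint(block: bytes, algorithm: str = "sha1") -> bytes:
    """Return the digest of one data block; "sha256" and "sha512" also work."""
    return hashlib.new(algorithm, block).digest()

# Two blocks that differ in a single byte get unrelated fingerprints, which is
# why fingerprint lookup alone cannot detect similar (non-identical) blocks.
a1 = b"A" * 8192
a2 = b"A" * 8191 + b"B"
assert fingerprint(a1) != fingerprint(a2)
```

A SHA-1 digest is 20 bytes per block, while SHA-256 and SHA-512 produce 32 and 64 bytes, so the stronger algorithms enlarge the fingerprint index accordingly.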

This embodiment uses a group size of 2 MB: consecutive data blocks are collected into one data block group until the total size of their contents exceeds 2 MB. By the data stream locality principle described above, the doubly linked list of the group records the order of the data stream, and this order information is used in step (4) below to exploit data stream locality when searching for similar data;
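The grouping rule can be sketched in a few lines of Python. The sketch assumes that a group is closed as soon as its accumulated content size exceeds the limit, and it uses a plain list, which preserves stream order and gives access to both neighbours, in place of a doubly linked list. The name `group_blocks` is hypothetical.

```python
def group_blocks(blocks, group_size=2 * 1024 * 1024):
    """Collect consecutive blocks into groups of roughly `group_size` bytes."""
    groups, current, total = [], [], 0
    for block in blocks:
        current.append(block)            # keep blocks in stream order
        total += len(block)
        if total > group_size:           # group is full: close it
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)           # last, possibly smaller, group
    return groups
```

With 8 KB average blocks, a 2 MB group holds on the order of 256 blocks, so one group lookup brings a sizeable run of neighbouring fingerprints into scope at once.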

The fingerprint lookup for data de-duplication used in this embodiment is shown in Fig. 2: all fingerprint values are stored on disk, and part of them are kept in memory. For a fingerprint to be checked, the system first looks it up in memory; on a hit, the block is considered a duplicate data block. Otherwise, the on-disk fingerprint index is searched; if the fingerprint is found, the block is considered a duplicate data block, and the whole data block group containing the matching fingerprint is loaded into memory, which raises the memory hit rate of subsequent index accesses. If the fingerprint is not found, the block is considered a non-duplicate data block;
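The two-level lookup can be modelled as below. This is a simplified sketch: ordinary dictionaries stand in for both the on-disk index and the in-memory locality cache, no cache eviction is modelled, and the class and method names are hypothetical.

```python
class FingerprintIndex:
    def __init__(self):
        self.disk_index = {}   # fingerprint -> group id (stand-in for the disk index)
        self.groups = {}       # group id -> fingerprints of the group, in stream order
        self.cache = {}        # in-memory locality cache: fingerprint -> group id

    def add_group(self, group_id, fingerprints):
        """Record a stored data block group and index its fingerprints."""
        self.groups[group_id] = list(fingerprints)
        for fp in fingerprints:
            self.disk_index.setdefault(fp, group_id)

    def is_duplicate(self, fp) -> bool:
        if fp in self.cache:                   # memory hit
            return True
        group_id = self.disk_index.get(fp)     # fall back to the disk index
        if group_id is None:
            return False                       # non-duplicate block
        for other in self.groups[group_id]:    # prefetch the whole group
            self.cache[other] = group_id
        return True
```

Prefetching the whole group on a disk hit is what turns stream locality into cache hits: the fingerprints that followed the matched one in the earlier backup are likely to be requested next.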

(4-7) judge whether the i-th data block is the last data block in the doubly linked list; if so, the process ends; otherwise, return to step (4-2).

This embodiment uses the open-source Xdelta software from the University of California, Berkeley, to compute the delta of similar data; other delta compression algorithms for similar data or files are equally applicable. The compression-efficiency threshold for the similarity judgment can also be set by the user and is not limited to 1/2. The corresponding data blocks referred to in steps (4-4) and (4-6) of this example are as illustrated in Fig. 3 described above: when duplicate data blocks are identified, the correspondence between the duplicate data blocks is established, together with the correspondence of the non-duplicate data blocks around them.

(5) perform a supplementary similarity judgment on this data block group. Specifically, the doubly linked list of the data block group is traversed; for each data block that is neither duplicate nor similar, a low-overhead super-fingerprint of the block is computed, and the super-fingerprint index table is searched for a matching super-fingerprint; if one exists, the reference data block indicated by the matching super-fingerprint is read, the current data block is marked as similar to the reference data block, and the two similar data blocks are delta-compressed; if none exists, traversal of the doubly linked list continues.
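The traversal of step (5) can be sketched as follows. The sketch assumes blocks are represented as dictionaries, that the super-fingerprint function is supplied by the caller, and that a block without a match is added to the index so later blocks can match it; this last point is an assumption that the text does not state. The name `supplement_search` is hypothetical.

```python
def supplement_search(group, sf_index, super_fingerprint):
    """Match leftover blocks through a super-fingerprint index.

    group: list of dicts with keys "data", "status" and "ref"
    sf_index: dict mapping super-fingerprint -> reference block
    super_fingerprint: callable taking bytes and returning a hashable value
    """
    for block in group:                     # traverse the group in stream order
        if block["status"] != "new":        # skip duplicate and similar blocks
            continue
        sf = super_fingerprint(block["data"])
        reference = sf_index.get(sf)
        if reference is not None:
            block["status"] = "similar"     # caller then delta-compresses it
            block["ref"] = reference
        else:
            sf_index[sf] = block            # assumption: index it for later lookups
```

Only the blocks that survive both de-duplication and the locality-based search reach this loop, so the number of super-fingerprints computed and indexed stays small.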

This embodiment uses a low-overhead super-fingerprint algorithm: two values derived from Rabin fingerprints serve as the similarity features (Feature), and the features are then combined to obtain the super-fingerprint of the data block, as shown in the following formulas:

\mathrm{Feature}_i = \max_{j=1}^{N} \left\{ \left( m_i \cdot \mathrm{Rabin}(W_j) + a_i \right) \bmod 2^{32} \right\} \qquad (3)

\mathrm{SuperFingerprint} = \mathrm{Rabin}(\mathrm{Feature}_1, \mathrm{Feature}_2) \qquad (4)

Here Feature _iWhat refer to is exactly i similarity eigenwert of data block (length is N), and SuperFingerprint refers to super fingerprint, Rabin (W _j) expression moving window W _jThe Rabin fingerprint, m _iAnd a _iRepresent the predetermined random number of i group; Here the length of data block is N, so this data block has N moving window, the similarity eigenwert be exactly in the Rabin fingerprint of this N moving window again hash calculation get maximal value.Different i group predetermined value m _iAnd a _iWill produce the similarity eigenwert of i data block.

Traditional super-fingerprint algorithms use multiple (m_i, a_i) pairs to produce four or more similarity features per super-fingerprint, and use three or more super-fingerprints to improve the effectiveness of the similarity search: two data blocks are judged similar as soon as any one super-fingerprint matches. We use fewer features than the traditional method and only one super-fingerprint, so the computational overhead is low and the super-fingerprint index is small. Variations in the number of Rabin-based similarity features and in the number of super-fingerprints are all within the scope of the invention; the invention recommends using as few similarity features and as few super-fingerprints as possible as the basis for similarity matching, to reduce the computation and indexing overhead of super-fingerprints. This is because the locality-based similar-data search of step (4) already finds most of the similar data, and step (5) serves only as a supplementary strategy, so a low-overhead super-fingerprint computation suffices.

The invention finds most of the similar data through the locality method, and the similarity method then supplements it by finding the small fraction of similar data that was missed (as shown in Fig. 4). It has the advantages of low computational overhead, low indexing overhead, and finding more similar data;

As shown in Figure 2; The residual quantity compressibility that the present invention is based on data de-duplication method comprises two functional modules; Be data de-duplication module and residual quantity compression module, wherein the residual quantity compression module mainly is divided into again based on the similar data search module of locality and similar data search module based on similarity.Wherein search and can be called similar data search based on data de-duplication based on locality similar; Similar data search based on similarity can be called the similar data search based on the low super fingerprint of expense again.

In the data structure of internal memory, the data de-duplication module has comprised repeating data index and locality buffer memory.The residual quantity compression module has comprised super fingerprint index and locality buffer memory.Wherein the locality buffer memory has comprised the data chunk of nearest visit, and each data chunk is made up of the metadata of a plurality of data blocks, and the crucial metadata information of each data block has comprised data block fingerprint and the super fingerprint of data block.Wherein the data block fingerprint is used for repeating data and searches, and the super fingerprint of data block is used for similar data search.

The data de-duplication module is responsible for mainly that data stream piecemeal in the storage system calculates, the fingerprint of data block calculates and fingerprint matching such as searches at operation.The data de-duplication module has adopted content-based block algorithm to carry out the content piecemeal, has avoided data insertion or retouching operation to cause the problem of new data boundary shifts.The calculating of data block fingerprint can be adopted various secure hash digest algorithms, and present embodiment adopts SHA-1, also can adopt the stronger secure hash digest algorithm of other anti-collision abilities.Fingerprint matching is searched and can all be put into all finger print informations the data de-duplication index of internal memory and search in the little situation of data scale.Under the big situation of data scale; Whole fingerprints is put into disk; Simultaneously the nearest fingerprint set of visiting of part is imported in the locality buffer memory of internal memory; Like this can be in internal memory the locality of data cached stream, can improve the hit rate of internal storage access index, also help further excavating locality and carry out that similar data are judged and the residual quantity compression.

The residual quantity compression module mainly is responsible for the searching and compress of similar data behind the data de-duplication.It judges that similar data mainly are divided into two stages, excavates existing data de-duplication information and carries out the similar data search stage based on locality; Adopt the super fingerprint that hangs down expense to come the similarity of mining data stream to replenish and search similar data phase.Here to the concrete residual quantity compression algorithm of two similar data blocks, we adopt the Xdelta software of increasing income of University of California Berkeley to calculate the residual quantity of similar data.After the residual quantity compression finished, we just need not store the information of complete similar data block.Only need storage residual quantity data and reference data piece positional information.Can reduce data space like this.

Below our explanation the present invention that cites an actual example; As shown in Figure 4, for one section input Backup Data stream, there is partial content to catch up with inferior the backup and compared modification and inserted operation; We represent the retouching operation here with check design, and we represent with twill to insert operation.At first we do piecemeal and ask fingerprint to handle data stream, shown in preceding step (1) and (2); We search through the data block fingerprint and carry out the repeating data retrieval then, and shown in preceding step (3), judging the 1st, 4,10 data block is the repeating data piece; We carry out the similar data search based on data de-duplication to non-repeating data piece once more, and shown in preceding step (4), we can judge the 2nd, 3,9 data block is similar data block; We are through non-ly repeating non-similar data block and carry out the similar data search based on the super fingerprint of low expense remaining at last, and shown in step (5), we can further judge the 6th, 7 data block is similar data block; For the remaining non-non-similar data block that repeats, we are with " N " expression.Like this, we can come to find to greatest extent similar data with similarity through the locality of excavating in the Backup Data stream; Especially based on the similar data search strategy of locality, can not need calculate the similarity eigenwert and just can find similar data block, the similarity number that has reduced the residual quantity compression it is investigated and changed pin.

Those skilled in the art will readily understand; The above is merely preferred embodiment of the present invention; Not in order to restriction the present invention, all any modifications of within spirit of the present invention and principle, being done, be equal to and replace and improvement etc., all should be included within protection scope of the present invention.

Claims

1. A delta compression method based on data de-duplication, characterized in that it comprises the following steps:

(4-7) judging whether the i-th data block is the last data block in the doubly linked list; if so, the process ends; otherwise, returning to step (4-2);

(5) performing a supplementary similarity judgment on this data block group;

2. The delta compression method according to claim 1, characterized in that, in step (2), the fingerprint of each data block is computed with the SHA-1, SHA-256 or SHA-512 algorithm.

3. The delta compression method according to claim 1, characterized in that the size of a data block group is 2 MB.

4. The delta compression method according to claim 1, characterized in that step (4) is specifically: for a data block group that has undergone data de-duplication, using the duplicate-data information in the doubly linked list of the group to perform a locality-based similar-data search; finding the non-duplicate data blocks adjacent to duplicate data blocks and regarding them as potentially similar data blocks; delta-compressing these data blocks against their corresponding reference data blocks; and verifying the actual similarity between these data blocks and the reference data blocks.

5. The delta compression method according to claim 1, characterized in that step (5) is specifically: traversing the doubly linked list of the data block group; for each data block that is neither duplicate nor similar, computing a low-overhead super-fingerprint of the block and searching the super-fingerprint index table for a matching super-fingerprint; if one exists, reading the reference data block indicated by the matching super-fingerprint, marking the current data block as similar to the reference data block, and delta-compressing the two similar data blocks; if none exists, continuing to traverse the doubly linked list.