CN102831222A - Differential compression method based on data de-duplication - Google Patents

Info

Publication number
CN102831222A
CN102831222A CN2012103036504A CN201210303650A CN102831222A CN 102831222 A CN102831222 A CN 102831222A CN 2012103036504 A CN2012103036504 A CN 2012103036504A CN 201210303650 A CN201210303650 A CN 201210303650A CN 102831222 A CN102831222 A CN 102831222A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012103036504A
Other languages
Chinese (zh)
Other versions
CN102831222B (en)
Inventor
冯丹
夏文
江泓
田磊
付忞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology
Priority to CN201210303650.4A (patent CN102831222B)
Publication of CN102831222A
Application granted
Publication of CN102831222B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a differential compression method based on data de-duplication. The method comprises: partitioning the files in a data stream into multiple data blocks; computing a fingerprint of each data block for duplicate detection; grouping the data blocks to build data block groups and their doubly linked lists; looking up the fingerprint of each data block in each group to determine whether the block is a duplicate; searching locally for similar data in each de-duplicated group using the duplicate information in the group's doubly linked list, that is, treating the non-duplicate data blocks adjacent to duplicate data blocks as potentially similar data blocks; verifying the similarity of these blocks by differential compression; and finally performing a supplementary, similarity-based search for similar data in the group. The method finds similar data quickly, has low computation and indexing overhead, and achieves high data compression efficiency.

Description

A delta (differential) compression method based on data deduplication
Technical field
The invention belongs to the field of data compression in computer storage and, more specifically, relates to a delta compression method based on data deduplication.
Background art
In recent years, with the development and spread of computer and network technology, the amount of data stored worldwide has grown explosively. Although the price of storage devices keeps falling, it cannot keep pace with data growth. Data deduplication, a technology that effectively eliminates redundant data at large scale and thereby reduces the cost of data storage, has become a focus of storage system research. For example, suppose a core department must back up 200 GB of data every day, which amounts to 73 TB per year, while the data actually modified each day is less than 1 GB; a large amount of redundant data is therefore backed up and stored repeatedly. Traditional backup storage technology cannot identify the redundant data in the backup data, so it backs up a large amount of duplicate data, wasting network bandwidth and storage space and reducing the storage efficiency of backup and archiving. As the number of backups and the volume of backup data grow rapidly, the redundant data in the storage system increases, and the storage and management resources spent on it multiply. Data deduplication meets exactly this need: by effectively identifying and eliminating duplicate data, it reduces the cost of data storage management and improves the utilization of storage resources.
However, as data deduplication has developed, it has also run into a number of challenges. Because traditional deduplication identifies duplicates by data block fingerprints, it can only recognize blocks that are completely identical and cannot recognize blocks that are merely very similar. For example, if two data blocks A1 and A2 differ in only a few bytes, they yield completely different fingerprints even though they are almost identical, so deduplication leaves such similar data unprocessed. Delta compression was proposed for this situation. It is an efficient data compression technique that compresses a data block A_i with respect to a similar reference data block A_r; the higher the similarity between the blocks, the higher the compression efficiency. As the formulas show, A_r and A_i are fed to a delta encoder, which outputs delta data Δ_{r,i} representing the compressed version of A_i. To decompress A_i, one reads the delta data Δ_{r,i} and the reference data block A_r and computes A_i from them.
$$\Delta_{r,i} = \mathrm{DeltaEncode}(A_r, A_i) \tag{1}$$
$$A_i = \mathrm{DeltaDecode}(A_r, \Delta_{r,i}) \tag{2}$$
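The encode and decode relationship described above can be illustrated with a minimal Python sketch. It is not the Xdelta algorithm named later in the description: the function names `delta_encode` and `delta_decode` and the copy/insert operation format are assumptions made for illustration, and the matching is delegated to the standard-library `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def delta_encode(reference: bytes, target: bytes) -> list:
    """Return a list of operations that rebuilds `target` from `reference`.

    Hypothetical format: ("copy", start, end) reuses reference[start:end];
    ("insert", data) carries literal bytes that exist only in the target.
    """
    ops = []
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": emit the new bytes
            ops.append(("insert", target[j1:j2]))
        # "delete": bytes of the reference that are simply not copied
    return ops


def delta_decode(reference: bytes, ops: list) -> bytes:
    """Rebuild the target block from the reference block and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += reference[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


a_r = b"the quick brown fox jumps over the lazy dog" * 4
a_i = a_r[:50] + b"XYZ" + a_r[50:]          # A_i differs from A_r by a few bytes
delta = delta_encode(a_r, a_i)
restored = delta_decode(a_r, delta)
```

The more of the target that can be expressed as copy operations, the smaller the delta, which matches the statement that compression efficiency rises with similarity.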
However, existing delta compression techniques have the following problems: computation is slow and the indexing overhead is large; data compression efficiency is low; and scalability is poor. Supporting similarity detection over PB-scale data would produce a similarity index on the order of 10 TB. This metadata is too large to fit in memory, and keeping it on disk makes index lookups a bottleneck. The management and indexing of this metadata have thus seriously limited the adoption and development of delta compression algorithms.
Summary of the invention
In view of the defects of the prior art, the object of the present invention is to provide a delta compression method based on data deduplication. It searches for similar data for delta compression by combining the locality and the similarity of the data stream, and has the advantages of fast search, low indexing overhead and high data compression efficiency.
To achieve the above object, the invention provides a delta compression method based on data deduplication, comprising the following steps:
(1) divide the files in the data stream into blocks to obtain a plurality of data blocks;
(2) compute the fingerprint of each data block, to be used for duplicate detection;
(3) group all the data blocks to build data block groups and their doubly linked lists, and look up the fingerprint of each data block in a data block group in the deduplication index to determine whether a fingerprint record exists; if a record exists, mark the data block as a duplicate data block; if not, mark it as a non-duplicate data block;
(4) for a data block group that has been deduplicated, use the duplicate information in the group's doubly linked list to search for similar data based on locality, specifically comprising the following substeps:
(4-1) locate the doubly linked list of the data block group and set a counter i=1;
(4-2) determine whether the i-th data block in the doubly linked list is a duplicate data block; if so, fetch the doubly linked list of the data block group that contains the corresponding duplicate data block C_d, set j=i, and go to step (4-3); otherwise set i=i+1 and go to step (4-7); here the duplicate data block C_d is the reference data block of the i-th (duplicate) data block;
(4-3) determine whether the (i-1)-th data block is null, or is already a duplicate or similar data block; if none of these holds, go to step (4-4); otherwise go to step (4-5);
(4-4) for the (i-1)-th data block, read the corresponding data block from the doubly linked list of the data block group containing C_d and use it as the reference data block for delta compression; delta-compress the (i-1)-th data block against the reference data block and check whether the compression efficiency is less than 1/2; if it is less than 1/2, the (i-1)-th data block is not considered a similar data block, and the method goes to step (4-5); if it is greater than or equal to 1/2, mark the (i-1)-th data block as similar to the reference data block, set i=i-1, and return to step (4-3);
(4-5) determine whether the (j+1)-th data block is null, or is already a duplicate or similar data block; if none of these holds, go to step (4-6); otherwise set i=j+1 and go to step (4-7);
(4-6) for the (j+1)-th data block, read the corresponding data block from the doubly linked list of the data block group containing C_d and use it as the reference data block for delta compression; delta-compress the (j+1)-th data block against the reference data block and check whether the compression efficiency is less than 1/2; if it is less than 1/2, the (j+1)-th data block is not considered a similar data block, i is set to j+1, and the method goes to step (4-7); if it is greater than or equal to 1/2, mark the (j+1)-th data block as similar to the reference data block, set j=j+1, and return to step (4-5);
(4-7) determine whether the i-th data block is the last data block in the doubly linked list; if so, the process ends; otherwise return to step (4-2).
(5) perform a supplementary similarity check on the data block group;
(6) repeat steps (4) and (5) until all data block groups formed in step (3) have been processed.
In step (2), the fingerprint of each data block is computed with the SHA-1, SHA-256 or SHA-512 algorithm.
In step (3), the size of each data block group is 2 MB.
Step (4) specifically is: for a deduplicated data block group, use the duplicate information in the group's doubly linked list to search for similar data based on locality; the non-duplicate data blocks adjacent to duplicate data blocks are treated as potentially similar data blocks, these blocks are delta-compressed against their corresponding reference data blocks, and the result is used to verify their actual similarity to the reference data blocks.
Step (5) specifically is: traverse the doubly linked list of the data block group; for each data block that is non-duplicate and has not been marked similar, compute a low-overhead super-fingerprint of the block and look it up in the super-fingerprint index table; if a matching super-fingerprint exists, read the reference data block that it points to, mark the current data block as similar to the reference data block, and delta-compress the two similar blocks; if not, continue traversing the doubly linked list.
With the above technical solution, the invention has the following advantages over the prior art:
1. Through step (4), the invention exploits the locality of the data stream and avoids the cumbersome super-fingerprint computation and lookup of traditional methods. It needs only the doubly linked list information of data blocks that an existing deduplication system already maintains, which simplifies the similar-data search for delta compression and avoids the computation and indexing overhead of traditional similarity detection.
2. Through step (5), the invention computes a low-overhead super-fingerprint for the remaining blocks that are neither duplicate nor similar and uses it for similarity detection. This supplements the similarity search where locality is poor, maximizes the search range for similar data at small cost, and improves storage compression efficiency.
Description of drawings
Fig. 1 is a flow chart of the delta compression method based on data deduplication according to the invention.
Fig. 2 is a schematic block diagram of the delta compression system based on data deduplication according to the invention.
Fig. 3 is a schematic diagram of the duplicate lookup of step (3) and the similar-data search of step (4) of the method.
Fig. 4 is a schematic diagram of the working principle of the system of the invention.
Embodiment
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and an embodiment. It should be understood that the specific embodiment described here only illustrates the invention and does not limit it.
The delta compression method of the invention divides the data stream to be backed up into blocks, groups them, and deduplicates them. It then uses the deduplication information to identify similar data blocks; for cases of poor locality it additionally uses a low-overhead super-fingerprint as a supplement. By combining locality and similarity, the method finds as much similar data as possible, improves delta compression efficiency, and reduces the overhead of similar-data search.
The invention calls a run of consecutive data blocks a locality unit, which is stored contiguously on disk. Locality of the data stream in a storage system means that when data blocks once appear in the sequence A, B, C, then the next time block A appears, blocks B and C are very likely to follow it. The invention exploits this locality to find similar data, as shown in Fig. 3. Consider the data block sequences of two successive backups, B1, B2, B3, B4, B5 and E1, E2, E3, E4, E5. If deduplication determines that B3 and E3 are duplicates and that B4 and E4 are duplicates, then the data blocks next to E3 and E4 are very likely similar data: B1 and E1, B2 and E2, and B5 and E5 are corresponding data blocks with a high probability of being similar. Because B3 and E3, and B4 and E4, are completely identical, file E is related to file B by locality, and some of their data blocks have different fingerprints only because part of the file was modified or had content inserted. By the locality principle described above, the data blocks adjacent to the duplicates merely had some bytes modified or deleted, which produced completely different fingerprints. Whether these potentially similar data blocks are actually similar can be determined by a further delta computation. In our tests and observations, 90% of the non-duplicate data blocks adjacent to duplicate data that were identified by this locality-based approach had a similarity greater than 1/2.
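The pairing in the B/E example above can be sketched in a few lines of Python. This is only an illustration of how neighbours of duplicate blocks become candidate pairs; the function name `locality_candidates` and the index-based representation are assumptions, and the sketch deliberately omits the delta-compression check that the method uses to confirm each candidate.

```python
def locality_candidates(old, new, duplicates):
    """Pair the non-duplicate neighbours of duplicate blocks.

    old, new:    block names of the previous and the current backup, in order
    duplicates:  dict mapping an index in `new` to the index of the identical
                 block in `old` (the result of fingerprint lookup)
    Returns (old_block, new_block) pairs that are *potentially* similar.
    """
    candidates = []
    paired = set(duplicates)                  # indices in `new` already handled
    for new_idx, old_idx in sorted(duplicates.items()):
        for step in (-1, +1):                 # walk backward, then forward
            n, o = new_idx + step, old_idx + step
            while 0 <= n < len(new) and 0 <= o < len(old) and n not in paired:
                candidates.append((old[o], new[n]))
                paired.add(n)
                n += step
                o += step
    return candidates


old_backup = ["B1", "B2", "B3", "B4", "B5"]
new_backup = ["E1", "E2", "E3", "E4", "E5"]
# B3/E3 and B4/E4 were found to be duplicates (0-based indices 2 and 3).
pairs = locality_candidates(old_backup, new_backup, {2: 2, 3: 3})
```

For this input the candidate pairs are B1/E1, B2/E2 and B5/E5, the same pairs named in the text.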
For cases of poor locality, or where no duplicate-block information is available, the invention uses a low-overhead super-fingerprint method as a supplementary search for similar data blocks, so that essentially all potentially similar data can be found. Traditional super-fingerprint methods are probabilistic and may miss data of low similarity. Our method combines locality and similarity, that is, it searches for similar data using both deduplication information and a low-overhead super-fingerprint; it finds more similar data than the traditional super-fingerprint method with less computation and indexing overhead.
As shown in Fig. 1, the delta compression method based on data deduplication according to the invention comprises the following steps:
(1) divide the files in the data stream into blocks to obtain a plurality of data blocks;
The invention is suited to content-defined variable-length chunking; it places no requirement on the chunking algorithm or on the chunk size, and any size from 2 KB to 256 KB is acceptable. This embodiment uses an average chunk size of 8 KB. The invention also works with fixed-length chunking, but performs better with variable-length chunking.
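As a rough illustration of content-defined chunking with the sizes mentioned above, the following Python sketch cuts a block whenever a rolling hash of the recent bytes hits a fixed bit pattern. The patent does not prescribe a chunking algorithm; the gear-style rolling hash, the `GEAR` table, the function name `chunk` and its parameters are all assumptions chosen for brevity.

```python
import random

# Fixed seed so that the table, and therefore the cut points, are stable.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes, avg_bits: int = 13,
          min_size: int = 2 * 1024, max_size: int = 256 * 1024):
    """Split `data` into variable-length blocks.

    A boundary is declared when the low `avg_bits` bits of the rolling hash
    are zero (about once every 2**13 = 8 KB past the minimum size), or when
    the block reaches `max_size`.
    """
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for pos, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF   # depends on recent bytes only
        length = pos - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:pos + 1])
            start, h = pos + 1, 0
    if start < len(data):
        chunks.append(data[start:])                # trailing partial block
    return chunks
```

Because cut points depend on content rather than on byte offsets, an insertion early in a file tends to disturb only nearby boundaries, which is why variable-length chunking is preferred over fixed-length chunking here.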
(2) compute the fingerprint of each data block, to be used for duplicate detection;
The fingerprint can be computed with any secure hash digest algorithm. This embodiment uses SHA-1; hash algorithms with stronger collision resistance, such as SHA-256 or SHA-512, can also be used;
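A fingerprint of this kind can be computed with Python's standard `hashlib` module, as in the short sketch below. The helper name `fingerprint` and its `algorithm` parameter are illustrative assumptions; only the choice of SHA-1, SHA-256 or SHA-512 comes from the text.

```python
import hashlib


def fingerprint(block: bytes, algorithm: str = "sha1") -> bytes:
    """Return the digest of a data block; the digest serves as its fingerprint."""
    return hashlib.new(algorithm, block).digest()


a1 = b"x" * 8192
a2 = b"x" * 8191 + b"y"          # differs from a1 in a single byte
same = fingerprint(a1) == fingerprint(bytes(a1))
different = fingerprint(a1) != fingerprint(a2)
```

The second comparison shows the limitation discussed in the background: blocks that differ by one byte get unrelated fingerprints, so fingerprint matching alone cannot detect similar blocks.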
(3) group all the data blocks to build data block groups and their doubly linked lists, and look up the fingerprint of each data block in a data block group in the deduplication index to determine whether a fingerprint record exists; if a record exists, mark the data block as a duplicate data block; if not, mark it as a non-duplicate data block;
This embodiment uses a group size of 2 MB: consecutive data blocks are gathered into one data block group until the total size of their contents exceeds 2 MB. By the data-stream locality principle described above, the doubly linked list of a data block group records the order of the data stream, and this ordering information is used in step (4) below to exploit stream locality when searching for similar data;
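The grouping can be pictured with the following Python sketch, which links consecutive blocks into a doubly linked list and closes a group once its content reaches the size limit. The class `BlockNode`, the function `build_groups` and the choice to close a group when the total reaches (rather than strictly exceeds) the limit are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class BlockNode:
    """One data block plus its links to the neighbouring blocks in the stream."""
    data: bytes
    prev: Optional["BlockNode"] = field(default=None, repr=False)
    next: Optional["BlockNode"] = field(default=None, repr=False)


def build_groups(blocks, group_size: int = 2 * 1024 * 1024) -> List[BlockNode]:
    """Return the head node of each group's doubly linked list."""
    groups, head, tail, total = [], None, None, 0
    for data in blocks:
        node = BlockNode(data)
        if head is None:
            head = node                       # first block of a new group
        else:
            tail.next, node.prev = node, tail
        tail = node
        total += len(data)
        if total >= group_size:               # group is full: close it
            groups.append(head)
            head, tail, total = None, None, 0
    if head is not None:
        groups.append(head)                   # last, partially filled group
    return groups
```

Keeping `prev` and `next` links preserves the stream order inside a group, which is exactly the information the locality-based search walks in both directions.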
The fingerprint lookup for deduplication used in this embodiment is shown in Fig. 2: all fingerprint values are stored on disk, and part of them are kept in memory. For a fingerprint to be checked, the system first looks for a hit in memory; on a hit, the block is considered a duplicate data block. Otherwise the system searches the on-disk fingerprint index; if the fingerprint is found, the block is considered a duplicate data block, and the fingerprints of the whole data block group containing the matched fingerprint are loaded into memory, which improves the memory hit rate of subsequent index accesses. If the fingerprint is not found, the block is considered a non-duplicate data block;
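A minimal sketch of this two-level lookup is given below. The on-disk index is simulated by an ordinary dictionary, and the class name `FingerprintIndex`, its attributes and the `disk_lookups` counter are hypothetical; cache eviction is left out to keep the example short.

```python
class FingerprintIndex:
    """Two-level fingerprint lookup: in-memory cache first, then the disk index."""

    def __init__(self, disk_index, groups):
        self.disk_index = disk_index   # fingerprint -> group id (stand-in for disk)
        self.groups = groups           # group id -> fingerprints stored in the group
        self.cache = {}                # in-memory cache: fingerprint -> group id
        self.disk_lookups = 0          # counts simulated disk accesses

    def is_duplicate(self, fp) -> bool:
        if fp in self.cache:           # memory hit: duplicate, no disk access
            return True
        self.disk_lookups += 1
        group_id = self.disk_index.get(fp)
        if group_id is None:           # not on disk either: non-duplicate block
            return False
        # Disk hit: prefetch every fingerprint of the same group into memory,
        # because its neighbours are likely to be looked up next.
        for other in self.groups[group_id]:
            self.cache[other] = group_id
        return True


stored = {0: [b"fp-a", b"fp-b", b"fp-c"]}
index = FingerprintIndex({fp: 0 for fp in stored[0]}, stored)
first = index.is_duplicate(b"fp-a")    # goes to the simulated disk index
second = index.is_duplicate(b"fp-b")   # served from the prefetched group
```

After the first lookup the whole group is cached, so the second lookup does not touch the disk index.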
(4) for a data block group that has been deduplicated, use the duplicate information in the group's doubly linked list to search for similar data based on locality, specifically comprising the following substeps:
(4-1) locate the doubly linked list of the data block group and set a counter i=1;
(4-2) determine whether the i-th data block in the doubly linked list is a duplicate data block; if so, fetch the doubly linked list of the data block group that contains the corresponding duplicate data block C_d, set j=i, and go to step (4-3); otherwise set i=i+1 and go to step (4-7); here the duplicate data block C_d is the reference data block of the i-th (duplicate) data block;
(4-3) determine whether the (i-1)-th data block is null, or is already a duplicate or similar data block; if none of these holds, go to step (4-4); otherwise go to step (4-5);
(4-4) for the (i-1)-th data block, read the corresponding data block from the doubly linked list of the data block group containing C_d and use it as the reference data block for delta compression; delta-compress the (i-1)-th data block against the reference data block and check whether the compression efficiency is less than 1/2; if it is less than 1/2, the (i-1)-th data block is not considered a similar data block, and the method goes to step (4-5); if it is greater than or equal to 1/2, mark the (i-1)-th data block as similar to the reference data block, set i=i-1, and return to step (4-3);
(4-5) determine whether the (j+1)-th data block is null, or is already a duplicate or similar data block; if none of these holds, go to step (4-6); otherwise set i=j+1 and go to step (4-7);
(4-6) for the (j+1)-th data block, read the corresponding data block from the doubly linked list of the data block group containing C_d and use it as the reference data block for delta compression; delta-compress the (j+1)-th data block against the reference data block and check whether the compression efficiency is less than 1/2; if it is less than 1/2, the (j+1)-th data block is not considered a similar data block, i is set to j+1, and the method goes to step (4-7); if it is greater than or equal to 1/2, mark the (j+1)-th data block as similar to the reference data block, set j=j+1, and return to step (4-5);
(4-7) determine whether the i-th data block is the last data block in the doubly linked list; if so, the process ends; otherwise return to step (4-2).
This embodiment uses the open-source Xdelta software from the University of California, Berkeley to compute the delta of similar data; other delta compression algorithms for similar data or files are equally applicable. The compression-efficiency threshold used for the similarity decision can also be set by the user and is not limited to 1/2. The corresponding data blocks referred to in steps (4-4) and (4-6) of this example are as illustrated in Fig. 3 above: once duplicate data blocks are identified, the correspondence between the duplicate data blocks is established, and from it the correspondence between the non-duplicate data blocks.
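Substeps (4-1) to (4-7) can be sketched in Python as follows. Several things here are assumptions rather than the patented implementation: groups are plain lists instead of doubly linked lists, indices are 0-based, compression efficiency is taken to mean the fraction of bytes saved, and Xdelta is replaced by a stand-in that compresses the block with `zlib` using the reference block as a preset dictionary. The names `compression_efficiency` and `locality_search` are hypothetical.

```python
import zlib


def compression_efficiency(reference: bytes, block: bytes) -> float:
    """Fraction of `block` saved when it is encoded against `reference`.

    Stand-in for a real delta encoder: deflate with the reference block
    supplied as a preset dictionary.
    """
    if not block:
        return 0.0
    comp = zlib.compressobj(zdict=reference)
    delta = comp.compress(block) + comp.flush()
    return 1.0 - len(delta) / len(block)


def locality_search(blocks, status, dup_ref, threshold=0.5):
    """Mark non-duplicate neighbours of duplicate blocks as similar.

    blocks:  the current group, in stream order
    status:  list of "dup" / "new"; verified blocks are changed to "similar"
    dup_ref: index of a duplicate block -> (reference_group, position of C_d)
    Returns index -> reference block for every block marked similar.
    """
    similar_ref = {}
    n = len(blocks)

    def try_mark(k, ref_group, ref_k):
        # Stop at the list ends and at blocks already duplicate or similar.
        if not (0 <= k < n) or status[k] != "new":
            return False
        if not (0 <= ref_k < len(ref_group)):
            return False
        if compression_efficiency(ref_group[ref_k], blocks[k]) < threshold:
            return False
        status[k] = "similar"
        similar_ref[k] = ref_group[ref_k]
        return True

    i = 0
    while i < n:                                   # (4-7) end-of-list test
        if status[i] != "dup":                     # (4-2) not a duplicate
            i += 1
            continue
        ref_group, ref_pos = dup_ref[i]
        back = i - 1                               # (4-3)/(4-4) walk backward
        while try_mark(back, ref_group, ref_pos - (i - back)):
            back -= 1
        fwd = i + 1                                # (4-5)/(4-6) walk forward
        while try_mark(fwd, ref_group, ref_pos + (fwd - i)):
            fwd += 1
        i = fwd                                    # resume after the forward walk
    return similar_ref
```

Each walk stops at the first neighbour that fails the threshold, so the number of delta computations stays proportional to the number of blocks near duplicates. The `threshold` argument reflects the remark that the value 1/2 is configurable.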
(5) perform a supplementary similarity check on the data block group. Specifically, traverse the doubly linked list of the data block group; for each data block that is non-duplicate and has not been marked similar, compute a low-overhead super-fingerprint of the block and look it up in the super-fingerprint index table; if a matching super-fingerprint exists, read the reference data block that it points to, mark the current data block as similar to the reference data block, and delta-compress the two similar blocks; if not, continue traversing the doubly linked list.
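The traversal of step (5) could look like the Python sketch below. The super-fingerprint function is passed in as a parameter so that the sketch stays independent of how the super-fingerprint is computed; the function name `supplement_search`, the status strings and the dictionary used as the index table are assumptions, and the delta compression of each matched pair is left to the caller.

```python
def supplement_search(blocks, status, sf_index, super_fingerprint):
    """Similarity check for blocks that are neither duplicate nor similar.

    blocks:            the group, in stream order
    status:            list of "dup" / "similar" / "new", updated in place
    sf_index:          super-fingerprint -> reference block (the index table)
    super_fingerprint: callable computing the super-fingerprint of a block
    Returns index -> reference block for every newly matched block.
    """
    matches = {}
    for k, block in enumerate(blocks):
        if status[k] != "new":
            continue                       # only leftover blocks are examined
        reference = sf_index.get(super_fingerprint(block))
        if reference is not None:
            status[k] = "similar"
            matches[k] = reference         # caller delta-compresses this pair
    return matches


# Toy super-fingerprint for demonstration only: the first four bytes.
toy_sf = lambda block: block[:4]
table = {b"head": b"head-of-an-older-block"}
state = ["dup", "new", "new"]
found = supplement_search([b"dup!", b"head-new", b"other"], state, table, toy_sf)
```

In this toy run only the second block matches an entry of the index table, so it alone is marked similar.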
In this embodiment we use a low-overhead super-fingerprint algorithm: two Rabin-fingerprint-based values serve as the similarity features (Feature), and the features are then combined to obtain the super-fingerprint of the data block, as shown in the formulas:
$$\mathrm{Feature}_i = \max_{j=1}^{N} \left\{ \left( m_i \cdot \mathrm{Rabin}(W_j) + a_i \right) \bmod 2^{32} \right\} \tag{3}$$
$$\mathrm{SuperFingerprint} = \mathrm{Rabin}(\mathrm{Feature}_1, \mathrm{Feature}_2) \tag{4}$$
Here Feature_i is the i-th similarity feature of a data block (of length N), SuperFingerprint is the super-fingerprint, Rabin(W_j) is the Rabin fingerprint of sliding window W_j, and m_i and a_i are the i-th pair of predetermined random numbers. Since the data block has length N, it has N sliding windows, and a similarity feature is the maximum of the re-hashed Rabin fingerprints of these N windows. Each pair of predetermined values m_i and a_i yields the i-th similarity feature of the data block.
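Formulas (3) and (4) can be illustrated with the Python sketch below. It is a simplification under stated assumptions: a polynomial rolling hash modulo 2^32 stands in for a true Rabin fingerprint, the window width of 48 bytes is arbitrary, the two (m_i, a_i) pairs in `PARAMS` are arbitrary fixed constants, and a block of length L yields L - 47 windows rather than exactly N. All names are hypothetical.

```python
WINDOW = 48                 # sliding-window width in bytes (assumed)
BASE = 257                  # base of the polynomial rolling hash (assumed)
MOD = 1 << 32
# Two predetermined (m_i, a_i) pairs, one per similarity feature.
PARAMS = [(0x9E3779B1, 0x7F4A7C15), (0x85EBCA6B, 0xC2B2AE35)]


def poly_hash(data: bytes) -> int:
    """Polynomial hash modulo 2**32, used here in place of a Rabin fingerprint."""
    h = 0
    for b in data:
        h = (h * BASE + b) % MOD
    return h


def window_hashes(block: bytes):
    """Yield the hash of every sliding window, updated incrementally."""
    w = min(WINDOW, len(block))
    h = poly_hash(block[:w])
    yield h
    top = pow(BASE, max(w - 1, 0), MOD)       # weight of the byte leaving the window
    for k in range(w, len(block)):
        h = ((h - block[k - w] * top) * BASE + block[k]) % MOD
        yield h


def features(block: bytes):
    """Formula (3): Feature_i = max_j (m_i * hash(W_j) + a_i) mod 2**32."""
    hashes = list(window_hashes(block))
    return [max((m * h + a) % MOD for h in hashes) for m, a in PARAMS]


def super_fingerprint(block: bytes) -> int:
    """Formula (4): hash the two features together into one value."""
    packed = b"".join(f.to_bytes(4, "big") for f in features(block))
    return poly_hash(packed)
```

Taking the maximum over all windows makes a feature insensitive to most small edits, because an edit changes the feature only if it touches the window that holds the maximum or creates a larger value. Matching is therefore probabilistic: similar blocks usually, but not always, share a super-fingerprint.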
Traditional super-fingerprint algorithms use several (m_i, a_i) pairs to produce four or more similarity features per super-fingerprint, and use three or more super-fingerprints to improve the effectiveness of the similarity search: two data blocks are deemed similar as soon as one super-fingerprint matches. We use fewer features than the traditional method and only one super-fingerprint, so the computation overhead is low and the super-fingerprint index is small. Variations in the number of Rabin-fingerprint features and in the number of super-fingerprints are all applicable to the invention; the invention recommends using as few similarity features and super-fingerprints as possible as the basis for similarity matching, in order to reduce the computation and indexing overhead of super-fingerprints. This is possible because the locality-based similar-data search of step (4) has already found most of the similar data, and step (5) serves only as a supplementary strategy, so a low-overhead super-fingerprint computation suffices.
The invention finds most of the similar data by the locality method, and then uses the similarity method to find the small fraction of similar data that was missed (as shown in Fig. 4). It has the advantages of low computation overhead, low indexing overhead and a large amount of similar data found;
(6) repeat steps (4) and (5) until all data block groups formed in step (3) have been processed.
As shown in Fig. 2, the delta compression system based on the data deduplication method of the invention comprises two functional modules, a deduplication module and a delta compression module. The delta compression module is further divided into a locality-based similar-data search module and a similarity-based similar-data search module. The locality-based search can also be called deduplication-based similar-data search, and the similarity-based search can also be called similar-data search based on the low-overhead super-fingerprint.
Among the in-memory data structures, the deduplication module comprises the duplicate-data index and the locality cache, and the delta compression module comprises the super-fingerprint index and the locality cache. The locality cache holds the most recently accessed data block groups; each group consists of the metadata of multiple data blocks, and the key metadata of each block comprises its fingerprint and its super-fingerprint. The fingerprint is used for duplicate lookup and the super-fingerprint for similar-data search.
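One possible shape for these in-memory structures is sketched below in Python. The text only says that the cache holds the most recently accessed groups; the least-recently-used eviction policy, the class names `BlockMeta` and `LocalityCache`, and the `capacity` parameter are assumptions.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlockMeta:
    """Key metadata kept per data block."""
    fingerprint: bytes                       # used for duplicate lookup
    super_fingerprint: Optional[int] = None  # used for similar-data search


class LocalityCache:
    """Keeps the metadata of the most recently accessed data block groups."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.groups = OrderedDict()          # group id -> list of BlockMeta

    def put(self, group_id, metas: List[BlockMeta]) -> None:
        self.groups[group_id] = metas
        self.groups.move_to_end(group_id)    # newest entry goes last
        while len(self.groups) > self.capacity:
            self.groups.popitem(last=False)  # evict the least recently used group

    def get(self, group_id):
        metas = self.groups.get(group_id)
        if metas is not None:
            self.groups.move_to_end(group_id)  # refresh recency on access
        return metas
```

Caching whole groups rather than single fingerprints keeps neighbouring blocks together in memory, which serves both the duplicate lookup and the locality-based similarity search.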
The deduplication module is mainly responsible for chunking the data stream in the storage system, computing data block fingerprints, and looking up and matching fingerprints. It uses a content-defined chunking algorithm, which avoids the boundary-shift problem that data insertions or modifications would otherwise cause. Fingerprints can be computed with any secure hash digest algorithm; this embodiment uses SHA-1, and other secure hash algorithms with stronger collision resistance can also be used. When the data scale is small, all fingerprint information can be kept in the in-memory deduplication index for lookup. When the data scale is large, all fingerprints are stored on disk, and the recently accessed fingerprint sets are loaded into the in-memory locality cache. This caches the locality of the data stream in memory, improves the hit rate of in-memory index accesses, and also helps to exploit locality further for similarity detection and delta compression.
The delta compression module is mainly responsible for finding and compressing similar data after deduplication. It detects similar data in two stages: a locality-based stage that exploits the existing deduplication information, and a supplementary stage that uses the low-overhead super-fingerprint to exploit the similarity of the data stream. For the delta compression of two similar data blocks, we use the open-source Xdelta software from the University of California, Berkeley. Once delta compression is complete, the full content of the similar data block need not be stored; only the delta data and the location of the reference data block are stored, which reduces storage space.
We now illustrate the invention with a practical example. As shown in Fig. 4, part of the content of an incoming backup data stream has been modified or had data inserted relative to the previous backup; modifications are drawn with a checkered pattern and insertions with diagonal hatching. First, the data stream is chunked and fingerprinted, as in steps (1) and (2) above. Next, duplicates are found by fingerprint lookup, as in step (3): data blocks 1, 4 and 10 are identified as duplicate data blocks. Then the deduplication-based similar-data search is applied to the non-duplicate data blocks, as in step (4): data blocks 2, 3 and 9 are identified as similar data blocks. Finally, the similar-data search based on the low-overhead super-fingerprint is applied to the remaining blocks that are neither duplicate nor similar, as in step (5): data blocks 6 and 7 are further identified as similar data blocks. The remaining blocks that are neither duplicate nor similar are marked "N". In this way, similar data is found to the greatest extent by exploiting both the locality and the similarity of the backup data stream. In particular, the locality-based search strategy finds similar data blocks without computing any similarity features, which reduces the similar-data search overhead of delta compression.
Those skilled in the art will readily appreciate that the above is only a preferred embodiment of the invention and does not limit it; any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (5)

1. A differential compression method based on data de-duplication, characterized in that it comprises the following steps:
(1) dividing the files in a data stream into blocks to obtain a plurality of data blocks;
(2) computing a data block fingerprint for each data block, for use in duplicate data lookup;
(3) grouping all data blocks to establish data block groups and their doubly linked lists, and performing a de-duplication fingerprint lookup for each data block in a data block group to determine whether a fingerprint record exists; if a fingerprint record exists, marking the data block as a duplicate data block; if not, marking the data block as a non-duplicate data block;
(4) for a data block group that has undergone de-duplication, performing a locality-based similar data search using the duplicate data information in the doubly linked list of the data block group, specifically comprising the following sub-steps:
(4-1) locating the doubly linked list to which the data block group belongs, and setting a counter i=1;
(4-2) determining whether the i-th data block in the doubly linked list is a duplicate data block; if so, retrieving the doubly linked list of the data block group containing the duplicate data block C_d that corresponds to this duplicate data block, setting j=i, and proceeding to step (4-3); otherwise, setting i=i+1 and proceeding to step (4-7); here, the duplicate data block C_d is the reference data block of the duplicate data block;
(4-3) determining whether the (i-1)-th data block is null, or is a duplicate or similar data block; if none of these holds, proceeding to step (4-4); otherwise, proceeding to step (4-5);
(4-4) for the (i-1)-th data block, reading the corresponding data block from the doubly linked list of the data block group containing the duplicate data block C_d, and using that data block as the reference data block for differential compression; performing differential compression of the (i-1)-th data block against the reference data block and determining whether the compression efficiency is less than 1/2; if the compression efficiency is less than 1/2, considering the (i-1)-th data block not to be a similar data block, and proceeding to step (4-5); if the compression efficiency is greater than or equal to 1/2, marking the (i-1)-th data block as similar to the reference data block, then setting i=i-1 and returning to step (4-3);
(4-5) determining whether the (j+1)-th data block is null, or is a duplicate or similar data block; if none of these holds, proceeding to step (4-6); otherwise, setting i=j+1 and proceeding to step (4-7);
(4-6) for the (j+1)-th data block, reading the corresponding data block from the doubly linked list of the data block group containing the duplicate data block C_d, and using that data block as the reference data block for differential compression; performing differential compression of the (j+1)-th data block against the reference data block and determining whether the compression efficiency is less than 1/2; if the compression efficiency is less than 1/2, considering the (j+1)-th data block not to be a similar data block, then setting i=j+1 and proceeding to step (4-7); if the compression efficiency is greater than or equal to 1/2, marking the (j+1)-th data block as similar to the reference data block, then setting j=j+1 and returning to step (4-5);
(4-7) determining whether the i-th data block is the last data block in the doubly linked list; if so, ending the process; otherwise, returning to step (4-2);
(5) performing a supplementary similarity determination on the data block group;
(6) repeating steps (4) and (5) until all data block groups divided in step (3) have been processed.
2. The differential compression method according to claim 1, characterized in that, in step (2), the data block fingerprint of each data block is computed using the SHA-1, SHA-256 or SHA-512 algorithm.
3. The differential compression method according to claim 1, characterized in that the size of a data block group is 2 MB.
4. The differential compression method according to claim 1, characterized in that step (4) is specifically: for a data block group that has undergone de-duplication, performing a locality-based similar data search using the duplicate data information in the doubly linked list of the data block group; locating the non-duplicate data blocks adjacent to duplicate data blocks and treating them as potential similar data blocks; performing differential compression of these data blocks against their corresponding reference data blocks; and thereby verifying the true similarity between these data blocks and the reference data blocks.
5. The differential compression method according to claim 1, characterized in that step (5) is specifically: traversing the doubly linked list of the data block group; for each data block that is neither a duplicate nor a similar data block, computing a low-overhead super fingerprint of the data block and looking it up in a super fingerprint index table to determine whether a matching super fingerprint exists; if one exists, reading the reference data block indicated by the matching super fingerprint, marking the current data block as similar to the reference data block, and performing differential compression on the two similar data blocks; if none exists, continuing to traverse the doubly linked list.
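As an illustration of the supplementary search recited in claim 5, the following sketch computes a super fingerprint and looks it up in an index table. The claims in this section do not specify how the low-overhead super fingerprint is computed, so the construction here is an assumption: each feature is the maximum of a linearly transformed hash over all sliding windows, and the features are hashed together into one value. The window width, the transform constants, and the names `super_fingerprint` and `SuperFingerprintIndex` are hypothetical.

```python
import hashlib
from typing import Dict, Optional

WINDOW = 16  # sliding-window width in bytes (assumed)
MASK = (1 << 64) - 1
# (multiplier, addend) pairs for the linear transforms; arbitrary constants (assumed)
TRANSFORMS = [
    (0x9E3779B97F4A7C15, 0x1234567),
    (0xC2B2AE3D27D4EB4F, 0x89ABCDE),
    (0x165667B19E3779F9, 0xF00DBEE),
]


def super_fingerprint(block: bytes) -> bytes:
    """Hash several max-of-window features into a single super fingerprint."""
    features = [0] * len(TRANSFORMS)
    for off in range(max(1, len(block) - WINDOW + 1)):
        digest = hashlib.blake2b(block[off:off + WINDOW], digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for k, (m, a) in enumerate(TRANSFORMS):
            features[k] = max(features[k], (m * h + a) & MASK)
    return hashlib.sha1(b"".join(f.to_bytes(8, "big") for f in features)).digest()


class SuperFingerprintIndex:
    """Maps super fingerprints to the id of a candidate reference block."""

    def __init__(self) -> None:
        self._table: Dict[bytes, int] = {}

    def lookup(self, block: bytes) -> Optional[int]:
        """Return the reference block id with a matching super fingerprint, if any."""
        return self._table.get(super_fingerprint(block))

    def add(self, block_id: int, block: bytes) -> None:
        """Register a block so that later blocks can be matched against it."""
        self._table.setdefault(super_fingerprint(block), block_id)


index = SuperFingerprintIndex()
stored_block = bytes(range(256)) * 4
other_block = bytes(reversed(range(256))) * 4
first_lookup = index.lookup(stored_block)  # nothing registered yet
index.add(7, stored_block)
```

With features of this kind, a block that differs from a stored block by a small edit will usually, though not always, keep the same maxima and therefore hit the same index entry. A hit would then be followed by the differential compression step described in the claim.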
CN201210303650.4A 2012-08-24 2012-08-24 Differential compression method based on data de-duplication Active CN102831222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210303650.4A CN102831222B (en) 2012-08-24 2012-08-24 Differential compression method based on data de-duplication

Publications (2)

Publication Number Publication Date
CN102831222A true CN102831222A (en) 2012-12-19
CN102831222B CN102831222B (en) 2014-12-31

Family

ID=47334357

Country Status (1)

Country Link
CN (1) CN102831222B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214210A (en) * 2011-05-16 2011-10-12 成都市华为赛门铁克科技有限公司 Method, device and system for processing repeating data
CN102222085A (en) * 2011-05-17 2011-10-19 华中科技大学 Data de-duplication method based on combination of similarity and locality

Also Published As

Publication number Publication date
CN102831222B (en) 2014-12-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant