CN102323958A - Data de-duplication method - Google Patents

Data de-duplication method

Info

Publication number
CN102323958A
CN102323958A (application CN201110330421A)
Authority
CN
China
Prior art keywords
data
file
duplication
cryptographic hash
duplication method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110330421A
Other languages
Chinese (zh)
Inventor
安然
谈川玉
卢宝丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WENGUANG INTERDYANMIC TV CO Ltd SHANGHAI
Original Assignee
WENGUANG INTERDYANMIC TV CO Ltd SHANGHAI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WENGUANG INTERDYANMIC TV CO Ltd SHANGHAI filed Critical WENGUANG INTERDYANMIC TV CO Ltd SHANGHAI
Priority to CN201110330421A priority Critical patent/CN102323958A/en
Publication of CN102323958A publication Critical patent/CN102323958A/en
Pending legal-status Critical Current


Abstract

The invention discloses a data de-duplication method, which comprises the following steps: writing a file, performing variable-length chunking on the file to divide it into a plurality of data blocks of different lengths, and calculating hash values of the data blocks; sampling the hash values to form the sampling data of the file; comparing the sampling data of the file with the sampling data of existing files to locate a similarity group for the file; comparing the hash values of the file with the hash values of the similarity group in a metadata database to determine the duplicate data blocks; de-duplicating, and storing the non-duplicate data blocks; and generating a meta file and saving the hash values of the non-duplicate data blocks into the metadata database. With the data de-duplication method, the system resources occupied by the de-duplication operation can be adjusted dynamically, the performance of online services is guaranteed preferentially, and the influence on online services is minimized. The data de-duplication method is characterized by high reliability, good stability and a high de-duplication ratio.

Description

Data de-duplication method
Technical field
The present invention relates to a data deletion method, and in particular to a data de-duplication method.
Background technology
Data de-duplication (De-duplication) is a data reduction technology intended to reduce the storage capacity used in a storage system. By deleting the repeated data in the storage system and keeping only one copy, it eliminates redundant data. Data de-duplication technology can greatly reduce the consumption of physical storage capacity.
According to when the data are processed, data de-duplication technology can be divided into an in-line processing mode (In-Line) and a post-processing mode (Post-Process).
The in-line data de-duplication method performs de-duplication before the data are written to disk. In-line de-duplication reduces the data volume to a certain extent, but it also has a problem: the de-duplication itself lowers the data throughput and degrades service performance. In addition, because de-duplication is performed before the data are written to disk, the de-duplication process itself is a single point of failure.
The post-processing data de-duplication method performs de-duplication after the data have been written to disk. The data are first written to a temporary disk space, de-duplication is started afterwards, and the de-duplicated data are finally written to disk. Because de-duplication is carried out on a separate storage device after the data have been written to disk, it generally has almost no influence on normal service processing. However, the existing post-processing mode cannot dynamically adjust its occupation of system resources and has no ability to preferentially guarantee online service performance, so when system utilization is too high it can still affect the online service system.
According to the de-duplication granularity, data de-duplication technology can be divided into file level, file-block level and byte level.
File-level data de-duplication detects and deletes duplicate data with the file as the unit. The advantage of this mode is a simple algorithm and fast computation; the disadvantage is a low de-duplication ratio.
File-block-level data de-duplication divides a file into data blocks in various ways and detects duplicates with the data block as the unit. The advantages of this method are fast computation and sensitivity to data changes.
According to the chunking mode, block-level de-duplication is further divided into fixed-length chunking and variable-length chunking.
Referring to Fig. 3, fixed-length chunking divides a file into blocks of fixed length. This method is very sensitive to insertions and deletions of data; in practical applications the detected data duplication is very low and the de-duplication effect is very limited.
Byte-level data de-duplication searches for and deletes repeated content at the byte level, generally producing the differing parts through a differential compression strategy. The advantage of byte-level de-duplication is a higher de-duplication ratio; the disadvantage is a slower de-duplication speed.
In addition, traditional data de-duplication methods provide the data service through a single physical device; when de-duplication is performed this device forms a single point of failure, which challenges system reliability.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art and to provide a data de-duplication method that can dynamically adjust the system resources occupied by the de-duplication operation, preferentially guarantees online service performance, minimizes the influence on online services, and is characterized by high reliability, good stability, a high de-duplication ratio and excellent performance.
The technical solution that achieves the above object is as follows:
A data de-duplication method of the present invention comprises:
writing a file, performing variable-length chunking on said file to form a plurality of data blocks of different lengths, and calculating the hash values of said data blocks;
sampling said hash values to form the sampling data of said file;
comparing the sampling data of said file with the sampling data of existing files to locate a similarity group of said file;
comparing the hash values of said file with the hash values of said similarity group in a metadata database to determine the duplicate data blocks;
de-duplicating and saving the non-duplicate data blocks;
generating a meta file, and saving the hash values of said non-duplicate data blocks into said metadata database.
The variable-length chunking adopts a sliding-window technique that cuts the data according to the file content. This technique is insensitive to changes in the file content: inserting or deleting data affects only a few data blocks, and the remaining data blocks are unaffected.
When calculating the hash value of a data block, the hash value of the window contents after the sliding window moves is computed from the hash value before the move, the byte value that slides into the window and the byte value that slides out, which improves the efficiency of the de-duplication operation.
When calculating the hash values of the data blocks, a minimum data block size is set, and hash calculation is not performed on the data within the minimum-size interval at the head of a data block, which reduces the computation cost and improves the efficiency of the de-duplication operation.
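The following Python sketch illustrates one way the variable-length chunking, rolling-hash update and minimum-block-size skip described above could be implemented; the window size, divisor, hash base and block-size limits are illustrative assumptions, not values specified by the invention.

```python
# Minimal sketch of content-defined (variable-length) chunking with a rolling
# hash and a skipped minimum-size interval. All constants are assumptions.
WINDOW = 48            # sliding-window width in bytes
MIN_BLOCK = 2 * 1024   # hash calculation is skipped over the block head
MAX_BLOCK = 64 * 1024
DIVISOR, MAGIC = 4096, 13
BASE, MOD = 257, 1 << 32
POW = pow(BASE, WINDOW - 1, MOD)   # weight of the byte that slides out

def chunk(data: bytes):
    """Yield (offset, length) of the variable-length blocks of `data`."""
    n, start = len(data), 0
    while start < n:
        h = 0
        # Start hashing only near the end of the minimum-size interval,
        # so the head of every block is skipped, as the method describes.
        pos = start + max(MIN_BLOCK - WINDOW, 0)
        end = min(start + MAX_BLOCK, n)
        cut = end
        for i in range(pos, end):
            if i - pos >= WINDOW:                       # window already full:
                h = (h - data[i - WINDOW] * POW) % MOD  # drop the byte sliding out
            h = (h * BASE + data[i]) % MOD              # add the byte sliding in
            if i - start + 1 >= MIN_BLOCK and h % DIVISOR == MAGIC:
                cut = i + 1                             # content-defined boundary
                break
        yield start, cut - start
        start = cut
```

Because a boundary depends only on the bytes inside the window, inserting or deleting data shifts at most the blocks around the edit, which is exactly the insensitivity to content changes described above.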
When comparing the sampling data of the file with the sampling data of existing files, if the similarity between the sampling data of the file and the sampling data of a current existing file exceeds a certain value, the data set corresponding to the sampling data of that existing file is determined to be a similarity group of the file.
The data blocks are saved grouped by similarity group.
The meta file is a data description of the original file; it contains the various file attributes of the original file and records the storage location of each data block of the original file.
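As a minimal sketch, assuming a simple in-memory representation, the meta file and the per-group metadata described above could be modeled as follows; the field names and structures are illustrative assumptions, not the patent's data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BlockRef:
    digest: str       # hash value of the data block
    store_path: str   # where the single stored copy of the block lives
    offset: int       # position of the block within the original file
    length: int       # block length in bytes

@dataclass
class MetaFile:
    """Data description of the original file: attributes plus block locations."""
    name: str
    size: int
    blocks: List[BlockRef] = field(default_factory=list)

@dataclass
class SimilarityGroup:
    """Per-group metadata: the similarity index (sampled hashes) and the
    hash values of all blocks already stored for this group."""
    sample_index: List[str] = field(default_factory=list)
    block_hashes: Dict[str, BlockRef] = field(default_factory=dict)
```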
When the system receives a read/write request, the method further comprises the following steps:
judging whether the target file is a file that has undergone the de-duplication operation;
if the target file has not undergone the de-duplication operation, reading or writing the target file directly;
if the target file has undergone the de-duplication operation, parsing the meta file of the target file to locate the target data blocks of the read/write request;
completing the read/write operation.
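A minimal sketch of this read path follows, reusing the MetaFile and BlockRef structures sketched above; load_meta is a hypothetical helper and the range arithmetic is only illustrative.

```python
def load_meta(path: str):
    """Hypothetical helper: return the MetaFile for `path`, or None when the
    target file has never been de-duplicated. Lookup is implementation-specific."""
    return None

def read_range(path: str, offset: int, size: int) -> bytes:
    meta = load_meta(path)
    if meta is None:
        # Target file has not been de-duplicated: read it directly.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(size)
    # De-duplicated file: resolve only the blocks the request actually touches.
    out, end = bytearray(), offset + size
    for ref in meta.blocks:
        if ref.offset + ref.length <= offset or ref.offset >= end:
            continue                       # block lies outside the requested range
        with open(ref.store_path, "rb") as f:
            block = f.read(ref.length)
        lo = max(offset - ref.offset, 0)
        hi = min(end - ref.offset, ref.length)
        out += block[lo:hi]
    return bytes(out)
```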
The period of the above de-duplication step is adjustable.
By adopting the above technical solution, the present invention has the following beneficial effects:
The data de-duplication in the present invention is a policy-based post-processing de-duplication technique. The user can define the period of the de-duplication operation and thereby control when it is started. The de-duplication operation runs in the background of the system and is completely transparent to the business; its occupation of system resources can be adjusted dynamically, online service performance is guaranteed preferentially, and the influence on online services is minimized. The sliding-window technique is insensitive to changes in the file content: inserting or deleting data affects only a few data blocks, and the remaining data blocks are unaffected. When a de-duplicated file is read or written, it is not necessary to parse all the data blocks of the file; only the data blocks affected by the operation need to be located, so the data manipulation is confined to a small range. These measures guarantee, to the greatest extent, the business read/write performance of a system with the de-duplication function enabled. The invention thus realizes a data de-duplication method that can dynamically adjust the system resources occupied by the de-duplication operation, preferentially guarantees online service performance, minimizes the influence on online services, and is characterized by high reliability, good stability, a high de-duplication ratio and excellent performance.
Description of drawings
Fig. 1 is a flow chart of the data de-duplication method of the present invention;
Fig. 2 is a schematic diagram of the variable-length chunking technique of the data de-duplication method of the present invention;
Fig. 3 is a schematic diagram of the fixed-length chunking technique of the prior art;
Fig. 4 is a schematic diagram of the similarity detection technique of the prior art.
Embodiment
The present invention is described further below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, the data de-duplication method of the present invention comprises the following steps:
First, a file is written; variable-length chunking is performed on the file to form a plurality of data blocks of different lengths, and the hash values of the data blocks are calculated;
the hash values are sampled to form the sampling data of the file;
the sampling data of the file are compared with the sampling data of existing files to locate a similarity group of the file;
if the similarity between the sampling data of the file and the sampling data of a current existing file exceeds a certain value, the data set corresponding to the sampling data of that existing file is determined to be a similarity group of the file;
the hash values of the file are compared with the hash values of the similarity group in the metadata database to determine the duplicate data blocks;
de-duplication is performed and the non-duplicate data blocks are saved;
a meta file is generated, and the hash values of the non-duplicate data blocks are saved into the metadata database.
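The following Python sketch ties these steps together, assuming the chunk function sketched earlier and the MetaFile, BlockRef and SimilarityGroup structures sketched above; find_similarity_group is sketched further below, and the SHA-1 block hash, the sampling interval and the on-disk layout are illustrative assumptions.

```python
import hashlib

SAMPLE_EVERY = 8   # keep every 8th block hash as the file's sampling data (assumption)

def dedup_write(name: str, data: bytes, groups: list, store_dir: str) -> MetaFile:
    blocks = list(chunk(data))                                    # (offset, length) pairs
    hashes = [hashlib.sha1(data[o:o + ln]).hexdigest() for o, ln in blocks]
    sample = hashes[::SAMPLE_EVERY]                               # sampling data of the file

    group = find_similarity_group(sample, groups)                 # locate the similarity group
    meta = MetaFile(name=name, size=len(data))                    # meta file for the original
    for (off, ln), h in zip(blocks, hashes):
        ref = group.block_hashes.get(h)
        if ref is None:                                           # non-duplicate block
            path = f"{store_dir}/{h}"
            with open(path, "wb") as f:                           # save the single copy
                f.write(data[off:off + ln])
            ref = BlockRef(digest=h, store_path=path, offset=off, length=ln)
            group.block_hashes[h] = ref                           # refresh group metadata
        meta.blocks.append(BlockRef(h, ref.store_path, off, ln))  # record block location
    group.sample_index.extend(sample)
    return meta
```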
The meta file is a data description of the original file; it contains the various file attributes of the original file and records the storage location of each data block of the original file.
The data blocks are saved grouped by similarity group.
When the system receives a read/write request, the method further comprises the following steps:
judging whether the target file is a file that has undergone the de-duplication operation;
if the target file has not undergone the de-duplication operation, reading or writing the target file directly;
if the target file has undergone the de-duplication operation, parsing the meta file of the target file to locate the target data blocks of the read/write request;
completing the read/write operation.
The period of the above de-duplication step is adjustable.
Referring to Fig. 2, the variable-length chunking adopts a sliding-window technique that cuts the data according to the file content. This technique is insensitive to changes in the file content: inserting or deleting data affects only a few data blocks, and the remaining data blocks are unaffected.
When calculating the hash value of a data block, the hash value of the window contents after the sliding window moves is computed from the hash value before the move, the byte value that slides into the window and the byte value that slides out, which improves the efficiency of the de-duplication operation.
A two-threshold two-divisor algorithm (TTTD) is adopted to further optimize the performance of the de-duplication operation. When calculating the hash values of the data blocks, a minimum data block size is set, and hash calculation is not performed on the data within the minimum-size interval at the head of a data block, which reduces the computation cost and improves the efficiency of the de-duplication operation.
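A minimal sketch of the two-threshold two-divisor (TTTD) cut decision follows; the thresholds, divisors and target remainder are illustrative assumptions, and the rolling hash values are assumed to come from a window hash such as the one sketched earlier.

```python
T_MIN, T_MAX = 2 * 1024, 64 * 1024   # the two size thresholds (assumptions)
D, D2 = 4096, 1024                   # main divisor and smaller backup divisor
MAGIC = 13

def tttd_cut(rolling_hashes: list) -> int:
    """Return the block length chosen by the TTTD rule, given the rolling hash
    value at every byte position of the data being chunked."""
    backup = None
    for length, h in enumerate(rolling_hashes, start=1):
        if length < T_MIN:
            continue                          # skip the minimum-size interval
        if h % D == MAGIC:
            return length                     # main divisor hit: cut here
        if h % D2 == MAGIC:
            backup = length                   # remember a fallback cut point
        if length >= T_MAX:
            return backup or T_MAX            # forced cut: prefer the fallback
    return backup if backup is not None else len(rolling_hashes)
```

The smaller backup divisor fires more often than the main one, so a forced cut at T_MAX can usually fall back to a content-defined position, which keeps block boundaries stable across edits.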
The method samples the hash (HASH) values of the file's data blocks with a certain algorithm and compares these sampled values with the hash values of the existing data blocks in the current system to determine the similarity of files. According to the similarity of files, the de-duplicated files can be divided into different similarity groups. Within each similarity group, the sampled hashes of the files constitute the similarity index of the group. The block hashes of all files in the same similarity group are saved in the metadata database of that group, for hash comparison when a new file is written.
When a file is to be de-duplicated, the sampled hash values of the file are first compared with the similarity index of each similarity group. If the similarity between the file and a certain similarity group exceeds a certain value, the file is determined to belong to that group. The hash value of each block of the file is then compared with the hash values in the metadata database of that group; the non-duplicate data blocks are stored, and the corresponding metadata are refreshed.
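A minimal sketch of the first stage of this lookup follows, assuming the SimilarityGroup structure sketched above; the Jaccard-style similarity measure and the 0.5 threshold are illustrative assumptions, since the method only requires the similarity to exceed a certain value.

```python
SIMILARITY_THRESHOLD = 0.5   # illustrative; the method only says "a certain value"

def sample_similarity(sample, index) -> float:
    """Fraction of sampled hashes shared between a file and a group's index."""
    a, b = set(sample), set(index)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def find_similarity_group(sample, groups: list) -> SimilarityGroup:
    """Stage 1 of the lookup: match the file's sampling data against every
    group's similarity index; create a new group when nothing is similar enough."""
    best = max(groups, key=lambda g: sample_similarity(sample, g.sample_index),
               default=None)
    if best is None or sample_similarity(sample, best.sample_index) < SIMILARITY_THRESHOLD:
        best = SimilarityGroup()
        groups.append(best)
    return best
```

Stage 2 then compares block hashes only against the chosen group's metadata database, which is why the number of data queries stays small.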
This technique reduces the number of data queries in the process of identifying duplicate data; compared with traditional in-line data de-duplication techniques, it greatly improves the performance of data de-duplication.
The data de-duplication functional module is deployed on every cluster node of the system; if any node or several nodes in the cluster fail, neither the de-duplication service nor the user write service is affected.
De-duplication can dynamically and intelligently adjust the system resources occupied by the de-duplication operation and preferentially guarantees online service performance. De-duplication adopts a variable-length file chunking technique, using the efficient sliding-window chunking technique together with the Adler algorithm and the TTTD algorithm, so it is superior to traditional data de-duplication technology in file chunking efficiency. In addition, de-duplication uses an advanced file similarity detection technique to detect duplicate data; this technique divides the files in the storage system into a plurality of similarity groups through a similarity detection algorithm and performs data comparison within a group, which facilitates the identification of duplicate data and reduces the number of data comparisons.
Referring to Fig. 4, the data de-duplication of the present invention uses a patented file similarity detection technique to identify duplicate data. This detection technique samples the hash values of the file's data blocks with a certain algorithm and compares these sampled hash values with the hash values of the existing data blocks in the current system to determine the similarity of files. According to the similarity of files, the de-duplicated files can be divided into different similarity groups. Within each similarity group, the sampled hashes of the files constitute the similarity index of the group. The block hashes of all files in the same similarity group are saved in the metadata database of that group, for hash comparison when a new file is written.
When a file is to be de-duplicated, the sampled hash values of the file are first compared with the similarity index of each similarity group. If the similarity between the file and a certain similarity group exceeds a certain value, the file is determined to belong to that group. The hash value of each block of the file is then compared with the hash values in the metadata database of that group; the unique data blocks are stored, and the corresponding metadata are refreshed.
This technique reduces the number of data queries in the process of identifying duplicate data; compared with traditional in-line data de-duplication techniques, it greatly improves the performance of data de-duplication.
The effect of data de-duplication is usually measured by the data de-duplication ratio (de-duplication ratio for short). The total amount of data before de-duplication denotes the space required to store the data in a traditional storage system, and the total amount of data after de-duplication denotes the space required to store the same data in a storage system with de-duplication. The ratio between these two values is the data de-duplication ratio:
de-duplication ratio = total amount of data before de-duplication / total amount of data after de-duplication.
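For example, if data that occupy 500 GB before de-duplication occupy 100 GB after de-duplication, the de-duplication ratio is 500/100 = 5:1 (the figures are only illustrative).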
When the data in a file system are processed by the data de-duplication feature, the number of duplicate data segments can differ greatly with the nature of the data in the data set; it usually depends on the type of the data files and on the applications that created them. Analyzing concrete application scenarios helps us understand the effect and value of the de-duplication feature in those scenarios.
In some application scenarios, for example a set of backup images of a database, the advantage of de-duplication when writing data into the file system is often very obvious, because each new write operation only writes the new data segments introduced by that operation; in traditional database backup applications the data segments that differ between two backups often amount to only 1%-2%, although higher change rates also occur. In such application scenarios the de-duplication ratio is high and the benefit is evident.
On the contrary, in other application scenarios, for example a material library holding tens of thousands of photos, the effect that de-duplication can achieve is unsatisfactory, because the number of duplicate data segments that can be found between different photos is very limited. This ultimately shows up as a poor de-duplication ratio.
Therefore, when applying data de-duplication technology, the concrete application scenario needs to be analyzed concretely. We recommend enabling the data de-duplication function in two kinds of application scenarios: the first is the backup scenario, in which the repetition rate of the data is high and the de-duplication effect is quite obvious; the second is the virtual machine scenario, in which the storage system holds a large number of virtual machine files and copies of these files, the data repetition rate is high, and the de-duplication effect is obvious.
Data de-duplication uses multiple techniques to optimize performance, so it brings almost no influence on the online service performance of the system.
First, de-duplication can dynamically and intelligently adjust the system resources occupied by the de-duplication operation and preferentially guarantees online service performance.
Second, de-duplication adopts an advanced variable-length file chunking technique, using the efficient sliding-window chunking technique together with the Adler algorithm and the TTTD algorithm, so it is superior to traditional data de-duplication technology in file chunking efficiency.
In addition, de-duplication uses an advanced file similarity detection technique to detect duplicate data. This technique divides the files in the storage system into a plurality of similarity groups through a similarity detection algorithm and performs data comparison within a group, which facilitates the identification of duplicate data and reduces the number of data comparisons. Compared with traditional data de-duplication technology, it achieves a higher de-duplication ratio and better performance.
In traditional data de-duplication products, the data service is provided through a single physical device; in this case both the de-duplication software and the physical device that carries it become single points of failure, which challenges system reliability.
The present invention combines data de-duplication with Active-Active clustering to provide system-level reliability. In a multi-node cluster environment, as long as any node in the system is still running normally, both de-duplication and the writing of de-duplicated data can proceed smoothly, which guarantees the continuity of customer services.
By eliminating the redundant data in the data space, the data de-duplication function of the present invention lets users benefit from storage space efficiency. This is directly reflected in a lower initial storage purchase cost; the de-duplication function effectively controls data growth and also postpones subsequent storage expansion needs. In addition, the lower storage capacity requirement frees the user from maintaining a large number of storage devices, which reduces operation and maintenance costs such as floor space, electric power, cooling and maintenance management, and minimizes the total cost of ownership (TCO).
When the data de-duplication technology of the present invention is applied to the backup scenario, the de-duplication effect is very obvious. In this application scenario, a backup server backs up user data into a NAS storage space with full and incremental backups under a certain backup policy, and the data repetition rate is high.
The data de-duplication of the present invention also has a clear advantage in the virtual machine scenario. In this application, the user stores a large number of virtual machine files on the storage device; these files usually contain the same operating system, which means a large amount of duplicate data. De-duplication is optimized for virtual machine files and can efficiently identify the duplicate data in this class of files.
Data de-duplication is a policy-based post-processing de-duplication technique. The user can define the period of the de-duplication operation and thereby control when it is started. The de-duplication operation runs in the background of the system and is completely transparent to the business. In addition, unlike traditional post-processing de-duplication techniques, the de-duplication in the present invention can dynamically adjust its occupation of system resources, preferentially guarantees online service performance, and minimizes the influence on online services.
The present invention has been described in detail above with reference to the accompanying drawings and embodiments; those skilled in the art can make many variations to the present invention in light of the above description. Therefore, certain details of the embodiments shall not be construed as limiting the present invention, and the scope defined by the appended claims shall be the scope of protection of the present invention.

Claims (9)

1. A data de-duplication method, characterized in that it comprises the following steps:
writing a file, performing variable-length chunking on said file to form a plurality of data blocks of different lengths, and calculating the hash values of said data blocks;
sampling said hash values to form the sampling data of said file;
comparing the sampling data of said file with the sampling data of existing files to locate a similarity group of said file;
comparing the hash values of said file with the hash values of said similarity group in a metadata database to determine the duplicate data blocks;
de-duplicating and saving the non-duplicate data blocks;
generating a meta file, and saving the hash values of said non-duplicate data blocks into said metadata database.
2. The data de-duplication method according to claim 1, characterized in that said variable-length chunking adopts a sliding-window technique and cuts the data according to the file content.
3. The data de-duplication method according to claim 2, characterized in that, when calculating the hash value of said data block, the hash value of the window contents after said sliding window moves is computed from the hash value before the move, the byte value that slides into the window and the byte value that slides out.
4. The data de-duplication method according to claim 1, characterized in that, when calculating the hash value of said data block, a minimum data block size is set and hash calculation is not performed on the data within the minimum-size interval at the head of said data block.
5. The data de-duplication method according to claim 1, characterized in that, when comparing the sampling data of said file with the sampling data of existing files, if the similarity between the sampling data of said file and the sampling data of a current existing file exceeds a certain value, the data set corresponding to the sampling data of the current existing file is determined to be a similarity group of said file.
6. The data de-duplication method according to claim 1, characterized in that said data blocks are saved grouped by similarity group.
7. The data de-duplication method according to claim 1, characterized in that said meta file is a data description of the original file, contains the various file attributes of the original file, and records the storage location of each data block of the original file.
8. The data de-duplication method according to claim 1, characterized in that, when the system receives a read/write request, the method further comprises the following steps:
judging whether the target file is a file that has undergone the de-duplication operation;
if said target file has not undergone the de-duplication operation, reading or writing said target file directly;
if said target file has undergone the de-duplication operation, parsing the meta file of said target file to locate the target data blocks of the read/write request;
completing the read/write operation.
9. The data de-duplication method according to claim 1, characterized in that the period of said de-duplication step is adjustable.
CN201110330421A 2011-10-27 2011-10-27 Data de-duplication method Pending CN102323958A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110330421A CN102323958A (en) 2011-10-27 2011-10-27 Data de-duplication method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110330421A CN102323958A (en) 2011-10-27 2011-10-27 Data de-duplication method

Publications (1)

Publication Number Publication Date
CN102323958A true CN102323958A (en) 2012-01-18

Family

ID=45451701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110330421A Pending CN102323958A (en) 2011-10-27 2011-10-27 Data de-duplication method

Country Status (1)

Country Link
CN (1) CN102323958A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101882141A (en) * 2009-05-08 2010-11-10 北京众志和达信息技术有限公司 Method and system for implementing repeated data deletion
CN101788976A (en) * 2010-02-10 2010-07-28 北京播思软件技术有限公司 File splitting method based on contents
CN102222085A (en) * 2011-05-17 2011-10-19 华中科技大学 Data de-duplication method based on combination of similarity and locality

Cited By (63)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377285A (en) * 2012-04-25 2013-10-30 国际商业机器公司 Enhanced reliability in deduplication technology over storage clouds
US11079953B2 (en) 2012-06-19 2021-08-03 International Business Machines Corporation Packing deduplicated data into finite-sized containers
US9880771B2 (en) 2012-06-19 2018-01-30 International Business Machines Corporation Packing deduplicated data into finite-sized containers
CN103514247A (en) * 2012-06-19 2014-01-15 国际商业机器公司 Method and system for packing deduplicated data into finite-sized container
CN103685359A (en) * 2012-09-06 2014-03-26 中兴通讯股份有限公司 Data processing method and device
CN103685359B (en) * 2012-09-06 2018-04-10 中兴通讯股份有限公司 Data processing method and device
CN103678473B (en) * 2012-09-24 2017-04-12 国际商业机器公司 Method and system for filing and recycling efficient file in deduplicating virtual media
CN103678473A (en) * 2012-09-24 2014-03-26 国际商业机器公司 Method and system for filing and recycling efficient file in deduplicating virtual media
US9678672B2 (en) 2012-09-24 2017-06-13 International Business Machines Corporation Selective erasure of expired files or extents in deduplicating virutal media for efficient file reclamation
CN102999433A (en) * 2012-11-21 2013-03-27 北京航空航天大学 Redundant data deletion method and system of virtual disks
CN102999433B (en) * 2012-11-21 2015-06-17 北京航空航天大学 Redundant data deletion method and system of virtual disks
CN106445413A (en) * 2012-12-12 2017-02-22 华为技术有限公司 Processing method and device for data in trunk system
CN106445413B (en) * 2012-12-12 2019-10-25 华为技术有限公司 Data processing method and device in group system
CN103020255A (en) * 2012-12-21 2013-04-03 华为技术有限公司 Hierarchical storage method and hierarchical storage device
CN103020255B (en) * 2012-12-21 2016-03-02 华为技术有限公司 Classification storage means and device
CN103455420A (en) * 2013-08-16 2013-12-18 华为技术有限公司 Test data construction method and equipment
CN103455420B (en) * 2013-08-16 2016-06-15 华为技术有限公司 A kind of building method testing data and equipment
CN104572679B (en) * 2013-10-16 2017-11-03 北大方正集团有限公司 Public sentiment data storage method and device
CN104572679A (en) * 2013-10-16 2015-04-29 北大方正集团有限公司 Public opinion data storage method and device
CN103559143A (en) * 2013-11-08 2014-02-05 华为技术有限公司 Data copying management device and data copying method of data copying management device
CN103810297B (en) * 2014-03-07 2017-02-01 华为技术有限公司 Writing method, reading method, writing device and reading device on basis of re-deleting technology
CN104156420B (en) * 2014-08-06 2017-10-03 曙光信息产业(北京)有限公司 The management method and device of transaction journal
CN104156420A (en) * 2014-08-06 2014-11-19 曙光信息产业(北京)有限公司 Method and device for managing transaction journal
CN104331525B (en) * 2014-12-01 2018-01-16 国家计算机网络与信息安全管理中心 Sharing method based on data de-duplication
CN104331525A (en) * 2014-12-01 2015-02-04 国家计算机网络与信息安全管理中心 Sharing method based on repeating data deletion
CN104408141B (en) * 2014-12-01 2018-04-17 国家计算机网络与信息安全管理中心 One kind disappears superfluous file system and its data deployment method
CN104408141A (en) * 2014-12-01 2015-03-11 国家计算机网络与信息安全管理中心 Redundancy removal file system and data deployment method thereof
US11243915B2 (en) 2014-12-10 2022-02-08 International Business Machines Corporation Method and apparatus for data deduplication
US10089321B2 (en) 2014-12-10 2018-10-02 International Business Machines Corporation Method and apparatus for data deduplication
CN105740266A (en) * 2014-12-10 2016-07-06 国际商业机器公司 Data deduplication method and device
CN106302202A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 Data current-limiting method and device
CN105183399A (en) * 2015-09-30 2015-12-23 北京奇艺世纪科技有限公司 Data writing and reading method and device based on elastic block storage
CN106886367A (en) * 2015-11-04 2017-06-23 Hgst荷兰公司 For the duplicate removal in memory management reference block to reference set polymerization
CN105511812A (en) * 2015-12-10 2016-04-20 浪潮(北京)电子信息产业有限公司 Method and device for optimizing big data of memory system
CN105511812B (en) * 2015-12-10 2018-12-18 浪潮(北京)电子信息产业有限公司 A kind of storage system big data optimization method and device
CN108431815B (en) * 2016-01-12 2022-10-11 国际商业机器公司 Deduplication of distributed data in a processor grid
CN108431815A (en) * 2016-01-12 2018-08-21 国际商业机器公司 The duplicate removal complex data of distributed data in processor grid
CN105786655A (en) * 2016-03-08 2016-07-20 成都云祺科技有限公司 Repeated data deleting method for virtual machine backup data
CN105912622A (en) * 2016-04-05 2016-08-31 重庆大学 Data de-duplication method for lossless compressed files
CN106708927B (en) * 2016-11-18 2021-01-05 北京二六三企业通信有限公司 File deduplication processing method and device
CN106708927A (en) * 2016-11-18 2017-05-24 北京二六三企业通信有限公司 Duplicate removal processing method and duplicate removal processing device for files
CN106610794A (en) * 2016-11-21 2017-05-03 深圳市深信服电子科技有限公司 Convergence blocking method and device for data deduplication
CN106610794B (en) * 2016-11-21 2020-05-15 深信服科技股份有限公司 Convergence blocking method and device for data deduplication
CN108241615A (en) * 2016-12-23 2018-07-03 中国电信股份有限公司 Data duplicate removal method and device
CN106980680A (en) * 2017-03-30 2017-07-25 联想(北京)有限公司 Date storage method and storage device
CN106980680B (en) * 2017-03-30 2020-11-20 联想(北京)有限公司 Data storage method and storage device
CN108563649B (en) * 2017-12-12 2021-12-07 南京富士通南大软件技术有限公司 Offline duplicate removal method based on GlusterFS distributed file system
CN108563649A (en) * 2017-12-12 2018-09-21 南京富士通南大软件技术有限公司 Offline De-weight method based on GlusterFS distributed file systems
CN107944041B (en) * 2017-12-14 2021-11-09 成都雅骏新能源汽车科技股份有限公司 Storage structure optimization method of HDFS (Hadoop distributed File System)
CN107944041A (en) * 2017-12-14 2018-04-20 成都雅骏新能源汽车科技股份有限公司 A kind of storage organization optimization method of HDFS
CN110083743A (en) * 2019-03-28 2019-08-02 哈尔滨工业大学(深圳) A kind of quick set of metadata of similar data detection method based on uniform sampling
CN110083743B (en) * 2019-03-28 2021-11-16 哈尔滨工业大学(深圳) Rapid similar data detection method based on unified sampling
CN113472609A (en) * 2020-05-25 2021-10-01 汪永强 Data repeated transmission marking system for wireless communication
CN113472609B (en) * 2020-05-25 2024-03-19 汪永强 Data repeated sending marking system for wireless communication
WO2022001548A1 (en) * 2020-06-30 2022-01-06 华为技术有限公司 Data transmission method, system, apparatus, device, and medium
CN111880743A (en) * 2020-07-29 2020-11-03 北京浪潮数据技术有限公司 Data storage method, device, equipment and storage medium
CN112380197A (en) * 2020-10-29 2021-02-19 中科热备(北京)云计算技术有限公司 Method for deleting repeated data based on front end
CN113672170A (en) * 2021-07-23 2021-11-19 复旦大学附属肿瘤医院 Redundant data marking and removing method
CN114138552A (en) * 2021-11-11 2022-03-04 苏州浪潮智能科技有限公司 Data dynamic deduplication method, system, terminal and storage medium
CN114138552B (en) * 2021-11-11 2024-01-12 苏州浪潮智能科技有限公司 Data dynamic repeating and deleting method, system, terminal and storage medium
CN115016330A (en) * 2022-08-10 2022-09-06 深圳市虎一科技有限公司 Automatic menu and intelligent kitchen power matching method and system
CN117369731A (en) * 2023-12-07 2024-01-09 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium
CN117369731B (en) * 2023-12-07 2024-02-27 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN102323958A (en) Data de-duplication method
US11288235B2 (en) Synchronized data deduplication
US10031675B1 (en) Method and system for tiering data
CN102222085B (en) Data de-duplication method based on combination of similarity and locality
CN101989929B (en) Disaster recovery data backup method and system
US8983952B1 (en) System and method for partitioning backup data streams in a deduplication based storage system
US10303797B1 (en) Clustering files in deduplication systems
CN101777017B (en) Rapid recovery method of continuous data protection system
US8712963B1 (en) Method and apparatus for content-aware resizing of data chunks for replication
CN101963982B (en) Method for managing metadata of redundancy deletion and storage system based on location sensitive Hash
CN102831222A (en) Differential compression method based on data de-duplication
CN109445702B (en) block-level data deduplication storage system
CN104932956A (en) Big-data-oriented cloud disaster tolerant backup method
CN104932841A (en) Saving type duplicated data deleting method in cloud storage system
US9183218B1 (en) Method and system to improve deduplication of structured datasets using hybrid chunking and block header removal
CN101866358A (en) Multidimensional interval querying method and system thereof
US10838923B1 (en) Poor deduplication identification
CN107885619A (en) A kind of data compaction duplicate removal and the method and system of mirror image remote backup protection
CN105630810A (en) Method for uploading mass small files in distributed storage system
CN113672170A (en) Redundant data marking and removing method
CN105095027A (en) Data backup method and apparatus
CN109445703B (en) A kind of Delta compression storage assembly based on block grade data deduplication
CN102722450B (en) Storage method for redundancy deletion block device based on location-sensitive hash
CN103049508A (en) Method and device for processing data
CN114442937B (en) File caching method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120118