CN102323958A - Data de-duplication method - Google Patents

Data de-duplication method

Info

Publication number
CN102323958A
CN102323958A (application CN201110330421A)
Authority
CN
China
Prior art keywords
data
file
duplication
cryptographic hash
duplication method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110330421A
Other languages
Chinese (zh)
Inventor
安然
谈川玉
卢宝丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WENGUANG INTERDYANMIC TV CO Ltd SHANGHAI
Original Assignee
WENGUANG INTERDYANMIC TV CO Ltd SHANGHAI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WENGUANG INTERDYANMIC TV CO Ltd SHANGHAI filed Critical WENGUANG INTERDYANMIC TV CO Ltd SHANGHAI
Priority to CN201110330421A priority Critical patent/CN102323958A/en
Publication of CN102323958A publication Critical patent/CN102323958A/en
Pending legal-status Critical Current


Abstract

The invention discloses a data de-duplication method, which comprises the following steps: writing a file, performing variable-length chunking on the file to divide it into a plurality of data blocks of different lengths, and calculating hash values of the data blocks; sampling the hash values to form the sampling data of the file; comparing the sampling data of the file with the sampling data of existing files to locate a similarity group for the file; comparing the hash values of the file with the hash values of the similarity group in a metadata database to determine the duplicate data blocks; de-duplicating, and storing the non-duplicate data blocks; and generating a meta file and saving the hash values of the non-duplicate data blocks into the metadata database. With the data de-duplication method, the system resources occupied by the de-duplication operation can be adjusted dynamically, the performance of online services is guaranteed preferentially, and the influence on online services is minimized. The data de-duplication method is characterized by high reliability, good stability and a high de-duplication ratio.

Description

Data de-duplication method
Technical field
The present invention relates to a data deletion method, and in particular to a data de-duplication method.
Background technology
Data de-duplication (De-duplication) is a data reduction technology intended to reduce the storage capacity used in a storage system. By deleting the repeated data in the storage system and keeping only one copy, it eliminates redundant data. Data de-duplication technology can greatly reduce the consumption of physical storage capacity.
According to when the data are processed, data de-duplication technology can be divided into an in-line processing mode (In-Line) and a post-processing mode (Post-Process).
The in-line data de-duplication method performs de-duplication before the data are written to disk. In-line de-duplication reduces the data volume to a certain extent, but it also has a problem: the de-duplication itself lowers the data throughput and degrades service performance. In addition, because de-duplication is performed before the data are written to disk, the de-duplication process itself is a single point of failure.
The post-processing data de-duplication method performs de-duplication after the data have been written to disk. The data are first written to a temporary disk space, de-duplication is started afterwards, and the de-duplicated data are finally written to disk. Because de-duplication is carried out on a separate storage device after the data have been written to disk, it generally has almost no influence on normal service processing. However, the existing post-processing mode cannot dynamically adjust its occupation of system resources and has no ability to preferentially guarantee online service performance, so when system utilization is too high it can still affect the online service system.
According to the de-duplication granularity, data de-duplication technology can be divided into file level, file-block level and byte level.
File-level data de-duplication detects and deletes duplicate data with the file as the unit. The advantage of this mode is a simple algorithm and fast computation; the disadvantage is a low de-duplication ratio.
File-block-level data de-duplication divides a file into data blocks in various ways and detects duplicates with the data block as the unit. The advantages of this method are fast computation and sensitivity to data changes.
According to the chunking mode, block-level de-duplication is further divided into fixed-length chunking and variable-length chunking.
Referring to Fig. 3, fixed-length chunking divides a file into blocks of fixed length. This method is very sensitive to insertions and deletions of data; in practical applications the detected data duplication is very low and the de-duplication effect is very limited.
Byte-level data de-duplication searches for and deletes repeated content at the byte level, generally producing the differing parts through a differential compression strategy. The advantage of byte-level de-duplication is a higher de-duplication ratio; the disadvantage is a slower de-duplication speed.
In addition, traditional data de-duplication methods provide the data service through a single physical device; when de-duplication is performed this device forms a single point of failure, which challenges system reliability.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art and to provide a data de-duplication method that can dynamically adjust the system resources occupied by the de-duplication operation, preferentially guarantees online service performance, minimizes the influence on online services, and is characterized by high reliability, good stability, a high de-duplication ratio and excellent performance.
The technical solution that achieves the above object is as follows:
A data de-duplication method of the present invention comprises:
writing a file, performing variable-length chunking on said file to form a plurality of data blocks of different lengths, and calculating the hash values of said data blocks;
sampling said hash values to form the sampling data of said file;
comparing the sampling data of said file with the sampling data of existing files to locate a similarity group of said file;
comparing the hash values of said file with the hash values of said similarity group in a metadata database to determine the duplicate data blocks;
de-duplicating and saving the non-duplicate data blocks;
generating a meta file, and saving the hash values of said non-duplicate data blocks into said metadata database.
The variable-length chunking adopts a sliding-window technique that cuts the data according to the file content. This technique is insensitive to changes in the file content: inserting or deleting data affects only a few data blocks, and the remaining data blocks are unaffected.
When calculating the hash value of a data block, the hash value of the window contents after the sliding window moves is computed from the hash value before the move, the byte value that slides into the window and the byte value that slides out, which improves the efficiency of the de-duplication operation.
When calculating the hash values of the data blocks, a minimum data block size is set, and hash calculation is not performed on the data within the minimum-size interval at the head of a data block, which reduces the computation cost and improves the efficiency of the de-duplication operation.
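The following Python sketch illustrates one way the variable-length chunking, rolling-hash update and minimum-block-size skip described above could be implemented; the window size, divisor, hash base and block-size limits are illustrative assumptions, not values specified by the invention.

```python
# Minimal sketch of content-defined (variable-length) chunking with a rolling
# hash and a skipped minimum-size interval. All constants are assumptions.
WINDOW = 48            # sliding-window width in bytes
MIN_BLOCK = 2 * 1024   # hash calculation is skipped over the block head
MAX_BLOCK = 64 * 1024
DIVISOR, MAGIC = 4096, 13
BASE, MOD = 257, 1 << 32
POW = pow(BASE, WINDOW - 1, MOD)   # weight of the byte that slides out

def chunk(data: bytes):
    """Yield (offset, length) of the variable-length blocks of `data`."""
    n, start = len(data), 0
    while start < n:
        h = 0
        # Start hashing only near the end of the minimum-size interval,
        # so the head of every block is skipped, as the method describes.
        pos = start + max(MIN_BLOCK - WINDOW, 0)
        end = min(start + MAX_BLOCK, n)
        cut = end
        for i in range(pos, end):
            if i - pos >= WINDOW:                       # window already full:
                h = (h - data[i - WINDOW] * POW) % MOD  # drop the byte sliding out
            h = (h * BASE + data[i]) % MOD              # add the byte sliding in
            if i - start + 1 >= MIN_BLOCK and h % DIVISOR == MAGIC:
                cut = i + 1                             # content-defined boundary
                break
        yield start, cut - start
        start = cut
```

Because a boundary depends only on the bytes inside the window, inserting or deleting data shifts at most the blocks around the edit, which is exactly the insensitivity to content changes described above.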
When comparing the sampling data of the file with the sampling data of existing files, if the similarity between the sampling data of the file and the sampling data of a current existing file exceeds a certain value, the data set corresponding to the sampling data of that existing file is determined to be a similarity group of the file.
The data blocks are saved grouped by similarity group.
The meta file is a data description of the original file; it contains the various file attributes of the original file and records the storage location of each data block of the original file.
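As a minimal sketch, assuming a simple in-memory representation, the meta file and the per-group metadata described above could be modeled as follows; the field names and structures are illustrative assumptions, not the patent's data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BlockRef:
    digest: str       # hash value of the data block
    store_path: str   # where the single stored copy of the block lives
    offset: int       # position of the block within the original file
    length: int       # block length in bytes

@dataclass
class MetaFile:
    """Data description of the original file: attributes plus block locations."""
    name: str
    size: int
    blocks: List[BlockRef] = field(default_factory=list)

@dataclass
class SimilarityGroup:
    """Per-group metadata: the similarity index (sampled hashes) and the
    hash values of all blocks already stored for this group."""
    sample_index: List[str] = field(default_factory=list)
    block_hashes: Dict[str, BlockRef] = field(default_factory=dict)
```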
When the system receives a read/write request, the method further comprises the following steps:
judging whether the target file is a file that has undergone the de-duplication operation;
if the target file has not undergone the de-duplication operation, reading or writing the target file directly;
if the target file has undergone the de-duplication operation, parsing the meta file of the target file to locate the target data blocks of the read/write request;
completing the read/write operation.
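A minimal sketch of this read path follows, reusing the MetaFile and BlockRef structures sketched above; load_meta is a hypothetical helper and the range arithmetic is only illustrative.

```python
def load_meta(path: str):
    """Hypothetical helper: return the MetaFile for `path`, or None when the
    target file has never been de-duplicated. Lookup is implementation-specific."""
    return None

def read_range(path: str, offset: int, size: int) -> bytes:
    meta = load_meta(path)
    if meta is None:
        # Target file has not been de-duplicated: read it directly.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(size)
    # De-duplicated file: resolve only the blocks the request actually touches.
    out, end = bytearray(), offset + size
    for ref in meta.blocks:
        if ref.offset + ref.length <= offset or ref.offset >= end:
            continue                       # block lies outside the requested range
        with open(ref.store_path, "rb") as f:
            block = f.read(ref.length)
        lo = max(offset - ref.offset, 0)
        hi = min(end - ref.offset, ref.length)
        out += block[lo:hi]
    return bytes(out)
```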
The period of the above de-duplication step is adjustable.
By adopting the above technical solution, the present invention has the following beneficial effects:
The data de-duplication in the present invention is a policy-based post-processing de-duplication technique. The user can define the period of the de-duplication operation and thereby control when it is started. The de-duplication operation runs in the background of the system and is completely transparent to the business; its occupation of system resources can be adjusted dynamically, online service performance is guaranteed preferentially, and the influence on online services is minimized. The sliding-window technique is insensitive to changes in the file content: inserting or deleting data affects only a few data blocks, and the remaining data blocks are unaffected. When a de-duplicated file is read or written, it is not necessary to parse all the data blocks of the file; only the data blocks affected by the operation need to be located, so the data manipulation is confined to a small range. These measures guarantee, to the greatest extent, the business read/write performance of a system with the de-duplication function enabled. The invention thus realizes a data de-duplication method that can dynamically adjust the system resources occupied by the de-duplication operation, preferentially guarantees online service performance, minimizes the influence on online services, and is characterized by high reliability, good stability, a high de-duplication ratio and excellent performance.
Description of drawings
Fig. 1 is a flow chart of the data de-duplication method of the present invention;
Fig. 2 is a schematic diagram of the variable-length chunking technique of the data de-duplication method of the present invention;
Fig. 3 is a schematic diagram of the fixed-length chunking technique of the prior art;
Fig. 4 is a schematic diagram of the similarity detection technique of the prior art.
Embodiment
The present invention is described further below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, the data de-duplication method of the present invention comprises the following steps:
First, a file is written; variable-length chunking is performed on the file to form a plurality of data blocks of different lengths, and the hash values of the data blocks are calculated;
the hash values are sampled to form the sampling data of the file;
the sampling data of the file are compared with the sampling data of existing files to locate a similarity group of the file;
if the similarity between the sampling data of the file and the sampling data of a current existing file exceeds a certain value, the data set corresponding to the sampling data of that existing file is determined to be a similarity group of the file;
the hash values of the file are compared with the hash values of the similarity group in the metadata database to determine the duplicate data blocks;
de-duplication is performed and the non-duplicate data blocks are saved;
a meta file is generated, and the hash values of the non-duplicate data blocks are saved into the metadata database.
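The following Python sketch ties these steps together, assuming the chunk function sketched earlier and the MetaFile, BlockRef and SimilarityGroup structures sketched above; find_similarity_group is sketched further below, and the SHA-1 block hash, the sampling interval and the on-disk layout are illustrative assumptions.

```python
import hashlib

SAMPLE_EVERY = 8   # keep every 8th block hash as the file's sampling data (assumption)

def dedup_write(name: str, data: bytes, groups: list, store_dir: str) -> MetaFile:
    blocks = list(chunk(data))                                    # (offset, length) pairs
    hashes = [hashlib.sha1(data[o:o + ln]).hexdigest() for o, ln in blocks]
    sample = hashes[::SAMPLE_EVERY]                               # sampling data of the file

    group = find_similarity_group(sample, groups)                 # locate the similarity group
    meta = MetaFile(name=name, size=len(data))                    # meta file for the original
    for (off, ln), h in zip(blocks, hashes):
        ref = group.block_hashes.get(h)
        if ref is None:                                           # non-duplicate block
            path = f"{store_dir}/{h}"
            with open(path, "wb") as f:                           # save the single copy
                f.write(data[off:off + ln])
            ref = BlockRef(digest=h, store_path=path, offset=off, length=ln)
            group.block_hashes[h] = ref                           # refresh group metadata
        meta.blocks.append(BlockRef(h, ref.store_path, off, ln))  # record block location
    group.sample_index.extend(sample)
    return meta
```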
The meta file is a data description of the original file; it contains the various file attributes of the original file and records the storage location of each data block of the original file.
The data blocks are saved grouped by similarity group.
When the system receives a read/write request, the method further comprises the following steps:
judging whether the target file is a file that has undergone the de-duplication operation;
if the target file has not undergone the de-duplication operation, reading or writing the target file directly;
if the target file has undergone the de-duplication operation, parsing the meta file of the target file to locate the target data blocks of the read/write request;
completing the read/write operation.
The period of the above de-duplication step is adjustable.
Referring to Fig. 2, the variable-length chunking adopts a sliding-window technique that cuts the data according to the file content. This technique is insensitive to changes in the file content: inserting or deleting data affects only a few data blocks, and the remaining data blocks are unaffected.
When calculating the hash value of a data block, the hash value of the window contents after the sliding window moves is computed from the hash value before the move, the byte value that slides into the window and the byte value that slides out, which improves the efficiency of the de-duplication operation.
A two-threshold two-divisor algorithm (TTTD) is adopted to further optimize the performance of the de-duplication operation. When calculating the hash values of the data blocks, a minimum data block size is set, and hash calculation is not performed on the data within the minimum-size interval at the head of a data block, which reduces the computation cost and improves the efficiency of the de-duplication operation.
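A minimal sketch of the two-threshold two-divisor (TTTD) cut decision follows; the thresholds, divisors and target remainder are illustrative assumptions, and the rolling hash values are assumed to come from a window hash such as the one sketched earlier.

```python
T_MIN, T_MAX = 2 * 1024, 64 * 1024   # the two size thresholds (assumptions)
D, D2 = 4096, 1024                   # main divisor and smaller backup divisor
MAGIC = 13

def tttd_cut(rolling_hashes: list) -> int:
    """Return the block length chosen by the TTTD rule, given the rolling hash
    value at every byte position of the data being chunked."""
    backup = None
    for length, h in enumerate(rolling_hashes, start=1):
        if length < T_MIN:
            continue                          # skip the minimum-size interval
        if h % D == MAGIC:
            return length                     # main divisor hit: cut here
        if h % D2 == MAGIC:
            backup = length                   # remember a fallback cut point
        if length >= T_MAX:
            return backup or T_MAX            # forced cut: prefer the fallback
    return backup if backup is not None else len(rolling_hashes)
```

The smaller backup divisor fires more often than the main one, so a forced cut at T_MAX can usually fall back to a content-defined position, which keeps block boundaries stable across edits.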
The method samples the hash (HASH) values of the file's data blocks with a certain algorithm and compares these sampled values with the hash values of the existing data blocks in the current system to determine the similarity of files. According to the similarity of files, the de-duplicated files can be divided into different similarity groups. Within each similarity group, the sampled hashes of the files constitute the similarity index of the group. The block hashes of all files in the same similarity group are saved in the metadata database of that group, for hash comparison when a new file is written.
When a file is to be de-duplicated, the sampled hash values of the file are first compared with the similarity index of each similarity group. If the similarity between the file and a certain similarity group exceeds a certain value, the file is determined to belong to that group. The hash value of each block of the file is then compared with the hash values in the metadata database of that group; the non-duplicate data blocks are stored, and the corresponding metadata are refreshed.
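A minimal sketch of the first stage of this lookup follows, assuming the SimilarityGroup structure sketched above; the Jaccard-style similarity measure and the 0.5 threshold are illustrative assumptions, since the method only requires the similarity to exceed a certain value.

```python
SIMILARITY_THRESHOLD = 0.5   # illustrative; the method only says "a certain value"

def sample_similarity(sample, index) -> float:
    """Fraction of sampled hashes shared between a file and a group's index."""
    a, b = set(sample), set(index)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def find_similarity_group(sample, groups: list) -> SimilarityGroup:
    """Stage 1 of the lookup: match the file's sampling data against every
    group's similarity index; create a new group when nothing is similar enough."""
    best = max(groups, key=lambda g: sample_similarity(sample, g.sample_index),
               default=None)
    if best is None or sample_similarity(sample, best.sample_index) < SIMILARITY_THRESHOLD:
        best = SimilarityGroup()
        groups.append(best)
    return best
```

Stage 2 then compares block hashes only against the chosen group's metadata database, which is why the number of data queries stays small.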
This technique reduces the number of data queries in the process of identifying duplicate data; compared with traditional in-line data de-duplication techniques, it greatly improves the performance of data de-duplication.
The data de-duplication functional module is deployed on every cluster node of the system; if any node or several nodes in the cluster fail, neither the de-duplication service nor the user write service is affected.
De-duplication can dynamically and intelligently adjust the system resources occupied by the de-duplication operation and preferentially guarantees online service performance. De-duplication adopts a variable-length file chunking technique, using the efficient sliding-window chunking technique together with the Adler algorithm and the TTTD algorithm, so it is superior to traditional data de-duplication technology in file chunking efficiency. In addition, de-duplication uses an advanced file similarity detection technique to detect duplicate data; this technique divides the files in the storage system into a plurality of similarity groups through a similarity detection algorithm and performs data comparison within a group, which facilitates the identification of duplicate data and reduces the number of data comparisons.
Referring to Fig. 4, the data de-duplication of the present invention uses a patented file similarity detection technique to identify duplicate data. This detection technique samples the hash values of the file's data blocks with a certain algorithm and compares these sampled hash values with the hash values of the existing data blocks in the current system to determine the similarity of files. According to the similarity of files, the de-duplicated files can be divided into different similarity groups. Within each similarity group, the sampled hashes of the files constitute the similarity index of the group. The block hashes of all files in the same similarity group are saved in the metadata database of that group, for hash comparison when a new file is written.
When a file is to be de-duplicated, the sampled hash values of the file are first compared with the similarity index of each similarity group. If the similarity between the file and a certain similarity group exceeds a certain value, the file is determined to belong to that group. The hash value of each block of the file is then compared with the hash values in the metadata database of that group; the unique data blocks are stored, and the corresponding metadata are refreshed.
This technique reduces the number of data queries in the process of identifying duplicate data; compared with traditional in-line data de-duplication techniques, it greatly improves the performance of data de-duplication.
The effect of data de-duplication is usually measured by the data de-duplication ratio (de-duplication ratio for short). The total amount of data before de-duplication denotes the space required to store the data in a traditional storage system, and the total amount of data after de-duplication denotes the space required to store the same data in a storage system with de-duplication. The ratio between these two values is the data de-duplication ratio:
de-duplication ratio = total amount of data before de-duplication / total amount of data after de-duplication.
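For example, if data that occupy 500 GB before de-duplication occupy 100 GB after de-duplication, the de-duplication ratio is 500/100 = 5:1 (the figures are only illustrative).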
When the data in a file system are processed by the data de-duplication feature, the number of duplicate data segments can differ greatly with the nature of the data in the data set; it usually depends on the type of the data files and on the applications that created them. Analyzing concrete application scenarios helps us understand the effect and value of the de-duplication feature in those scenarios.
In some application scenarios, for example a set of backup images of a database, the advantage of de-duplication when writing data into the file system is often very obvious, because each new write operation only writes the new data segments introduced by that operation; in traditional database backup applications the data segments that differ between two backups often amount to only 1%-2%, although higher change rates also occur. In such application scenarios the de-duplication ratio is high and the benefit is evident.
On the contrary, in other application scenarios, for example a material library holding tens of thousands of photos, the effect that de-duplication can achieve is unsatisfactory, because the number of duplicate data segments that can be found between different photos is very limited. This ultimately shows up as a poor de-duplication ratio.
Therefore, when applying data de-duplication technology, the concrete application scenario needs to be analyzed concretely. We recommend enabling the data de-duplication function in two kinds of application scenarios: the first is the backup scenario, in which the repetition rate of the data is high and the de-duplication effect is quite obvious; the second is the virtual machine scenario, in which the storage system holds a large number of virtual machine files and copies of these files, the data repetition rate is high, and the de-duplication effect is obvious.
Data de-duplication uses multiple techniques to optimize performance, so it brings almost no influence on the online service performance of the system.
First, de-duplication can dynamically and intelligently adjust the system resources occupied by the de-duplication operation and preferentially guarantees online service performance.
Second, de-duplication adopts an advanced variable-length file chunking technique, using the efficient sliding-window chunking technique together with the Adler algorithm and the TTTD algorithm, so it is superior to traditional data de-duplication technology in file chunking efficiency.
In addition, de-duplication uses an advanced file similarity detection technique to detect duplicate data. This technique divides the files in the storage system into a plurality of similarity groups through a similarity detection algorithm and performs data comparison within a group, which facilitates the identification of duplicate data and reduces the number of data comparisons. Compared with traditional data de-duplication technology, it achieves a higher de-duplication ratio and better performance.
In traditional data de-duplication products, the data service is provided through a single physical device; in this case both the de-duplication software and the physical device that carries it become single points of failure, which challenges system reliability.
The present invention combines data de-duplication with Active-Active clustering to provide system-level reliability. In a multi-node cluster environment, as long as any node in the system is still running normally, both de-duplication and the writing of de-duplicated data can proceed smoothly, which guarantees the continuity of customer services.
By eliminating the redundant data in the data space, the data de-duplication function of the present invention lets users benefit from storage space efficiency. This is directly reflected in a lower initial storage purchase cost; the de-duplication function effectively controls data growth and also postpones subsequent storage expansion needs. In addition, the lower storage capacity requirement frees the user from maintaining a large number of storage devices, which reduces operation and maintenance costs such as floor space, electric power, cooling and maintenance management, and minimizes the total cost of ownership (TCO).
When the data de-duplication technology of the present invention is applied to the backup scenario, the de-duplication effect is very obvious. In this application scenario, a backup server backs up user data into a NAS storage space with full and incremental backups under a certain backup policy, and the data repetition rate is high.
The data de-duplication of the present invention also has a clear advantage in the virtual machine scenario. In this application, the user stores a large number of virtual machine files on the storage device; these files usually contain the same operating system, which means a large amount of duplicate data. De-duplication is optimized for virtual machine files and can efficiently identify the duplicate data in this class of files.
Data de-duplication is a policy-based post-processing de-duplication technique. The user can define the period of the de-duplication operation and thereby control when it is started. The de-duplication operation runs in the background of the system and is completely transparent to the business. In addition, unlike traditional post-processing de-duplication techniques, the de-duplication in the present invention can dynamically adjust its occupation of system resources, preferentially guarantees online service performance, and minimizes the influence on online services.
The present invention has been described in detail above with reference to the accompanying drawings and embodiments; those skilled in the art can make many variations to the present invention in light of the above description. Therefore, certain details of the embodiments shall not be construed as limiting the present invention, and the scope defined by the appended claims shall be the scope of protection of the present invention.

Claims (9)

1. A data de-duplication method, characterized in that it comprises the following steps:
writing a file, performing variable-length chunking on said file to form a plurality of data blocks of different lengths, and calculating the hash values of said data blocks;
sampling said hash values to form the sampling data of said file;
comparing the sampling data of said file with the sampling data of existing files to locate a similarity group of said file;
comparing the hash values of said file with the hash values of said similarity group in a metadata database to determine the duplicate data blocks;
de-duplicating and saving the non-duplicate data blocks;
generating a meta file, and saving the hash values of said non-duplicate data blocks into said metadata database.
2. The data de-duplication method according to claim 1, characterized in that said variable-length chunking adopts a sliding-window technique and cuts the data according to the file content.
3. The data de-duplication method according to claim 2, characterized in that, when calculating the hash value of said data block, the hash value of the window contents after said sliding window moves is computed from the hash value before the move, the byte value that slides into the window and the byte value that slides out.
4. The data de-duplication method according to claim 1, characterized in that, when calculating the hash value of said data block, a minimum data block size is set and hash calculation is not performed on the data within the minimum-size interval at the head of said data block.
5. The data de-duplication method according to claim 1, characterized in that, when comparing the sampling data of said file with the sampling data of existing files, if the similarity between the sampling data of said file and the sampling data of a current existing file exceeds a certain value, the data set corresponding to the sampling data of the current existing file is determined to be a similarity group of said file.
6. The data de-duplication method according to claim 1, characterized in that said data blocks are saved grouped by similarity group.
7. The data de-duplication method according to claim 1, characterized in that said meta file is a data description of the original file, contains the various file attributes of the original file, and records the storage location of each data block of the original file.
8. The data de-duplication method according to claim 1, characterized in that, when the system receives a read/write request, the method further comprises the following steps:
judging whether the target file is a file that has undergone the de-duplication operation;
if said target file has not undergone the de-duplication operation, reading or writing said target file directly;
if said target file has undergone the de-duplication operation, parsing the meta file of said target file to locate the target data blocks of the read/write request;
completing the read/write operation.
9. The data de-duplication method according to claim 1, characterized in that the period of said de-duplication step is adjustable.
CN201110330421A 2011-10-27 2011-10-27 Data de-duplication method Pending CN102323958A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110330421A CN102323958A (en) 2011-10-27 2011-10-27 Data de-duplication method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110330421A CN102323958A (en) 2011-10-27 2011-10-27 Data de-duplication method

Publications (1)

Publication Number Publication Date
CN102323958A true CN102323958A (en) 2012-01-18

Family

ID=45451701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110330421A Pending CN102323958A (en) 2011-10-27 2011-10-27 Data de-duplication method

Country Status (1)

Country Link
CN (1) CN102323958A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101882141A (en) * 2009-05-08 2010-11-10 北京众志和达信息技术有限公司 Method and system for implementing repeated data deletion
CN101788976A (en) * 2010-02-10 2010-07-28 北京播思软件技术有限公司 File splitting method based on contents
CN102222085A (en) * 2011-05-17 2011-10-19 华中科技大学 Data de-duplication method based on combination of similarity and locality

Cited By (63)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377285A (en) * 2012-04-25 2013-10-30 国际商业机器公司 Enhanced reliability in deduplication technology over storage clouds
US11079953B2 (en) 2012-06-19 2021-08-03 International Business Machines Corporation Packing deduplicated data into finite-sized containers
US9880771B2 (en) 2012-06-19 2018-01-30 International Business Machines Corporation Packing deduplicated data into finite-sized containers
CN103514247A (en) * 2012-06-19 2014-01-15 国际商业机器公司 Method and system for packing deduplicated data into finite-sized container
CN103685359A (en) * 2012-09-06 2014-03-26 中兴通讯股份有限公司 Data processing method and device
CN103685359B (en) * 2012-09-06 2018-04-10 中兴通讯股份有限公司 Data processing method and device
CN103678473B (en) * 2012-09-24 2017-04-12 国际商业机器公司 Method and system for filing and recycling efficient file in deduplicating virtual media
CN103678473A (en) * 2012-09-24 2014-03-26 国际商业机器公司 Method and system for filing and recycling efficient file in deduplicating virtual media
US9678672B2 (en) 2012-09-24 2017-06-13 International Business Machines Corporation Selective erasure of expired files or extents in deduplicating virutal media for efficient file reclamation
CN102999433A (en) * 2012-11-21 2013-03-27 北京航空航天大学 Redundant data deletion method and system of virtual disks
CN102999433B (en) * 2012-11-21 2015-06-17 北京航空航天大学 Redundant data deletion method and system of virtual disks
CN106445413A (en) * 2012-12-12 2017-02-22 华为技术有限公司 Processing method and device for data in trunk system
CN106445413B (en) * 2012-12-12 2019-10-25 华为技术有限公司 Data processing method and device in group system
CN103020255A (en) * 2012-12-21 2013-04-03 华为技术有限公司 Hierarchical storage method and hierarchical storage device
CN103020255B (en) * 2012-12-21 2016-03-02 华为技术有限公司 Classification storage means and device
CN103455420A (en) * 2013-08-16 2013-12-18 华为技术有限公司 Test data construction method and equipment
CN103455420B (en) * 2013-08-16 2016-06-15 华为技术有限公司 A kind of building method testing data and equipment
CN104572679B (en) * 2013-10-16 2017-11-03 北大方正集团有限公司 Public sentiment data storage method and device
CN104572679A (en) * 2013-10-16 2015-04-29 北大方正集团有限公司 Public opinion data storage method and device
CN103559143A (en) * 2013-11-08 2014-02-05 华为技术有限公司 Data copying management device and data copying method of data copying management device
CN103810297B (en) * 2014-03-07 2017-02-01 华为技术有限公司 Writing method, reading method, writing device and reading device on basis of re-deleting technology
CN104156420B (en) * 2014-08-06 2017-10-03 曙光信息产业(北京)有限公司 The management method and device of transaction journal
CN104156420A (en) * 2014-08-06 2014-11-19 曙光信息产业(北京)有限公司 Method and device for managing transaction journal
CN104331525B (en) * 2014-12-01 2018-01-16 国家计算机网络与信息安全管理中心 Sharing method based on data de-duplication
CN104331525A (en) * 2014-12-01 2015-02-04 国家计算机网络与信息安全管理中心 Sharing method based on repeating data deletion
CN104408141B (en) * 2014-12-01 2018-04-17 国家计算机网络与信息安全管理中心 One kind disappears superfluous file system and its data deployment method
CN104408141A (en) * 2014-12-01 2015-03-11 国家计算机网络与信息安全管理中心 Redundancy removal file system and data deployment method thereof
US11243915B2 (en) 2014-12-10 2022-02-08 International Business Machines Corporation Method and apparatus for data deduplication
US10089321B2 (en) 2014-12-10 2018-10-02 International Business Machines Corporation Method and apparatus for data deduplication
CN105740266A (en) * 2014-12-10 2016-07-06 国际商业机器公司 Data deduplication method and device
CN106302202A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 Data current-limiting method and device
CN105183399A (en) * 2015-09-30 2015-12-23 北京奇艺世纪科技有限公司 Data writing and reading method and device based on elastic block storage
CN106886367A (en) * 2015-11-04 2017-06-23 Hgst荷兰公司 For the duplicate removal in memory management reference block to reference set polymerization
CN105511812A (en) * 2015-12-10 2016-04-20 浪潮(北京)电子信息产业有限公司 Method and device for optimizing big data of memory system
CN105511812B (en) * 2015-12-10 2018-12-18 浪潮(北京)电子信息产业有限公司 A kind of storage system big data optimization method and device
CN108431815B (en) * 2016-01-12 2022-10-11 国际商业机器公司 Deduplication of distributed data in a processor grid
CN108431815A (en) * 2016-01-12 2018-08-21 国际商业机器公司 The duplicate removal complex data of distributed data in processor grid
CN105786655A (en) * 2016-03-08 2016-07-20 成都云祺科技有限公司 Repeated data deleting method for virtual machine backup data
CN105912622A (en) * 2016-04-05 2016-08-31 重庆大学 Data de-duplication method for lossless compressed files
CN106708927B (en) * 2016-11-18 2021-01-05 北京二六三企业通信有限公司 File deduplication processing method and device
CN106708927A (en) * 2016-11-18 2017-05-24 北京二六三企业通信有限公司 Duplicate removal processing method and duplicate removal processing device for files
CN106610794A (en) * 2016-11-21 2017-05-03 深圳市深信服电子科技有限公司 Convergence blocking method and device for data deduplication
CN106610794B (en) * 2016-11-21 2020-05-15 深信服科技股份有限公司 Convergence blocking method and device for data deduplication
CN108241615A (en) * 2016-12-23 2018-07-03 中国电信股份有限公司 Data duplicate removal method and device
CN106980680A (en) * 2017-03-30 2017-07-25 联想(北京)有限公司 Date storage method and storage device
CN106980680B (en) * 2017-03-30 2020-11-20 联想(北京)有限公司 Data storage method and storage device
CN108563649B (en) * 2017-12-12 2021-12-07 南京富士通南大软件技术有限公司 Offline duplicate removal method based on GlusterFS distributed file system
CN108563649A (en) * 2017-12-12 2018-09-21 南京富士通南大软件技术有限公司 Offline De-weight method based on GlusterFS distributed file systems
CN107944041B (en) * 2017-12-14 2021-11-09 成都雅骏新能源汽车科技股份有限公司 Storage structure optimization method of HDFS (Hadoop distributed File System)
CN107944041A (en) * 2017-12-14 2018-04-20 成都雅骏新能源汽车科技股份有限公司 A kind of storage organization optimization method of HDFS
CN110083743A (en) * 2019-03-28 2019-08-02 哈尔滨工业大学(深圳) A kind of quick set of metadata of similar data detection method based on uniform sampling
CN110083743B (en) * 2019-03-28 2021-11-16 哈尔滨工业大学(深圳) Rapid similar data detection method based on unified sampling
CN113472609A (en) * 2020-05-25 2021-10-01 汪永强 Data repeated transmission marking system for wireless communication
CN113472609B (en) * 2020-05-25 2024-03-19 汪永强 Data repeated sending marking system for wireless communication
WO2022001548A1 (en) * 2020-06-30 2022-01-06 华为技术有限公司 Data transmission method, system, apparatus, device, and medium
CN111880743A (en) * 2020-07-29 2020-11-03 北京浪潮数据技术有限公司 Data storage method, device, equipment and storage medium
CN112380197A (en) * 2020-10-29 2021-02-19 中科热备(北京)云计算技术有限公司 Method for deleting repeated data based on front end
CN113672170A (en) * 2021-07-23 2021-11-19 复旦大学附属肿瘤医院 Redundant data marking and removing method
CN114138552A (en) * 2021-11-11 2022-03-04 苏州浪潮智能科技有限公司 Data dynamic deduplication method, system, terminal and storage medium
CN114138552B (en) * 2021-11-11 2024-01-12 苏州浪潮智能科技有限公司 Data dynamic repeating and deleting method, system, terminal and storage medium
CN115016330A (en) * 2022-08-10 2022-09-06 深圳市虎一科技有限公司 Automatic menu and intelligent kitchen power matching method and system
CN117369731A (en) * 2023-12-07 2024-01-09 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium
CN117369731B (en) * 2023-12-07 2024-02-27 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN102323958A (en) Data de-duplication method
US11288235B2 (en) Synchronized data deduplication
US10031675B1 (en) Method and system for tiering data
CN102222085B (en) Data de-duplication method based on combination of similarity and locality
CN101989929B (en) Disaster recovery data backup method and system
US8983952B1 (en) System and method for partitioning backup data streams in a deduplication based storage system
US10303797B1 (en) Clustering files in deduplication systems
CN101777017B (en) Rapid recovery method of continuous data protection system
US8712963B1 (en) Method and apparatus for content-aware resizing of data chunks for replication
CN101963982B (en) Method for managing metadata of redundancy deletion and storage system based on location sensitive Hash
CN102831222A (en) Differential compression method based on data de-duplication
CN109445702B (en) block-level data deduplication storage system
CN104932956A (en) Big-data-oriented cloud disaster tolerant backup method
CN104932841A (en) Saving type duplicated data deleting method in cloud storage system
US9183218B1 (en) Method and system to improve deduplication of structured datasets using hybrid chunking and block header removal
CN101866358A (en) Multidimensional interval querying method and system thereof
US10838923B1 (en) Poor deduplication identification
CN107885619A (en) A kind of data compaction duplicate removal and the method and system of mirror image remote backup protection
CN105630810A (en) Method for uploading mass small files in distributed storage system
CN113672170A (en) Redundant data marking and removing method
CN105095027A (en) Data backup method and apparatus
CN109445703B (en) A kind of Delta compression storage assembly based on block grade data deduplication
CN102722450B (en) Storage method for redundancy deletion block device based on location-sensitive hash
CN103049508A (en) Method and device for processing data
CN114442937B (en) File caching method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120118