CN102323958A - Data de-duplication method - Google Patents
- Publication number
- CN102323958A (application CN201110330421A)
- Authority
- CN
- China
- Prior art keywords
- data
- file
- duplication
- cryptographic hash
- duplication method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a data de-duplication method, which comprises the following steps: writing a file, dividing the file into a number of variable-length data blocks, and calculating a hash value for each data block; sampling the hash values to form the sample data of the file; locating a similarity group for the file by comparing the file's sample data with the sample data of existing files; determining the duplicate data blocks by comparing the file's hash values with the hash values of that similarity group in a metadata database; de-duplicating and storing only the non-duplicate data blocks; and generating a metadata file and saving the hash values of the non-duplicate data blocks into the metadata database. With this method, the system resources consumed by the de-duplication job can be adjusted dynamically, the performance of online services is guaranteed preferentially, and the impact on online services is minimized. The method is characterized by high reliability, good stability, and a high de-duplication ratio.
Description
Technical field
The present invention relates to data reduction methods, and in particular to a data de-duplication method.
Background technology
Data de-duplication (de-duplication) is a data reduction technology intended to reduce the storage capacity used in a storage system. It eliminates redundancy by deleting repeated data in the storage system and keeping only one copy, which can greatly reduce the consumption of physical storage.
According to when the data is processed, data de-duplication can be divided into an in-line mode (In-Line) and a post-processing mode (Post-Process).
In-line de-duplication is performed before the data is written to disk. It reduces the data volume to some extent, but the de-duplication work itself lowers data throughput and thus degrades service performance. In addition, because de-duplication is performed before the data reaches disk, the de-duplication process itself is a single point of failure.
Post-process de-duplication is performed after the data has been written to disk: data is first written to a temporary disk area, de-duplication then runs, and the de-duplicated data is finally written to disk. Since de-duplication runs on a separate storage device after the write, it generally has little effect on normal service processing. However, existing post-processing implementations cannot dynamically adjust their consumption of system resources and have no mechanism to guarantee online service performance preferentially, so online services can still be affected when resource occupancy is too high.
According to the granularity of de-duplication, the technology can be divided into file-level, block-level, and byte-level de-duplication.
File-level de-duplication detects and deletes duplicate data with the file as the unit. Its advantage is a simple, fast algorithm; its disadvantage is a low de-duplication ratio.
Block-level de-duplication divides a file into data blocks in various ways and detects duplicates with the data block as the unit. Its advantages are fast computation and sensitivity to data changes.
According to the chunking method, block-level de-duplication is further divided into fixed-length and variable-length chunking.
Referring to Fig. 3, fixed-length chunking divides a file into blocks of equal length. This method is very sensitive to insertions and deletions of data: in practice the detected duplication rate is very low and the de-duplication effect is very limited.
Byte-level de-duplication searches for and removes repeated content at byte granularity, generally generating only the differing parts through a differential compression strategy. Its advantage is a higher de-duplication ratio; its disadvantage is lower de-duplication speed.
In addition, traditional de-duplication methods provide data services through a single physical device; when de-duplication runs, that device becomes a single point of failure, which challenges system reliability.
Summary of the invention
The objective of the invention is to overcome the defects of the prior art and provide a data de-duplication method that can dynamically adjust the system resources occupied by the de-duplication job, guarantee online service performance preferentially, and minimize the impact on online services, with high reliability, good stability, a high de-duplication ratio, and excellent performance.
The technical solution that achieves the above objective is:
The data de-duplication method of the present invention comprises:
writing a file, performing variable-length chunking on said file to form a number of data blocks of different lengths, and calculating the hash value of each said data block;
sampling said hash values to form the sample data of said file;
locating a similarity group of said file by comparing the sample data of said file with the sample data of existing files;
determining the duplicate data blocks by comparing the hash values of said file with the hash values of said similarity group in a metadata database;
de-duplicating, and saving the non-duplicate data blocks;
generating a metadata file, and saving the hash values of said non-duplicate data blocks into said metadata database.
The above variable-length chunking adopts a sliding-window technique that cuts the data according to the file content. This technique is insensitive to changes in file content: inserting or deleting data affects only a few data blocks, while the remaining blocks are unaffected.
When calculating the hash value of a data block, the hash value of the window after it slides is computed incrementally from the hash value before the slide, the byte value that enters the window, and the byte value that leaves it, which improves the operating efficiency of the de-duplication job.
When calculating the hash values, a minimum data block size is enforced and no hash computation is performed on the data within the minimum-size interval at the head of each block, which reduces computation cost and further improves the efficiency of the de-duplication job.
When comparing the sample data of the file with the sample data of existing files, if the similarity between the sample data of the file and the sample data of a current existing file exceeds a certain value, the data set corresponding to the sample data of that existing file is determined to be a similarity group of the file.
The data blocks are stored grouped by similarity group.
The metadata file is a data description of the original file: it contains the file attributes of the original file and records the storage location of each of its data blocks.
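As a sketch, such a metadata file might carry fields like the following. All field and type names here are illustrative assumptions, since the patent only specifies "file attributes" and "the storage location of each data block":

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BlockRef:
    """Location of one stored data block (fields are illustrative)."""
    hash_hex: str      # hash value of the block, key into the metadata database
    store_path: str    # where the single shared copy of the block lives
    offset: int        # byte offset of the block within the original file
    length: int        # block length (variable, per the chunking above)

@dataclass
class MetaFile:
    """Stand-in for the per-file metadata file: describes the original
    file's attributes and records where each of its blocks is stored."""
    name: str
    size: int
    mode: int
    mtime: float
    blocks: List[BlockRef] = field(default_factory=list)

    def total_mapped(self) -> int:
        """Sanity check: the block map should cover the whole file."""
        return sum(b.length for b in self.blocks)
```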
When the system receives a read/write request, the method further comprises the following steps:
judging whether the target file has undergone a de-duplication operation;
if said target file has not been de-duplicated, reading and writing said target file directly;
if said target file has been de-duplicated, parsing the metadata file of said target file to locate the target data blocks of the read/write request;
completing the read/write operation.
The period of the above de-duplication step is adjustable.
By adopting the above technical solution, the present invention has the following beneficial effects:
The de-duplication in the present invention is a policy-based post-processing technique. The user can define the period of the de-duplication job and control when it starts. The job runs in the background of the system, completely transparent to the business; its consumption of system resources can be adjusted dynamically, online service performance is guaranteed preferentially, and the impact on online services is minimized. The sliding-window technique is insensitive to content changes: inserting or deleting data affects only a few data blocks, and the rest are untouched. When reading or writing a de-duplicated file, the whole file need not be reassembled; only the data blocks affected by the operation are located, so the data manipulation stays within a small range. These measures guarantee, to the greatest extent, the read/write performance of a system with de-duplication enabled. The method thus dynamically adjusts the resources occupied by the de-duplication job, preferentially guarantees online service performance, minimizes the impact on online services, and is characterized by high reliability, good stability, a high de-duplication ratio, and excellent performance.
Description of drawings
Fig. 1 is a flowchart of the data de-duplication method of the present invention;
Fig. 2 is a schematic diagram of the variable-length chunking technique of the method;
Fig. 3 is a schematic diagram of the fixed-length chunking technique of the prior art;
Fig. 4 is a schematic diagram of the similarity detection technique.
Embodiment
The present invention is further described below with reference to the accompanying drawings and a specific embodiment.
Referring to Fig. 1, the data de-duplication method of the present invention comprises the following steps:
First, a file is written; variable-length chunking is performed on the file to form a number of data blocks of different lengths, and the hash value of each data block is calculated.
The hash values are sampled to form the sample data of the file.
A similarity group of the file is located by comparing the sample data of the file with the sample data of existing files: if the similarity between the sample data of the file and the sample data of a current existing file exceeds a certain value, the data set corresponding to that existing file's sample data is determined to be a similarity group of the file.
The duplicate data blocks are determined by comparing the hash values of the file with the hash values of the similarity group in the metadata database.
The file is de-duplicated and the non-duplicate data blocks are saved.
A metadata file is generated, and the hash values of the non-duplicate data blocks are saved into the metadata database.
The metadata file is a data description of the original file: it contains the file attributes of the original file and records the storage location of each of its data blocks.
The data blocks are stored grouped by similarity group.
When the system receives a read/write request, the method further comprises the following steps:
judging whether the target file has undergone a de-duplication operation;
if the target file has not been de-duplicated, reading and writing the target file directly;
if the target file has been de-duplicated, parsing the metadata file of the target file to locate the target data blocks of the read/write request;
completing the read/write operation.
The period of the de-duplication step is adjustable.
Referring to Fig. 2, the variable-length chunking adopts a sliding-window technique that cuts the data according to the file content. This technique is insensitive to content changes: inserting or deleting data affects only a few data blocks, while the remaining blocks are unaffected.
When calculating the hash value of a data block, the hash value of the window after it slides is computed incrementally from the hash value before the slide, the entering byte value, and the leaving byte value, which improves the operating efficiency of the de-duplication job.
The Two Thresholds, Two Divisors (TTTD) algorithm is adopted to further optimize de-duplication performance: when calculating the hash values, a minimum data block size is enforced and no hash computation is performed on the data within the minimum-size interval at the head of each block, reducing computation cost and improving the efficiency of the de-duplication job.
The method samples the hash values (HASH) of the file's data blocks through a certain algorithm and compares the sample values against the hash values of existing data blocks in the current system to determine file similarity. According to their similarity, de-duplicated files are divided into different similarity groups. Within each group, the sampled hashes of its files constitute the similarity index of the group. The block hashes of all files in a group are kept in the group's metadata database, for comparison when a new file is written.
When a file is to be de-duplicated, its sampled hash values are first compared with the similarity index of each similarity group; if the similarity with a certain group exceeds a certain value, the file is determined to belong to that group. The hash value of each of the file's blocks is then compared with the hash values in that group's metadata database; the non-duplicate blocks are stored, and the corresponding metadata is refreshed.
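The group lookup and comparison just described can be sketched as follows, with hashes modeled as integers. The sampling rule (the patent only says "a certain algorithm") and the Jaccard-style similarity threshold are assumptions:

```python
def sample_hashes(block_hashes, rate=8):
    """Sample a file's block hashes; here we keep hashes divisible by
    `rate` (an assumed sampling rule, not the patented one)."""
    return {h for h in block_hashes if h % rate == 0}

def find_similarity_group(sample, groups, threshold=0.5):
    """Return the first group whose similarity index overlaps the file's
    sample beyond `threshold` (Jaccard overlap; threshold illustrative)."""
    for name, index in groups.items():
        union = sample | index
        if union and len(sample & index) / len(union) >= threshold:
            return name
    return None

def dedup_file(block_hashes, groups, metadb):
    """Locate the file's similarity group, then compare full block hashes
    against only that group's metadata database; store the unique blocks
    and refresh the group's index and metadata."""
    sample = sample_hashes(block_hashes)
    g = find_similarity_group(sample, groups)
    known = metadb.get(g, set())
    unique = [h for h in block_hashes if h not in known]
    metadb.setdefault(g, set()).update(unique)   # refresh metadata
    groups.setdefault(g, set()).update(sample)   # refresh similarity index
    return g, unique
```

Comparing within a single group, rather than against every block hash in the system, is what reduces the number of data queries claimed below.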
This technique reduces the number of data queries needed to identify duplicate data and, compared with traditional in-line de-duplication techniques, greatly improves de-duplication performance.
The de-duplication function modules are deployed on every node of the cluster, so the failure of any node, or of several nodes, affects neither the de-duplication jobs nor user write operations.
The de-duplication job's consumption of system resources can be adjusted dynamically and intelligently, guaranteeing online service performance preferentially. Variable-length file chunking is adopted, using an efficient sliding-window technique together with the Adler and TTTD algorithms, so the chunking efficiency is superior to traditional de-duplication technology. In addition, an advanced file-similarity detection technique is used to detect duplicate data: the files in the storage system are divided into a number of similarity groups by a similarity detection algorithm, and data comparison is performed within a group, which facilitates the identification of duplicate data and reduces the number of comparisons.
Referring to Fig. 4, the de-duplication of the present invention uses a patented file-similarity detection technique to identify duplicate data. The technique samples the hash values of the file's data blocks through a certain algorithm and compares the sampled values against the hash values of existing data blocks in the current system to determine file similarity. De-duplicated files are divided into similarity groups accordingly; within each group, the sampled hashes of its files constitute the group's similarity index, and the block hashes of all files in the group are kept in the group's metadata database for comparison when new files are written.
When a file is to be de-duplicated, its sampled hash values are first compared with each group's similarity index; if the similarity with a certain group exceeds a certain value, the file is determined to belong to that group. The hash value of each of its blocks is then compared with the hash values in that group's metadata database, the unique blocks are stored, and the corresponding metadata is refreshed.
This reduces the number of data queries needed to identify duplicate data and, compared with traditional in-line de-duplication techniques, greatly improves de-duplication performance.
The effect of data de-duplication is usually measured by the data de-duplication ratio (de-duplication ratio for short). Let the total data volume before de-duplication denote the space required to store the data in a traditional storage system, and the total data volume after de-duplication denote the space required to store the same data in a system with de-duplication; the ratio between these two values is the de-duplication ratio.
De-duplication ratio = total data volume before de-duplication ÷ total data volume after de-duplication.
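The ratio defined above, in code; the 500 GB / 100 GB figures in the comments are illustrative examples, not values from the patent:

```python
def dedup_ratio(bytes_before: int, bytes_after: int) -> float:
    """De-duplication ratio = data volume before / data volume after.
    E.g. 500 GB of backups stored in 100 GB gives a 5:1 ratio."""
    if bytes_after == 0:
        raise ValueError("no data stored after de-duplication")
    return bytes_before / bytes_after

def space_savings(bytes_before: int, bytes_after: int) -> float:
    """Equivalent 'savings' form: 1 - after/before (e.g. 5:1 => 80%)."""
    return 1.0 - bytes_after / bytes_before
```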
When the data in a file system is processed by the de-duplication feature, the amount of duplicate data can vary greatly with the nature of the data set; it usually depends on the type of the data files and on the applications that create them. Analyzing concrete application scenarios helps to understand the effect and value of the de-duplication feature in those scenarios.
In some scenarios, for example a set of backup images of a database, the advantage of de-duplication when writing into the file system is often very obvious: each new write stores only the new data segments that the operation introduces, and in traditional database backup applications the data segments differing between two backups often amount to only 1%-2%, although higher change rates do occur. In such scenarios, the high de-duplication ratio makes the investment in de-duplication well worthwhile.
Conversely, in other scenarios, for example a material library holding tens of thousands of photos, the effect of de-duplication is unsatisfactory: the amount of duplicate data that can be found between different photos is very limited, which ultimately shows up as a poor de-duplication ratio.
Therefore, concrete application scenarios need concrete analysis when de-duplication technology is applied. We recommend enabling the de-duplication function in two scenarios: first, backup scenarios, where the data repetition rate is high and the de-duplication effect is obvious; second, virtual machine scenarios, where the storage system holds a large number of virtual machine files and copies of those files, so the repetition rate is high and the effect is likewise obvious.
The de-duplication uses multiple techniques to optimize performance and has almost no impact on the system's online service performance.
First, the de-duplication job's consumption of system resources can be adjusted dynamically and intelligently, guaranteeing online service performance preferentially.
Second, an advanced variable-length file chunking technique is adopted, using an efficient sliding-window technique together with the Adler and TTTD algorithms, so the chunking efficiency is superior to traditional de-duplication technology.
In addition, an advanced file-similarity detection technique is used to detect duplicate data: the files in the storage system are divided into a number of similarity groups by a similarity detection algorithm, and data comparison is performed within a group, which facilitates the identification of duplicate data and reduces the number of comparisons. Compared with traditional de-duplication technology, the de-duplication ratio is higher and the performance better.
In traditional de-duplication products, data services are provided through a single physical device; both the de-duplication software and the device carrying it then become single points of failure, which challenges system reliability.
The present invention combines de-duplication with Active-Active clustering to provide system-level reliability: in a multi-node cluster environment, as long as any node in the system is still running normally, both de-duplication and writes of de-duplicated data proceed smoothly, guaranteeing the continuity of customer services.
By eliminating redundant data in the data space, the de-duplication function of the present invention lets users benefit from storage space efficiency. This directly translates into lower initial storage purchase costs; by effectively controlling data growth, it also postpones subsequent capacity-expansion demands. In addition, the reduced storage requirement frees the user from managing a large number of storage devices, lowering operation and maintenance costs for floor space, power, cooling, and administration, and thereby minimizing the total cost of ownership.
When the de-duplication technology of the present invention is applied to backup scenarios, the de-duplication effect is very obvious. In this scenario, the backup server backs up user data into the NAS storage space under a certain backup policy of full and incremental backups, so the data repetition rate is high.
The de-duplication of the present invention also has a large advantage in virtual machine scenarios. In this application, the user keeps a large number of virtual machine files on the storage device, and these files usually contain the same operating system, which means a large amount of duplicate data. The de-duplication is optimized for virtual machine files and can efficiently identify the duplicate data in this class of files.
The de-duplication is a policy-based post-processing technique: the user defines the period of the de-duplication job and controls when it starts, and the job runs in the background of the system, completely transparent to the business. Moreover, unlike traditional post-processing de-duplication techniques, the de-duplication in the present invention can dynamically adjust its consumption of system resources, guarantee online service performance preferentially, and minimize the impact on online services.
The present invention has been described in detail above with reference to the accompanying drawings and embodiments, and those skilled in the art can make many variations based on the above description. Therefore, certain details of the embodiments should not be construed as limiting the invention; the scope of protection of the present invention is defined by the appended claims.
Claims (9)
1. A data de-duplication method, characterized by comprising the following steps:
writing a file, performing variable-length chunking on said file to form a number of data blocks of different lengths, and calculating hash values of said data blocks;
sampling said hash values to form the sample data of said file;
locating a similarity group of said file by comparing the sample data of said file with the sample data of existing files;
determining the duplicate data blocks by comparing the hash values of said file with the hash values of said similarity group in a metadata database;
de-duplicating, and saving the non-duplicate data blocks;
generating a metadata file, and saving the hash values of said non-duplicate data blocks into said metadata database.
2. The data de-duplication method according to claim 1, characterized in that said variable-length chunking adopts a sliding-window technique that cuts the data according to file content.
3. The data de-duplication method according to claim 2, characterized in that, when calculating the hash value of a said data block, the hash value of the window after said sliding window slides is calculated from the hash value before the slide, the byte value that slides in, and the byte value that slides out.
4. The data de-duplication method according to claim 1, characterized in that, when calculating the hash values of said data blocks, a minimum value of the data block size is set, and no hash computation is performed on the data within the minimum-size interval at the head of said data block.
5. The data de-duplication method according to claim 1, characterized in that, when comparing the sample data of said file with the sample data of existing files, if the similarity between the sample data of said file and the sample data of a current existing file exceeds a certain value, the data set corresponding to the sample data of the current existing file is determined to be a similarity group of said file.
6. The data de-duplication method according to claim 1, characterized in that said data blocks are stored grouped by similarity group.
7. The data de-duplication method according to claim 1, characterized in that said metadata file is a data description of the original file, containing the file attributes of the original file and recording the storage location of each data block of the original file.
8. The data de-duplication method according to claim 1, characterized in that, when the system receives a read/write request, the method further comprises the following steps:
judging whether the target file has undergone a de-duplication operation;
if said target file has not been de-duplicated, reading and writing said target file directly;
if said target file has been de-duplicated, parsing the metadata file of said target file to locate the target data blocks of the read/write request;
completing the read/write operation.
9. The data de-duplication method according to claim 1, characterized in that the period of said de-duplication step is adjustable.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110330421A CN102323958A (en) | 2011-10-27 | 2011-10-27 | Data de-duplication method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102323958A true CN102323958A (en) | 2012-01-18 |
Family
ID=45451701
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110330421A Pending CN102323958A (en) | 2011-10-27 | 2011-10-27 | Data de-duplication method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102323958A (en) |
WO2022001548A1 (en) * | 2020-06-30 | 2022-01-06 | 华为技术有限公司 | Data transmission method, system, apparatus, device, and medium |
CN114138552A (en) * | 2021-11-11 | 2022-03-04 | 苏州浪潮智能科技有限公司 | Data dynamic deduplication method, system, terminal and storage medium |
CN115016330A (en) * | 2022-08-10 | 2022-09-06 | 深圳市虎一科技有限公司 | Automatic menu and intelligent kitchen power matching method and system |
CN117369731A (en) * | 2023-12-07 | 2024-01-09 | 苏州元脑智能科技有限公司 | Data reduction processing method, device, equipment and medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101788976A (en) * | 2010-02-10 | 2010-07-28 | 北京播思软件技术有限公司 | File splitting method based on contents |
CN101882141A (en) * | 2009-05-08 | 2010-11-10 | 北京众志和达信息技术有限公司 | Method and system for implementing repeated data deletion |
CN102222085A (en) * | 2011-05-17 | 2011-10-19 | 华中科技大学 | Data de-duplication method based on combination of similarity and locality |
- 2011-10-27: CN application CN201110330421A, published as patent CN102323958A, legal status: Pending
Cited By (63)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103377285A (en) * | 2012-04-25 | 2013-10-30 | 国际商业机器公司 | Enhanced reliability in deduplication technology over storage clouds |
US11079953B2 (en) | 2012-06-19 | 2021-08-03 | International Business Machines Corporation | Packing deduplicated data into finite-sized containers |
US9880771B2 (en) | 2012-06-19 | 2018-01-30 | International Business Machines Corporation | Packing deduplicated data into finite-sized containers |
CN103514247A (en) * | 2012-06-19 | 2014-01-15 | 国际商业机器公司 | Method and system for packing deduplicated data into finite-sized container |
CN103685359A (en) * | 2012-09-06 | 2014-03-26 | 中兴通讯股份有限公司 | Data processing method and device |
CN103685359B (en) * | 2012-09-06 | 2018-04-10 | 中兴通讯股份有限公司 | Data processing method and device |
CN103678473B (en) * | 2012-09-24 | 2017-04-12 | 国际商业机器公司 | Method and system for filing and recycling efficient file in deduplicating virtual media |
CN103678473A (en) * | 2012-09-24 | 2014-03-26 | 国际商业机器公司 | Method and system for filing and recycling efficient file in deduplicating virtual media |
US9678672B2 (en) | 2012-09-24 | 2017-06-13 | International Business Machines Corporation | Selective erasure of expired files or extents in deduplicating virtual media for efficient file reclamation
CN102999433A (en) * | 2012-11-21 | 2013-03-27 | 北京航空航天大学 | Redundant data deletion method and system of virtual disks |
CN102999433B (en) * | 2012-11-21 | 2015-06-17 | 北京航空航天大学 | Redundant data deletion method and system of virtual disks |
CN106445413A (en) * | 2012-12-12 | 2017-02-22 | 华为技术有限公司 | Method and device for processing data in a cluster system
CN106445413B (en) * | 2012-12-12 | 2019-10-25 | 华为技术有限公司 | Method and device for processing data in a cluster system
CN103020255A (en) * | 2012-12-21 | 2013-04-03 | 华为技术有限公司 | Hierarchical storage method and hierarchical storage device |
CN103020255B (en) * | 2012-12-21 | 2016-03-02 | 华为技术有限公司 | Hierarchical storage method and device
CN103455420A (en) * | 2013-08-16 | 2013-12-18 | 华为技术有限公司 | Test data construction method and equipment |
CN103455420B (en) * | 2013-08-16 | 2016-06-15 | 华为技术有限公司 | Test data construction method and device
CN104572679B (en) * | 2013-10-16 | 2017-11-03 | 北大方正集团有限公司 | Public sentiment data storage method and device |
CN104572679A (en) * | 2013-10-16 | 2015-04-29 | 北大方正集团有限公司 | Public opinion data storage method and device |
CN103559143A (en) * | 2013-11-08 | 2014-02-05 | 华为技术有限公司 | Data copying management device and data copying method of data copying management device |
CN103810297B (en) * | 2014-03-07 | 2017-02-01 | 华为技术有限公司 | Writing method, reading method, writing device and reading device based on deduplication technology
CN104156420B (en) * | 2014-08-06 | 2017-10-03 | 曙光信息产业(北京)有限公司 | Transaction journal management method and device
CN104156420A (en) * | 2014-08-06 | 2014-11-19 | 曙光信息产业(北京)有限公司 | Method and device for managing transaction journal |
CN104331525B (en) * | 2014-12-01 | 2018-01-16 | 国家计算机网络与信息安全管理中心 | Sharing method based on data de-duplication |
CN104331525A (en) * | 2014-12-01 | 2015-02-04 | 国家计算机网络与信息安全管理中心 | Sharing method based on repeating data deletion |
CN104408141B (en) * | 2014-12-01 | 2018-04-17 | 国家计算机网络与信息安全管理中心 | Redundancy removal file system and data deployment method thereof
CN104408141A (en) * | 2014-12-01 | 2015-03-11 | 国家计算机网络与信息安全管理中心 | Redundancy removal file system and data deployment method thereof |
US11243915B2 (en) | 2014-12-10 | 2022-02-08 | International Business Machines Corporation | Method and apparatus for data deduplication |
US10089321B2 (en) | 2014-12-10 | 2018-10-02 | International Business Machines Corporation | Method and apparatus for data deduplication |
CN105740266A (en) * | 2014-12-10 | 2016-07-06 | 国际商业机器公司 | Data deduplication method and device |
CN106302202A (en) * | 2015-05-15 | 2017-01-04 | 阿里巴巴集团控股有限公司 | Data current-limiting method and device |
CN105183399A (en) * | 2015-09-30 | 2015-12-23 | 北京奇艺世纪科技有限公司 | Data writing and reading method and device based on elastic block storage |
CN106886367A (en) * | 2015-11-04 | 2017-06-23 | Hgst荷兰公司 | Reference block aggregation into a reference set for deduplication in memory management
CN105511812A (en) * | 2015-12-10 | 2016-04-20 | 浪潮(北京)电子信息产业有限公司 | Method and device for optimizing big data of memory system |
CN105511812B (en) * | 2015-12-10 | 2018-12-18 | 浪潮(北京)电子信息产业有限公司 | Storage system big data optimization method and device
CN108431815B (en) * | 2016-01-12 | 2022-10-11 | 国际商业机器公司 | Deduplication of distributed data in a processor grid |
CN108431815A (en) * | 2016-01-12 | 2018-08-21 | 国际商业机器公司 | Deduplication of distributed data in a processor grid
CN105786655A (en) * | 2016-03-08 | 2016-07-20 | 成都云祺科技有限公司 | Repeated data deleting method for virtual machine backup data |
CN105912622A (en) * | 2016-04-05 | 2016-08-31 | 重庆大学 | Data de-duplication method for lossless compressed files |
CN106708927B (en) * | 2016-11-18 | 2021-01-05 | 北京二六三企业通信有限公司 | File deduplication processing method and device |
CN106708927A (en) * | 2016-11-18 | 2017-05-24 | 北京二六三企业通信有限公司 | Duplicate removal processing method and duplicate removal processing device for files |
CN106610794A (en) * | 2016-11-21 | 2017-05-03 | 深圳市深信服电子科技有限公司 | Convergence blocking method and device for data deduplication |
CN106610794B (en) * | 2016-11-21 | 2020-05-15 | 深信服科技股份有限公司 | Convergence blocking method and device for data deduplication |
CN108241615A (en) * | 2016-12-23 | 2018-07-03 | 中国电信股份有限公司 | Data duplicate removal method and device |
CN106980680A (en) * | 2017-03-30 | 2017-07-25 | 联想(北京)有限公司 | Date storage method and storage device |
CN106980680B (en) * | 2017-03-30 | 2020-11-20 | 联想(北京)有限公司 | Data storage method and storage device |
CN108563649B (en) * | 2017-12-12 | 2021-12-07 | 南京富士通南大软件技术有限公司 | Offline duplicate removal method based on GlusterFS distributed file system |
CN108563649A (en) * | 2017-12-12 | 2018-09-21 | 南京富士通南大软件技术有限公司 | Offline deduplication method based on the GlusterFS distributed file system
CN107944041B (en) * | 2017-12-14 | 2021-11-09 | 成都雅骏新能源汽车科技股份有限公司 | Storage structure optimization method of HDFS (Hadoop distributed File System) |
CN107944041A (en) * | 2017-12-14 | 2018-04-20 | 成都雅骏新能源汽车科技股份有限公司 | Storage structure optimization method of HDFS
CN110083743A (en) * | 2019-03-28 | 2019-08-02 | 哈尔滨工业大学(深圳) | Fast similar data detection method based on uniform sampling
CN110083743B (en) * | 2019-03-28 | 2021-11-16 | 哈尔滨工业大学(深圳) | Rapid similar data detection method based on uniform sampling
CN113472609A (en) * | 2020-05-25 | 2021-10-01 | 汪永强 | Data repeated transmission marking system for wireless communication |
CN113472609B (en) * | 2020-05-25 | 2024-03-19 | 汪永强 | Data repeated sending marking system for wireless communication |
WO2022001548A1 (en) * | 2020-06-30 | 2022-01-06 | 华为技术有限公司 | Data transmission method, system, apparatus, device, and medium |
CN111880743A (en) * | 2020-07-29 | 2020-11-03 | 北京浪潮数据技术有限公司 | Data storage method, device, equipment and storage medium |
CN112380197A (en) * | 2020-10-29 | 2021-02-19 | 中科热备(北京)云计算技术有限公司 | Method for deleting repeated data based on front end |
CN113672170A (en) * | 2021-07-23 | 2021-11-19 | 复旦大学附属肿瘤医院 | Redundant data marking and removing method |
CN114138552A (en) * | 2021-11-11 | 2022-03-04 | 苏州浪潮智能科技有限公司 | Data dynamic deduplication method, system, terminal and storage medium |
CN114138552B (en) * | 2021-11-11 | 2024-01-12 | 苏州浪潮智能科技有限公司 | Dynamic data deduplication method, system, terminal and storage medium
CN115016330A (en) * | 2022-08-10 | 2022-09-06 | 深圳市虎一科技有限公司 | Automatic menu and intelligent kitchen power matching method and system |
CN117369731A (en) * | 2023-12-07 | 2024-01-09 | 苏州元脑智能科技有限公司 | Data reduction processing method, device, equipment and medium |
CN117369731B (en) * | 2023-12-07 | 2024-02-27 | 苏州元脑智能科技有限公司 | Data reduction processing method, device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102323958A (en) | Data de-duplication method | |
US11288235B2 (en) | Synchronized data deduplication | |
US10031675B1 (en) | Method and system for tiering data | |
CN102222085B (en) | Data de-duplication method based on combination of similarity and locality | |
CN101989929B (en) | Disaster recovery data backup method and system | |
US8983952B1 (en) | System and method for partitioning backup data streams in a deduplication based storage system | |
US10303797B1 (en) | Clustering files in deduplication systems | |
CN101777017B (en) | Rapid recovery method of continuous data protection system | |
US8712963B1 (en) | Method and apparatus for content-aware resizing of data chunks for replication | |
CN101963982B (en) | Method for managing deduplication metadata and storage system based on locality-sensitive hashing |
CN102831222A (en) | Differential compression method based on data de-duplication | |
CN109445702B (en) | block-level data deduplication storage system | |
CN104932956A (en) | Big-data-oriented cloud disaster tolerant backup method | |
CN104932841A (en) | Saving type duplicated data deleting method in cloud storage system | |
US9183218B1 (en) | Method and system to improve deduplication of structured datasets using hybrid chunking and block header removal | |
CN101866358A (en) | Multidimensional interval querying method and system thereof | |
US10838923B1 (en) | Poor deduplication identification | |
CN107885619A (en) | Method and system for data compression deduplication and mirrored remote backup protection |
CN105630810A (en) | Method for uploading mass small files in distributed storage system | |
CN113672170A (en) | Redundant data marking and removing method | |
CN105095027A (en) | Data backup method and apparatus | |
CN109445703B (en) | Delta compression storage component based on block-level data deduplication |
CN102722450B (en) | Storage method for a deduplication block device based on locality-sensitive hashing |
CN103049508A (en) | Method and device for processing data | |
CN114442937B (en) | File caching method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20120118 |