WO2015067128A1 - A stacked deduplication file system - Google Patents

A stacked deduplication file system

Info

Publication number
WO2015067128A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
file system
service module
deduplication
storage
Prior art date
Application number
PCT/CN2014/089303
Other languages
English (en)
French (fr)
Inventor
王恩东
文中领
张立强
孟圣智
Original Assignee
浪潮(北京)电子信息产业有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2013-11-05
Filing date
2014-10-23
Publication date
2015-05-14
Application filed by 浪潮(北京)电子信息产业有限公司
Publication of WO2015067128A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments

Definitions

  • The present invention relates to the field of computer storage, and in particular to a deduplication file system implemented using stacked file system technology.
  • The present invention designs a stacked deduplication file system that provides deduplication on top of existing, mature file systems, fully preserves the performance of the original storage system, and requires almost no data migration.
  • The main purpose of the present invention is to provide a stacked deduplication file system that makes full use of the storage capacity of an existing storage system, requires no hardware upgrade and so minimizes investment, and, through a stacked software design, provides deduplication on top of existing file systems, optimizes the data storage structure, and reduces the space occupied in the storage system.
  • To achieve this purpose, the present invention provides a stacked deduplication file system, the system comprising:
  • A file system service module: for normal data, it uses direct interface conversion to import the data of the underlying file system into this file system; for deduplicated data, it reads the corresponding data attribute identifier and redirects the IO flow, giving transparent, seamless access to data after deduplication;
  • A deduplication service module: it reads the file system log data exported by the file system service module, parses the log content, computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete.
  • The beneficial effects of the invention are as follows. The stacked file system design makes full use of the existing storage system: simply installing the software system described in this patent lets an existing file system support deduplication and save storage space, with no data migration, while preserving the IO performance of the original storage system, so that existing equipment is reused and the investment is protected.
  • FIG. 1 is a schematic diagram of the architecture of the stacked deduplication file system according to the present invention.
  • The stacked deduplication file system of the present invention mainly includes a file system service module and a deduplication service module.
  • The file system service module implements a file system with full POSIX support. It follows the stacked file system design strategy, fully implementing the services of the underlying file system through mapping and rewriting at the file system interface layer. For normal data, this module uses direct interface conversion to import the data of the underlying file system into this file system, giving seamless access to normal data. For deduplicated data, the module reads the corresponding data attribute identifier according to the conventions of the file system described in the present invention and redirects the IO flow, giving transparent, seamless access to data after deduplication.
  • The deduplication service module runs independently, out of band. It uses a multi-threaded design that exploits the parallel computing capability of multi-core systems to provide high-speed deduplication.
  • This module reads the file system log data exported by the file system service module, parses the log content, computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete.
  • This module can run at the same time as the file system service module. Fine-grained locks designed into the file system service module guarantee the atomicity of data processing and provide reliable parallel data processing.
  • In a typical configuration, the file system service module and the deduplication service module can be installed on the host system as ordinary application software. Once the software has been configured, both modules can be started. At that point the file system described in the present invention can be mounted on the host and data can be accessed. After a period of file system IO has completed, the deduplication service module automatically computes data signatures, detects and deletes duplicate data according to the configuration parameters, and marks the data once deduplication is complete.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)

Abstract

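A stacked deduplication file system is proposed. It comprises a file system service module which, for normal data, imports the data of the underlying file system into this file system by direct interface conversion and, for deduplicated data, reads the corresponding data attribute identifier and redirects the IO flow, giving transparent, seamless access to data after deduplication; and a deduplication service module which reads the file system log data exported by the file system service module, parses the log content, computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete. The system makes full use of the storage capacity of an existing storage system, requires no hardware upgrade and so minimizes investment, and, through a stacked software design, provides deduplication on top of existing file systems, optimizes the data storage structure, and reduces the space occupied in the storage system.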

Description

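A stacked deduplication file system
Technical Field
The present invention relates to the field of computer storage, and in particular to a deduplication file system implemented using stacked file system technology.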
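Background Art
In large storage systems, the conflict between rapid data growth and the comparatively slow pace of storage device upgrades is acute. To ease the space growth problem of storage systems, reduce the space occupied by data, lower costs, and make maximum use of existing resources, deduplication has become an indispensable key technology in large systems.
With deduplication, users obtain significant data reduction, which greatly lowers the bandwidth requirements of the storage system and reduces operating and maintenance costs. Data reduction greatly shrinks the actual back-end storage capacity, which simplifies storage management and effectively lowers management costs.
However, most popular deduplication solutions today target near-line storage and backup storage and are often tightly coupled to a backup system, so they cannot provide general-purpose file system services. Few products provide deduplication directly in an online system, and all of them require a proprietary file system format. These proprietary file systems often have many limitations in performance, functionality, reliability, and scalability, which makes them difficult to apply directly in large online storage systems.
Existing large storage systems are usually built on mature file systems such as ext3, ext4, xfs, and Lustre. These file systems do not themselves provide deduplication. Using deduplication means adopting a proprietary file system, tolerating a clearly perceptible performance loss, and carrying out a large-scale data migration, which incurs very high time and space costs. For storage systems that already hold large amounts of data, this is essentially infeasible because the cost is too high.
In view of this situation, the present invention designs a stacked deduplication file system that provides deduplication on top of existing, mature file systems, fully preserves the performance of the original storage system, and requires almost no data migration.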
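Summary of the Invention
In view of this, the main purpose of the present invention is to provide a stacked deduplication file system that makes full use of the storage capacity of an existing storage system, requires no hardware upgrade and so minimizes investment, and, through a stacked software design, provides deduplication on top of existing file systems, optimizes the data storage structure, and reduces the space occupied in the storage system.
To achieve this purpose, the present invention provides a stacked deduplication file system, the system comprising:
a file system service module which, for normal data, imports the data of the underlying file system into this file system by direct interface conversion and, for deduplicated data, reads the corresponding data attribute identifier and redirects the IO flow, giving transparent, seamless access to data after deduplication;
a deduplication service module which reads the file system log data exported by the file system service module, parses the log content, computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete.
The beneficial effects of the invention are as follows. The stacked file system design makes full use of the existing storage system: simply installing the software system described in this patent lets an existing file system support deduplication and save storage space, with no data migration, while preserving the IO performance of the original storage system, so that existing equipment is reused and the investment is protected.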
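Brief Description of the Drawings
FIG. 1 is a schematic diagram of the architecture of the stacked deduplication file system proposed by the present invention.
Detailed Description
The process by which the present invention implements this architecture is described below with reference to FIG. 1 and a specific example.
As described in the Summary of the Invention, the stacked deduplication file system of the present invention mainly includes a file system service module and a deduplication service module.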
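The file system service module implements a file system with full POSIX support. It follows the stacked file system design strategy, fully implementing the services of the underlying file system through mapping and rewriting at the file system interface layer. For normal data, this module uses direct interface conversion to import the data of the underlying file system into this file system, giving seamless access to normal data. For deduplicated data, the module reads the corresponding data attribute identifier according to the conventions of the file system described in the present invention and redirects the IO flow, giving transparent, seamless access to data after deduplication.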
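The pass-through and redirection behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the class name `StackedReader`, the in-memory `dedup_marks` mapping (standing in for the data attribute identifier), and the `store_root` directory of shared copies are all hypothetical names introduced here.

```python
import os
import tempfile


class StackedReader:
    """Toy stacked layer: passes reads through, or redirects deduplicated files."""

    def __init__(self, lower_root, store_root, dedup_marks):
        self.lower_root = lower_root    # directory on the underlying file system
        self.store_root = store_root    # hypothetical directory holding shared copies
        self.dedup_marks = dedup_marks  # path -> signature; stands in for the attribute identifier

    def read(self, rel_path):
        signature = self.dedup_marks.get(rel_path)
        if signature is None:
            # Normal data: hand the request straight to the underlying file system.
            target = os.path.join(self.lower_root, rel_path)
        else:
            # Deduplicated data: redirect the IO to the shared copy.
            target = os.path.join(self.store_root, signature)
        with open(target, "rb") as f:
            return f.read()


def demo_read():
    """Build a tiny lower directory and store, then read one file of each kind."""
    with tempfile.TemporaryDirectory() as lower, tempfile.TemporaryDirectory() as store:
        with open(os.path.join(lower, "a.txt"), "wb") as f:
            f.write(b"plain")
        with open(os.path.join(store, "sig1"), "wb") as f:
            f.write(b"shared")
        fs = StackedReader(lower, store, {"b.txt": "sig1"})
        return fs.read("a.txt"), fs.read("b.txt")
```

The caller uses the same `read` call in both cases, which is the sense in which access stays transparent; a real stacked file system would apply the same idea at the file system interface layer rather than in a user-level class.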
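The deduplication service module runs independently, out of band. It uses a multi-threaded design that exploits the parallel computing capability of multi-core systems to provide high-speed deduplication. This module reads the file system log data exported by the file system service module, parses the log content, computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete. This module can run at the same time as the file system service module. Fine-grained locks designed into the file system service module guarantee the atomicity of data processing and provide reliable parallel data processing.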
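One deduplication pass of the kind described above can be sketched as follows. The source does not specify the log format, the signature algorithm, or the lock design, so everything concrete here is an assumption: the `WRITE <path>` log record, SHA-256 as the signature, whole-file granularity, one lock per file in a `locks` dictionary that the file system service would share, and the names `dedup_pass`, `store`, and `marks`.

```python
import hashlib
import os
import shutil
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


def signature(path):
    """Return the SHA-256 hex digest of a file's content (assumed signature scheme)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def dedup_pass(root, store, log_lines, marks, locks, workers=4):
    """Hash logged files in parallel and fold duplicates into `store`."""
    # Hypothetical log format: one "WRITE <relative path>" record per line.
    paths = sorted({line.split(" ", 1)[1] for line in log_lines
                    if line.startswith("WRITE ")} - set(marks))
    for rel in paths:
        locks.setdefault(rel, threading.Lock())  # one fine-grained lock per file
    store_lock = threading.Lock()                # guards the shared store and marks

    def process(rel):
        with locks[rel]:  # the file cannot change while it is hashed and replaced
            full = os.path.join(root, rel)
            sig = signature(full)
            with store_lock:
                shared = os.path.join(store, sig)
                if os.path.exists(shared):
                    os.remove(full)            # duplicate: drop the redundant copy
                else:
                    shutil.move(full, shared)  # first copy becomes the shared copy
                marks[rel] = sig               # mark the file as deduplicated

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process, paths))
    return marks


def demo_dedup():
    """Deduplicate three small files, two of which have identical content."""
    with tempfile.TemporaryDirectory() as base:
        root, store = os.path.join(base, "lower"), os.path.join(base, "store")
        os.makedirs(root)
        os.makedirs(store)
        for name, data in [("a", b"same"), ("b", b"same"), ("c", b"other")]:
            with open(os.path.join(root, name), "wb") as f:
                f.write(data)
        log = ["WRITE a", "WRITE b", "WRITE c", "READ a"]
        marks = dedup_pass(root, store, log, {}, {})
        return marks, sorted(os.listdir(store)), os.listdir(root)
```

Holding the per-file lock across both the hash and the replacement is what makes each file's processing atomic in this sketch; the coarser `store_lock` is a simplification that a production design would likely refine.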
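In a typical configuration, the file system service module and the deduplication service module can be installed on the host system as ordinary application software. Once the software has been configured, both modules can be started. At that point the file system described in the present invention can be mounted on the host and data can be accessed. After a period of file system IO has completed, the deduplication service module automatically computes data signatures, detects and deletes duplicate data according to the configuration parameters, and marks the data once deduplication is complete.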
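The source does not name the configuration parameters, so the following is only a sketch of how such parameters might gate a deduplication pass. Both parameters (`idle_seconds`, `min_size_bytes`) and the names `DedupConfig`, `should_start`, and `select_candidates` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupConfig:
    idle_seconds: float = 300.0  # hypothetical: quiet period before a pass may start
    min_size_bytes: int = 4096   # hypothetical: files smaller than this are skipped


def should_start(last_io_time, now, config):
    """True once no file system IO has been seen for the configured quiet period."""
    return now - last_io_time >= config.idle_seconds


def select_candidates(sizes, config):
    """Return, in sorted order, the paths whose size meets the configured minimum."""
    return sorted(path for path, size in sizes.items()
                  if size >= config.min_size_bytes)
```

For example, with the default values, `select_candidates({"a": 10, "b": 5000}, DedupConfig())` keeps only `"b"`.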
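At this point, the entire stacked deduplication file system has been fully implemented, providing a high-performance deduplication service on top of an existing file system, greatly improving the space utilization of the storage system and effectively protecting the customer's investment.
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art may make corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the scope of protection of the claims of the present invention.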

Claims (1)

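  1. A stacked deduplication file system, characterized in that it comprises:
    a file system service module which, for normal data, imports the data of the underlying file system into this file system by direct interface conversion and, for deduplicated data, reads the corresponding data attribute identifier and redirects the IO flow, giving transparent, seamless access to data after deduplication;
    a deduplication service module which reads the file system log data exported by the file system service module, parses the log content, computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete.
PCT/CN2014/089303 2013-11-05 2014-10-23 A stacked deduplication file system WO2015067128A1 (zh)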

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310541623.5 2013-11-05
CN201310541623.5A CN103617177A (zh) 2013-11-05 2013-11-05 A stacked deduplication file system

Publications (1)

Publication Number Publication Date
WO2015067128A1 (zh) 2015-05-14

Family

ID=50167880

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/089303 WO2015067128A1 (zh) 2013-11-05 2014-10-23 A stacked deduplication file system

Country Status (2)

Country Link
CN (1) CN103617177A (zh)
WO (1) WO2015067128A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617177A (zh) * 2013-11-05 2014-03-05 浪潮(北京)电子信息产业有限公司 A stacked deduplication file system
CN104133888B (zh) * 2014-07-30 2019-08-02 宇龙计算机通信科技(深圳)有限公司 A multi-system data processing method, device, and terminal
CN104391915B (zh) * 2014-11-19 2016-02-24 湖南国科微电子股份有限公司 A data deduplication method
CN105205094A (zh) * 2015-08-12 2015-12-30 浪潮(北京)电子信息产业有限公司 A multi-controller shared storage system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7130867B2 (en) * 2001-02-21 2006-10-31 International Business Machines Corporation Information component based data storage and management
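CN101908073A (zh) * 2010-08-13 2010-12-08 清华大学 A method for deleting duplicate data in real time in a file system
CN103051671A (zh) * 2012-11-22 2013-04-17 浪潮电子信息产业股份有限公司 A deduplication method for a cluster file system
CN103279502A (zh) * 2013-05-06 2013-09-04 北京赛思信安技术有限公司 Architecture and method of a deduplication file system combined with a parallel file system
CN103617177A (zh) * 2013-11-05 2014-03-05 浪潮(北京)电子信息产业有限公司 A stacked deduplication file system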

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8266114B2 (en) * 2008-09-22 2012-09-11 Riverbed Technology, Inc. Log structured content addressable deduplicating storage
US20100082700A1 (en) * 2008-09-22 2010-04-01 Riverbed Technology, Inc. Storage system for data virtualization and deduplication

Also Published As

Publication number Publication date
CN103617177A (zh) 2014-03-05

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 14860819; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 14860819; Country of ref document: EP; Kind code of ref document: A1