WO2021174515A1

WO2021174515A1 - Systems and methods for data storage in the expansion of object-based storage systems

Info

Publication number: WO2021174515A1
Application number: PCT/CN2020/078114
Authority: WO
Inventors: Li Wang; Yiming Zhang; Jiawei Xu
Original assignee: Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date: 2020-03-06
Filing date: 2020-03-06
Publication date: 2021-09-10

Abstract

Systems and methods for performing data storage in an object-based storage system that includes a first set of object storage devices (OSDs) are disclosed. An exemplary method may include expanding the object-based storage system by adding a second set of OSDs at a first time point. The method may also include receiving an object created at a second time point. The method may further include mapping the object to the OSDs of the expanded object-based storage system based on a chronological order between the first time point and the second time point.

Description

SYSTEMS AND METHODS FOR DATA STORAGE IN THE EXPANSION OF OBJECT-BASED STORAGE SYSTEMS

TECHNICAL FIELD

The present disclosure relates to systems and methods for data storage in decentralized object-based storage systems, and more particularly to, systems and methods for data storage in an expansion of a decentralized object-based storage system using an algorithm that controls data migration.

BACKGROUND

Object-based storage systems have been widely used for various scenarios such as distributed file storage, remote block storage, small object (e.g., profile pictures) storage, binary large object (blob) storage (e.g., large videos storage) , etc. In an object-based storage system, objects (i.e., the striped files) are distributed among a large number of object storage devices (OSDs) of a storage cluster (e.g., a set of loosely or tightly connected OSDs that work together) . Compared to filesystem-based storage systems, object-based storage systems simplify data layout by exposing an interface for reading and writing objects via object names, and thus reduce management complexity at the backend.

Object-based storage systems can be generally divided into two categories, namely the decentralized object-based storage systems and the centralized object-based storage systems. Decentralized placement methods uniformly distribute objects among OSDs without relying on a central directory, and their clients could directly access objects by calculating (instead of retrieving) the responsible OSDs. As a result, compared to the centralized object-storage systems, the decentralized object-storage systems have the advantages of high scalability, high robustness and high performance where the client can directly access objects by calculating (instead of retrieving) the IDs of the responsible OSDs (e.g., a string of numbers and letters that identifies every individual OSD within the system) .

When expanding the decentralized object-based storage system (e.g., adding new OSDs to the system) , data placement is critical for the scalability of decentralized object-based storage systems. The state-of-art data placement algorithm utilized by decentralized placement methods is the Controlled Replication Under Scalable Hashing (CRUSH) where objects are mapped onto a hierarchical storage cluster map comprising nodes representing OSDs, machines, racks, etc. CRUSH achieves statistical load balancing without a central directory and could automatically re-balance the load when the storage cluster map changes (e.g., system expansions or OSD (s) failures) .

However, after expanding the system, CRUSH would be applied to migrate the load of the entire system to re-balance the load. This would cause significant performance degradation when the expansion is not trivial (e.g., adding several racks of OSDs to the system) . Some existing methods attempt to avoid significant data migration after storage cluster expansions at the cost of temporary load imbalance.

For example, Ceph is a CRUSH-based object storage system which mitigates CRUSH’s migration problem via implementation-level optimizations after cluster expansions. The system limits the migration rate to a relatively low level, and redirects written objects to temporary OSDs to avoid blocking if the migration of the written object has not yet been completed. However, all objects (including the objects already written to the temporary OSDs) will eventually be migrated to the target OSDs calculated by the CRUSH algorithm. This will still cause the system to experience degraded performance for a long period of time.

Embodiments of the disclosure address the above problems by providing data storage systems and methods in the expansion of decentralized object-based storage systems using an algorithm that controls data migration.

SUMMARY

Embodiments of the disclosure provide a method for performing data storage in an object-based storage system that includes a first set of object storage devices (OSDs) . An exemplary method may include expanding the object-based storage system by adding a second set of OSDs at a first time point. The method may also include receiving an object created at a second time point. The method may further include mapping the object to the OSDs of the expanded object-based storage system based on a chronological order between the first time point and the second time point.

Embodiments of the disclosure also provide a system for performing data storage in an object-based storage system. An exemplary system may include a first set of object storage devices (OSDs) and a second set of object storage devices added at a first time point as a result of an expansion of the object-based storage system. The system may also include a communication interface configured to receive an object created at a second time point. The system may also include at least one processor coupled to the communication interface. The at least one processor may be configured to map the object to the OSDs of the expanded object-based storage system based on a chronological order between the first time point and the second time point.

Embodiments of the disclosure further provide a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for data storage in an object-based storage system. The method may include expanding the object-based storage system by adding a second set of OSDs at a first time point. The method may also include receiving an object created at a second time point. The method may further include mapping the object to the OSDs of the expanded object-based storage system based on a chronological order between the first time point and the second time point.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a schematic diagram of an exemplary object-based storage system, according to embodiments of the disclosure.

FIG. 2 illustrates a block diagram of an exemplary manage device, according to embodiments of the disclosure.

FIG. 3 illustrates a flowchart of an exemplary method for data storage in an object-based storage system, according to embodiments of the disclosure.

FIG. 4 illustrates a schematic framework of an exemplary hierarchical cluster map for mapping objects to a storage cluster, according to embodiments of the disclosure.

FIG. 5. illustrates a schematic framework of an exemplary expansion of the object-based storage system, according to embodiments of the disclosure.

FIG. 6. illustrates a schematic framework of data storage in an expanded object-based storage system, according to embodiments of the disclosure.

DETAILED DESCRIPTION

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.

FIG. 1 illustrates a schematic diagram of an exemplary object-based storage system (referred to as “storage system 100” ) , according to embodiments of the disclosure. In some embodiments, storage system 100 may include components shown in FIG. 1, including a user device 110, a storage management device (referred to as “management device” ) 120, storage cluster 130 and a network (not shown) for facilitating communications among the various components. Consistent with the present disclosure storage system 100 is configured to store data file (e.g., profile pictures, videos, etc. ) to a storage cluster 130 that includes multiple storage locations. In some embodiments, a data file may be striped and divided into a number of objects 103 and objects 103 are distributed for storage in the multiple storage locations in storage cluster 130. It is to be contemplated that storage system 100 may include more or less components compared to those shown in FIG. 1.

As shown in FIG. 1, storage system 100 may perform data storage and data retrieving (i.e., data input and output, I/O) . When performing the data storage to write/store objects 103to storage cluster 130, management device 120 may apply a placement algorithm (e.g., a MAPX algorithm which will be disclosed in detail below) to determine storage cluster mapping rules 102. In some embodiments, storage cluster mapping rules 102 may specify which storage location of storage cluster 130 a specific object 103 should be stored in. For example, management device 120 may determine/calculate and/or update a hierarchical cluster map using the MAPX algorithm and user device 110 may map objects 103 to storage cluster 130 according to the ID of the responsible object storage devices (OSDs) based on the calculation. In some embodiments, the hierarchical cluster map may also be updated when one or more OSDs are added to storage cluster 130 (i.e., an expansion/grown of storage cluster 130) at a certain time point. For example, management device 120 may receive expansion information 104 associated with the expansion of storage cluster 130 (e.g., including at least the time point when the expansion happens and information about the added OSDs) and update the hierarchical cluster map accordingly. In some embodiments, expansion information 104 may include more than one expansion happened at different time points.

FIG. 2 illustrates a block diagram of an exemplary manage device 120, according to embodiments of the disclosure, which will be described in more details later. In some embodiments, management device 120 may be configured to dynamically map data files to storage cluster 130 for storage as storage cluster 130 is expanded to include more OSDs. In some embodiments, management device 120 may execute a storage mapping algorithm that is designed to control data migration. For example, FIG. 3 illustrates a flowchart of an exemplary method 300 for data storage in an object-based storage system, according to embodiments of the disclosure, which will be described in more detail later. Method 300 may be performed by management device 120.

In some embodiments, the MAPX algorithm may be a CRUSH algorithm extended with a time-dimension mapping which may apply a virtual level to record and/or differentiate the time point the OSDs being added to storage cluster 130. Based on the chronological order of the time point the OSDs being added and the time point objects 103 are created by, storage system 100 may keep the placement of existing objects (e.g., objects created before a certain expansion) unaffected due to expansion (s) happened after the creation of the objects unless the objects are later remapped, which will be disclosed in detail below. This may lead to a controlled migration of data when the expansion of storage cluster 130 happens. As such, the data migration problem occurring upon expansions of conventional object-based storage systems (e.g., object-based storage systems using CRUSH algorithm for mapping) , may be significantly relieved. As a result, storage system 100 may avoid significant degradation in performance when expansion (s) happen. The scalability of storage system 100 may be tremendously increased accordingly.

On the other hand, when performing the data retrieving, the same disclosed algorithm may also be applied and thus may allow a client (e.g., user of user device 110) to directly access a specific object 103 by calculating the ID of the OSD that stores it.

In some embodiments, storage system 100 may optionally include a network to facilitate the communication among the various components of storage system 100, such as user device 110, management device 120 and storage cluster 130. For example, the network may be a local area network (LAN) , a wireless network, a cloud computing environment (e.g., software as a service, platform as a service, infrastructure as a service) , a client-server, a wide area network (WAN) , etc. In some embodiments, network may be replaced by wired data communication systems or devices.

In some embodiments, the various components of storage system 100 may be remote from each other or in different locations and be connected through the network. In some alternative embodiments, certain components of storage system 100 may be located on the same site or inside one device. For example, management device 120 may be located on-site with or be part of user device 110 such as being an application or a computer program that when executed by user device 110 would perform the storage management (e.g., create data files and/or) using the placement algorithm. As another example, management device 120 may be a computer or a processing device inside storage cluster 130 for generating and updating the hierarchical cluster map for storing the data files to storage cluster 130.

In some embodiments, user device 110 may include, but not limited to, a server, a database/repository, a netbook computer, a laptop computer, a desktop computer, a media center, a gaming console, a television set, a set-top box, a handheld device (e.g., smart phone, tablet, etc. ) , a smart wearable device (e.g., eyeglasses, wrist watch, etc. ) , or any other suitable device. In some embodiments, data file may contain data of a document, a picture, a piece of video or audio, an e-book or a combination thereof. Consistent with the present disclosure, the data file may be created by user device 110 and may be sent to storage cluster 130 for storage and/or replication (e.g., to protect against data loss in the presence of failures) . In some embodiments, the data file may be striped/divided into multiple objects 103 and may be stored to storage cluster 130 according to storage cluster mapping rules 102 (e.g., a hierarchical cluster map) generated/updated by management device 120.

In some embodiments, storage cluster 130 may include a group of interconnected storages that are working together as one unit. For example, storage cluster 130 may be a clustered storage system that includes two or more storage servers, namely object storage devices (OSDs) , working together to increase performance, capacity reliability, etc. Workloads (e.g., I/O and storage of the data) may be distributed among the multiple OSDs and may be transferred between the OSDs. In some embodiments, storage cluster 130 may include Ceph RADOS Block Devices (RBDs) or any suitable clustered storage system that can provide access to all files stored within any OSD of the system regardless of the location of the file. In some embodiments, storage cluster 130 may be a tightly coupled storage cluster that are cooperating, coordinating and sharing workloads to a great degree among each other, or a loosely coupled storage cluster with limits on the amount of coordination and sharing of workloads.

In some embodiments, OSDs included within storage cluster 130 are organized according to a hierarchical cluster map generated/updated by management device 120 using the MAPX algorithm. For example, FIG. 4 illustrates a schematic framework of an exemplary hierarchical cluster map for mapping objects 103 to storage cluster 130, according to embodiments of the disclosure. As illustrated in FIG. 4, hierarchical cluster map 400 may include a root 410 and multiple levels beneath root 410, such as a cabinet level 420, a shelf level 430 and an OSDs level 440. Those levels beneath root 410 are arranged in a descending order as illustrated in FIG. 4, and each level may include at least one node.

It is contemplated that hierarchical cluster map 400 may include more or less levels beneath root 410 than those shown in FIG. 4. The number of levels may vary, depending on the type of data storage cluster 130 stores, the frequency of the I/O of the stored data, the amount of data stored, etc. In some embodiments, levels may be added or deleted within hierarchical cluster map 400 when storage cluster 130 increases or decreases OSDs (i.e., when expansion (s) happen) . This may change the hierarch of certain level.

In some embodiments, each OSD has a weight assigned by management device 120 to control the amount of data stored in the OSD. For example, the load of an OSD is on average proportional to its weight. The weight of an internal bucket (e.g., a first node at a first level) , is recursively calculated as the sum of the weights of its child items (e.g., nodes at a second level below the first level below that are connected to the first node) .

When writing/storing objects 103 to storage cluster 130, at least one node at each level may be chosen based on the MAPX algorithm. For example, as illustrated in FIG. 4, objects 103 may be mapped to responsible placement groups (referred to as PGs hereafter) , and the responsible PGs may be mapped to one or more OSDs for storage. For example, objects 103 may be categorized into PGs by computing the modulo of the hashing of object name of objects 103, e.g., pgid = HASH (name) mod PG_NUM. Then the responsible PG (e.g., PG 450) may be mapped to OSDs by placement procedures of a CRUSH algorithm. For example, the mapping may be according to hierarchical cluster map 400, beginning at root 410. The values in the nodes’ parentheses represent the respective weights for the nodes. For example, root 410 of hierarchical cluster map 400 may be selected at the first operation (take (root) ) of the algorithm. n =3 nodes at the cabinet level from totally m = 4 nodes∈

beneath root 410 may be selected at the second operation (select (3, cab) ) . In one embodiment, the 3 nodes may be selected according to equation (1) :

where pgid is the ID of the input PG, r = 1, 2, ... is a parameter for the argmax computation, HASH is a three-input hash function, and ID (i) and W (i) are ID and weight of an item

respectively. In some embodiments, to choose n nodes at a certain level, equation (1) may be performed for more than n times. For example, if the output of equation (1) is taken by a previous computation, or the node is failed (e.g., representing failed/overloaded OSD (s) ) , equation (1) may have to be performed again.

Back to the example in FIG. 4, similarly, one or more nodes at the shelf level and one or more node at the OSD level beneath each of the chosen node at the cabinet level may be selected at subsequent operations in the CRUSH algorithm. For example, (select (1, shelf) ) and (select (1, osd) ) using equation (1) can be performed to select 1 node at the shelf level and 1 node at the OSO respectively.

The CRUSH algorithm may support flexible constraints for reliable replica placement by (i) encoding the information of failure domains (such as shared power source or network) into the hierarchical cluster map, and (ii) allowing manage device 120 to define the placement rules that specify how replicas are placed by recursively selecting nodes at each level of the hierarchical cluster map.

In some embodiments, storage cluster 130 may achieve scalability by increasing the OSDs included within storage cluster 130. Consistent with the disclosure, adding OSDs to storage cluster 130 is referred to as an “expansion. ” In some embodiments, OSDs may be added to storage cluster 130 at different time points through multiple expansions. For example, FIG. 5. illustrates a schematic framework of an exemplary expansion for the object-based storage system, according to embodiments of the disclosure. For the specific example, as illustrated in FIG. 5, two expansions happened to original storage cluster 130 which includes n cabinets each having 2 shelves. The first expansion added n shelfs 530 (one shelf to each of the 1, 2, ..., n cabinets) at a first time point and the second expansion added the n+1, n+2, ..., n+m cabinets 520 to storage cluster 130 at a second time point. Upon each expansion, management device 120 may apply the MAPX algorithm to update hierarchical cluster map 400 (e.g., into hierarchical cluster map 500) for mapping objects 103 to storage cluster 130 (e.g., the Ceph RBDs) .

As shown in FIG. 1, management device 120 may generate/update hierarchical cluster map 500 for mapping objects 103 to storage cluster 130 when expansion (s) happen. In some embodiments, management device 120 may receive expansion information 104 from storage cluster 130 and generate/update hierarchical cluster map 500 based on expansion information 104. For example, when mapping objects 103 to storage cluster 130, management device 120 may first map objects 130 to responsible PG (s) and may further map the responsible PG (s) to one or more OSDs within storage cluster 130 according to hierarchical cluster map 500.

To implement controlled data migration (e.g., for rebalancing purposes) , when expansion (s) happen, management device 120 may insert a virtual level right beneath root 510 (e.g., a layer level) . The virtual level includes nodes representing the different expansions and the original OSDs within storage cluster 130. As such, the data migration within storage cluster 130 may be controlled by an extra time-dimension mapping (e.g., based on the chronological order between the time point each expansion was made and the time point objects 103 are created) .

For example, FIG. 6. illustrates a schematic framework of data storage in an expanded object-based storage system 600, according to embodiments of the disclosure. As illustrated in FIG. 6, layers within a virtual level (e.g., layer level 610) may be assigned to the groups of OSDs according to the time they get added to object-based storage system 600. For example, “layer 0” is assigned to the original OSDs within storage cluster, “layer 1” is assigned to the OSDs expanded at the first time point and “layer 2” is assigned to the OSDs expanded at the second time, respectively. Each layer within the layer level may include the corresponding OSDs along with the nodes from the OSDs up to the root (not including the root) . For example, layer 0 may include all the nodes of the original storage cluster 130 before any expansion (e.g., expansions happened at the first and the second time points) . Layer 1 may represent the expansion at the first time point and may include shelfs 630 that are added by the first expansion and the nodes connected to shelfs 630 up to the root (e.g., cabinets connected to shelfs 630) . Layer 2 may represent the expansion at the second time point and may include cabinets 620 that are added by the second expansion as well as the nodes beneath cabinets 620 (e.g., the child items of cabinets 620) . At the time point of each expansion, at least one PG may be assigned to the corresponding layer representing that expansion.

After an expansion, when mapping objects 103 to storage cluster 130, management device 120 may first map objects 103 to responsible PGs, and then map the responsible PGs to one or more OSDs according to original and/or updated hierarchical cluster map 400 based on the chronological order of the time point the objects 103 are created and the time point of the expansion (s) ) . For example, when mapping the PG (s) , the MAPX algorithm may first select a layer for the PG (s) and select at least one node at each level beneath the layer level according to the CRUSH algorithm.

By applying the MAPX method for generating/updating hierarchical cluster map 400, management device 120 may differentiate the old OSDs (e.g., OSDs added before the expansion) , and the new OSDs (e.g., OSDs added at expansion (s) happened at certain time points) when re-balancing existing objects and/or storing new objects to storage cluster 130. In this way, the weight of the prior layer (i.e., layers representing OSDs before the expansion) , namely, the placement of existing objects will not be changed (or may only experience minor changes) due to the expansion happened afterwards unless the objects are later remapped. This may significantly relieve the data migration happened to conventional object-based storage systems (e.g., object-based storage systems using the conventional CRUSH algorithm for mapping) when expansion (s) happen. Thus, the scalability of the object-based storage system may be significantly increased by utilizing the disclosed algorithm.

In some embodiments, as shown in FIG. 2, management device 120 may include a communication interface 202, a processor 204, a memory 206, and a storage 208. In some embodiments, management device 120 may have different modules in a single device, such as an integrated circuit (IC) chip (e.g., implemented as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) ) , or separate devices with dedicated functions. In some embodiments, one or more components of management device 120 may be located in a cloud or may be alternatively in a single location (such as inside a mobile device) or distributed locations. Components of management device 120 may be in an integrated device or distributed at different locations but communicate with each other through a network (not shown) . In some embodiments, part of or all of the components of management device 120 may be within user device 110. For example, communication interface 202, a processor 204, a memory 206, and a storage 208 may be part of user device 110 and may be shared by management device 120 for the data storage management purposes. Consistent with the president disclosure, management device 120 may be configured to generate and/or update hierarchical cluster map 600 for mapping objects 103 to storage cluster 130.

Communication interface 202 may send data to and receive data from components such as user device 110 and storage cluster 130 via communication cables, a Wireless Local Area Network (WLAN) , a Wide Area Network (WAN) , wireless networks such as radio waves, a cellular network, and/or a local or short-range wireless network (e.g., Bluetooth ^TM) , or other communication methods. In some embodiments, communication interface 202 may include an integrated service digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection. As another example, communication interface 202 may include a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links can also be implemented by communication interface 202. In such an implementation, communication interface 202 can send and receive electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

Consistent with some embodiments, communication interface 202 may receive data files from user device 110 and expansion information 104 from storage cluster 130. Communication interface 202 may further provide the received data and information to memory 206 and/or storage 208 for storage or to processor 204 for processing.

Processor 204 may include any appropriate type of general-purpose or special-purpose microprocessor, digital signal processor, or microcontroller. Processor 204 may be configured as a separate processor module dedicated to mapping objects 103 using a placement algorithm (e.g., the MAPX CRUSH) . Alternatively, processor 204 may be configured as a shared processor module (e.g., a processor of user device 110) for performing other functions in addition to data storage mapping.

Memory 206 and storage 208 may include any appropriate type of mass storage provided to store any type of information that processor 204 may need to operate. Memory 206 and storage 208 may be a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other type of storage device or tangible (i.e., non-transitory) computer-readable medium including, but not limited to, a ROM, a flash memory, a dynamic RAM, and a static RAM. Memory 206 and/or storage 208 may be configured to store one or more computer programs that may be executed by processor 204 to perform functions disclosed herein. For example, memory 206 and/or storage 208 may be configured to store program (s) that may be executed by processor 204 to generate and/or update hierarchical cluster map 600 for mapping the data storage.

In some embodiments, memory 206 and/or storage 208 may also store intermediate data such as hash result, PGs’ ID, cache data such as timestamps assigned to each node within hierarchical cluster map 600, nodes selections at each level, etc. Memory 206 and/or storage 208 may additionally store placement algorithms including their parameters.

As shown in FIG. 2, processor 204 may include multiple modules, such as a hierarchical cluster map update unit 240, an object storage device mapping unit 242, a placement group device mapping unit 244, a placement group remapping unit 246, and the like. These units (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 204 designed for use with other components or software units implemented by processor 204 through executing at least part of a program. The program may be stored on a computer-readable medium, and when executed by processor 204, it may perform one or more functions. Although FIG. 2 shows units 240-246 all within one processor 204, it is contemplated that these units may be distributed among different processors located closely or remotely with each other.

In some embodiments, units 242-246 of FIG. 2 may execute computer instructions to perform the data storage, e.g., method 300 illustrated by FIG. 3. The data storage may be implemented according to expanded hierarchical cluster map 600 (illustrated in FIG. 6) . Method 300 may be implemented by management device 120 and particularly processor 204 or a separate processor not shown in FIG. 2. Method 300 may include steps S302-S312 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 3. FIG. 2 and FIG. 3 would be disclosed together.

In step S302, communication interface 202 may receive expansion information 104 associated with expansion (s) to the storage cluster. For example, as illustrated in FIG. 6, expansion information 104 may include information associated with a first expansion adding shelfs 630 (e.g., adding one shelf to each of the original cabinets) to storage cluster 130 at a first time point t ₁. It is contemplated that expansion information 104 may include information associated with more than one expansion. For example, expansion information 104 may also include information associated with a second expansion adding cabinets 620 to storage cluster 130 at a second time point t ₂.

In step S304, hierarchical cluster map update unit 240 may update the hierarchical cluster map based on expansion information 104. For example, as illustrated in FIG. 6, a virtual level (e.g., layer level 610) may be inserted beneath the “root” to label each set of OSDs within storage cluster according to corresponding timestamps at which storage cluster 130 is expanded to add the respective OSDs. For example, the original OSDs within storage cluster 130 along with their parent items associated with a timestamp t ₀ may be labeled as layer 0. The expansion at the first time point (e.g., associated with a timestamp t ₁ corresponding to the time point the first expansion happened) , may be labeled as layer 1. Similarly, the expansion at the second time point (e.g., associated with a timestamp t ₂ corresponding to the time point the second expansion happened) , may be labeled as layer 2. As a result, each layer may include the original OSDs or the OSDs expanded at a certain time point, and their parent/child items (e.g., nodes from the OSDs up to but not including the root) . In other words, the OSDs in a current object-based storage system 100 are grouped within different layers according to their timestamps. For example, layer 0 may represent the original storage cluster and include all the nodes of the original storage cluster 130 beneath the “root” before the first and the second expansion. Layer 1 may represent the first expansion and may include shelfs 630 and the corresponding nodes up to the “root” . Layer 2 may represent the expansion at the second time point and may include cabinets 620 as well as the nodes beneath cabinets 620.

In some embodiments, at least one PG may be assigned to each layer. The PG assigned to the layer may have a static timestamp t _pgs that equals to the timestamp associated with the layer. For example, the PG (s) being assigned to layer 1 may be associated with the static timestamp t _pgs= t ₁, and the PGs being assigned to layer 2 may be associated with the static timestamp t _pgs= t ₂.

In step S306, communication interface 202 may receive a data file from user device 110. For example, communication interface 202 may receive data files RBD ₁, RBD ₂ and RBD ₃ at a third timepoint, a fourth timepoint and a fifth timepoint respectively. In some embodiments, RBD ₁, RBD ₂ and RBD ₃ may each be striped/distributed to corresponding objects 103. Each object 103 may include information of the data (e.g., the name of object 103) . Objects 103 received at different time points may each be assigned with a timestamp corresponding to the time point objects 103 are created. For example, objects 103 of RBD ₁ may be assigned with the timestamp of t ₃ corresponding to the third time point, objects 103 of RBD ₂ may be assigned with the timestamp of t ₄ corresponding to the fourth time point and objects 103 of RBD ₃ may be assigned with the timestamp of t ₅ corresponding to the fifth time point.

In step S308, object storage device mapping unit 242 may map objects 103 to a responsible placement group among a plurality of placement groups associated with original storage cluster 130 or the expansion (s) based on the name of each object. For example, placement group mapping unit 242 may calculate the ID of the PG for mapping an object 103 according to equation (2) :

where name is the name of object 103 and INIT_PG_NUM [i] is the initial number of PGs of the i ^th layer. The m ^th layer is the layer associated with the latest timestamp t _l≤ t ₃ among all layers.

In some embodiments, each object 103 of the RBDs may be mapped to a responsible PG based on the chronological order of the timestamps of the RBD and the PG (s) (e.g., the PG that has the latest timestamp t _pgs before t ₃, t ₄ or t ₅ respectively) . For example, suppose RBD ₁, RBD ₂ and RBD ₃ are received before the first expansion, between the first and the second expansion and after the second expansion respectively, namely, t ₀ _≤ t ₃ _< t ₁ _≤ t ₄ _< t ₂ _≤ t ₅, RBD ₁, RBD ₂ and RBD ₃ may be assigned to layer 0, layer 1 and layer 2 respectively according to equation (2) .

In step S310, placement group mapping unit 244 may map PG (s) to OSDs based on hierarchical cluster map 600 using MAPX. Similar to the conventional CRUSH algorithm, MAPX maps a PG onto a list of OSDs following a sequence of operations in a user-defined placement rule. MAPX implicitly adds a select operation (select (1, layer) ) to the placement rule, so as to realize the time-dimension mapping where the chronological order of the static timestamps of each PG and each layer are used for mapping PGs to layers, without the interference of operator.

Specifically, MAPX extends the CRUSH’s original select operation to support the layer-type selection () . In some embodiments, the pseudo code of the algorithm for the layer-type selection () is shown in Table 1 below.

Table 1

Noted in the pseudo code, if type is not “layer” , then the processing is reduced to CRUSH (Lines 2-4) . Otherwise, an array of layers which stores all layers beneath the currently processing bucket (usually the “root” ) in an ascending order of the layers’ timestamp may be initiated (Line 5) . Also, num_layers (the number of layers) , pg (the placement group) and

(the output list) may be initiated (Lines 6-8) . Then number layers in the array of layers may be added to the output list

(Lines 9-21) . In most cases, only one expansion may be made to storage cluster 130 and thus, number = 1 where the PG may be mapped to OSDs in one layer. In some embodiments, for example, as illustrated in FIG. 6, two expansions may be made to storage cluster 130 and thus, number = 2. It is contemplated that number is not limited to 1 or 2, and may be a larger number for e.g., mirroring between two layers of expansions.

In some embodiments, a selection () operation may be applied to check whether layer has been chosen by pervious selection () (Lines 12-14) . If so, the current loop may be ended, and the algorithm proceeds to the next loop. This may prevent duplicate layer selections when performing backtracking, i.e., when a select () cannot select enough items beneath a “layer” bucket, MAPX will retain (rather than abandon) the selected items and backtrack to the root to select the lacking items beneath a previous layer. For example, as illustrated in FIG. 6, the second expansion added two cabinets 620 to storage cluster 130. Suppose PG (s) assigned to layer 2 require three cabinets (e.g., with a select function select (3, cabinet) in the CRUSH algorithm) . This will cause the selection function at the layer level (e.g., select (1, layer) ) to be invoked twice due to the backtracking. The selection () operation may ensure no overflow happens to layer 2 by returning layer 2 and layer 1 for the first and the second invocations respectively in this example.

In some embodiments, method 300 may optionally include step S312. In step S312, placement group remapping unit 246 may remap PG (s) to another layer (i.e., a layer different from the currently assigned layer) for re-balancing the workload (as well as the hierarchical cluster map) when load of a layer changes because of e.g., removals of objects, failures of OSDs, unpredictable workload changes etc. For example, it is possible that the cluster performs the second expansion (layer 2) when the load of the first expansion (layer1) is as high as that of the original cluster (layer 0) , but afterwards a large number of objects of layer 1 are removed and consequently the loads of the first two layers may get severely imbalanced.

In some embodiments, in addition to the static timestamp t _pgs initially assigned to each PG, placement group remapping unit 246 may additionally assign a dynamic timestamp t _pgd to the PG. In some embodiments, the dynamic timestamp t _pgd can be set as the time point of any expansion (i.e., equals to the timestamp of the layer corresponding to that expansion) . Different from the operation in step S310, where the static timestamp of each PG is used for mapping the PG to OSDs, placement group remapping unit 246 may use the dynamic timestamp t _pgd for mapping the PG to OSDs. Accordingly, PG (s) may be remapped to any layer by manipulating the dynamic timestamp t _pgd. For example, as illustrated in FIG. 6, objects 103 of PG 640 may be remapped from OSDs in layer 1 to OSDs in layer 0 by setting the dynamic timestamp t _pgd of PG 640 to be t ₀ (in contrast with the static timestamp of PG 640 which has the value of t ₁) .

Another aspect of the disclosure is directed to a non-transitory computer-readable medium storing instructions which, when executed, cause one or more processors to perform the methods, as discussed above. The computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices. For example, the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed. In some embodiments, the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.

It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed system and related methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and related methods.

It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims

A method for performing data storage in an object-based storage system, wherein the object-based storage system comprises a first set of object storage devices (OSDs) , comprising:

expanding the object-based storage system by adding a second set of OSDs at a first time point;

receiving an object created at a second time point; and

mapping the object to the OSDs of the expanded object-based storage system based on a chronological order between the first time point and the second time point.
The method of claim 1, wherein mapping the object further comprises:

mapping the object to the first set of OSDs if the first time point is later than the second time point; and

mapping the object to the second set of OSDs if the first time point is no later than the second time point.
The method of claim 1, wherein mapping the object further comprises:

mapping the object to a responsible placement group among a plurality of placement groups associated with the expansion; and

mapping the responsible placement group to one or more OSDs.
The method of claim 3, wherein mapping the responsible placement group to one or more OSDs uses a hierarchical cluster map, wherein the hierarchical cluster map comprises at least a first level and a second level beneath the first level, and wherein the first level includes a plurality of first nodes, each first node connected with a plurality of second nodes at the second level.
The method of claim 4, wherein mapping the responsible placement group to one or more OSDs further comprises selecting one or more first nodes at the first level and selecting one or more second nodes connected to the selected first nodes at the second level according to a user-defined placement rule.
The method of claim 4, wherein the object-based storage system has a plurality of expansions, and each node at the first level corresponds to one of the plurality of expansions to the object-based storage system.
The method of claim 3, further comprising:

dynamically associating the responsible placement group with an additional expansion to the object-based storage system occurred at a third time point; and

remapping the responsible placement group to the expansion occurred at third time point based on a chronological order between the second time point and the third time point.
The method of claim 3, wherein mapping the responsible placement group to one or more OSDs further comprises:

mapping the responsible placement group to an OSD added at a latest time point no later than the second time point.
The method of claim 3, wherein mapping the object to a responsible placement group further comprises calculating a modulo of a hash function applied on a name of the object against the number of placement groups associated with the expansion.
A system for performing data storage in an object-based storage system, comprising:

a first set of object storage devices (OSDs) ; a second set of object storage devices added at a first time point as a result of an expansion of the object-based storage system;

a communication interface configured to receive an object created at a second time point; and

at least one processor coupled to the communication interface and configured to map the object to the OSDs of the expanded object-based storage system based on a chronological order between the first time point and the second time point.
The system of claim 10, wherein to map the object, the at least one processor is further configured to:

map the object to the first set of OSDs if the second time point is later than the first time point; and

map the object to the second set of OSDs if the first time point is no later than the second time point.
The system of claim 11, wherein to map the object, the at least one processor is further configured to:

map the object to a responsible placement group among a plurality of placement groups associated with the expansion; and

map the responsible placement group to one or more OSDs.
The system of claim 12, wherein the responsible placement group is mapped to one or more OSDs using a hierarchical cluster map, wherein the hierarchical cluster map comprises at least a first level and a second level beneath the first level, and wherein the first level includes a plurality of first nodes, each first node connected with a plurality of second nodes at the second level.
The system of claim 13, wherein to map the responsible placement group to one or more OSDs, the at least one processor is further configured to select one or more first nodes at the first level and selecting one or more second nodes connected to the selected first nodes at the second level according to a user-defined placement rule.
The system of claim 13, wherein the object-based storage system has a plurality of expansions, and each node at the first level corresponds to one of the plurality of expansions to the object-based storage system.
The system of claim 12, wherein the at least on processor is further configured to:

dynamically associate the responsible placement group with an additional expansion to the object-based storage system occurred at a third time point; and

remap the responsible placement group to the expansion occurred at third time point based on a chronological order between the second time point and the third time point.
The system of claim 12, wherein to map the responsible placement group, the at least on processor is further configured to:

map the responsible placement group to an OSD added at a latest time point no later than the second time point.
The system of claim 12, wherein to map the object to a responsible placement group, the at least on processor is further configured to calculate a modulo of a hash function applied on a name of the object against the number of placement groups associated with the expansion.
A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for performing data storage in an object-based storage system, wherein the object-based storage system comprises a first set of object storage devices (OSDs) , the method comprising:

expanding the object-based storage system by adding a second set of OSDs at a first time point;

receiving an object created at a second time point; and

mapping the object to the OSDs of the expanded object-based storage system based on a chronological order between the first time point and the second time point.
The non-transitory computer-readable medium of claim 19, wherein mapping the object further comprises:

mapping the object to the first set of OSDs if the first time point is later than the second time point; and

mapping the object to the second set of OSDs if the first time point is no later than the second time point.