CN112860186A - Capacity expansion method for billion-level object storage bucket

Info

Publication number: CN112860186A
Application number: CN202110162888.9A
Authority: CN
Inventors: 张燕咏; 张致江; 王芝斌
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2021-02-05
Filing date: 2021-02-05
Publication date: 2021-05-28

Abstract

The invention discloses a capacity expansion method for billion-level object storage buckets, comprising: constructing a unified management platform that includes a virtual mapping layer, a metadata center, an entry component, and an operation and maintenance component; and uniformly managing the control plane and the data plane of heterogeneous object storage through the unified management platform, so that switching during capacity expansion is smooth. The method has the following advantages: 1) there is no single-cluster, single-bucket scale risk, and capacity can be expanded horizontally by design; during capacity expansion the service is unaware of the change, and the hot-switched configuration takes effect immediately; 2) there is no single-cluster, single-bucket performance bottleneck, and the performance and reliability of the object storage service are ensured by attaching multiple heterogeneous object storage clusters through a virtual mapping method.

Description

Capacity expansion method for billion-level object storage bucket

Technical Field

The invention relates to the technical field of cloud storage, and in particular to a capacity expansion method for billion-level object storage buckets.

Background

With the rapid development of big data and AI technologies, higher demands are placed on the underlying storage system. In existing object storage, the performance of a single bucket in a single cluster drops sharply once the bucket exceeds roughly 500 million to 1 billion objects, the metadata storage grows sharply, and expanding capacity within the single bucket becomes very challenging.

Most existing single-bucket capacity expansion schemes for object storage only support expansion within a single cluster. The current approaches are briefly described below, taking the two major open-source storage systems in the object storage field (SWIFT and CEPH) as examples.

The storage engine of a Swift open-source object storage bucket is an SQLite database. Once the index in the SQLite database reaches about 500 million entries, its performance degrades severely, and capacity expansion within a single cluster or a single bucket becomes difficult. In Ceph open-source object storage, once the bucket index reaches the hundred-million level, a metadata index reorganization strategy can be adopted, but it requires rebalancing a large amount of metadata; once a single bucket reaches billions of objects, the metadata expansion and reorganization greatly affects system performance.

The prior art considers only the metadata storage scale of a single bucket in a single cluster, which is very limited. How to let a single object storage bucket reach the order of hundreds of millions of objects while keeping the service unaware of capacity expansion is a problem that urgently needs to be solved.

Disclosure of Invention

The invention aims to provide a capacity expansion method for billion-level object storage buckets that achieves elastic capacity expansion of a single bucket in a single cluster under massive data, keeps the service unaware during capacity expansion, requires no data balancing operation in the underlying storage, and ensures service availability.

The purpose of the invention is realized by the following technical scheme:

A capacity expansion method for billion-level object storage buckets comprises: constructing a unified management platform comprising a virtual mapping layer, a metadata center, an entry component and an operation and maintenance component; wherein:

the virtual mapping layer is responsible for the mapping relation between an object storage bucket and the multiple underlying heterogeneous object storage clusters; a user has one or more object storage buckets, and each object storage bucket stores one or more storage objects;

the entry component is responsible for intercepting a user request, determining the object the request refers to, and routing and forwarding the request according to the metadata center, the virtual mapping, and the actual information of the underlying object storage clusters;

the metadata center is responsible for recording the location of each object in the underlying object storage clusters and for persisting configuration information;

the operation and maintenance component is responsible for initializing the platform's configuration information and for change operations on the entry component, the virtual mapping layer and the metadata center;

the control plane and the data plane of the multiple heterogeneous object storage clusters are managed uniformly through the unified management platform, so that switching during capacity expansion is smooth.

The technical scheme provided by the invention has the following effects. 1) The single-bucket capacity expansion method involves no data balancing and no single-cluster, single-bucket scale risk, and capacity can be expanded horizontally by design; during capacity expansion the service is unaware of the change, and the hot-switched configuration takes effect immediately. 2) There is no single-cluster, single-bucket performance bottleneck; the performance and reliability of the object storage service are ensured by attaching multiple heterogeneous object storage clusters through a virtual mapping method. 3) Few requirements are placed on the type of the underlying object storage cluster: underlying data can be stored across heterogeneous clusters, and a cluster can be added for expansion as long as it supports the Swift interface or the S3 interface.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.

FIG. 1 is a basic architecture diagram of a capacity expansion method for billion-level object storage buckets according to an embodiment of the present invention;

FIG. 2 is a logic diagram of a virtual map provided by an embodiment of the present invention;

FIG. 3 is a block diagram of the structure and processing logic of an entry component according to an embodiment of the present invention;

FIG. 4 is a diagram of a metadata center engine and CEPH relationship provided by an embodiment of the present invention;

FIG. 5 is a diagram of a basic architecture of a metadata center provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of a two-dimensional HASH algorithm of a metadata center according to an embodiment of the present invention;

FIG. 7 is a basic structure diagram of the multi-version mechanism of the metadata center according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

From the perspective of business usage, existing capacity expansion schemes work around the storage bottleneck of a single bucket by creating new buckets keyed by timestamp or by bucket index, which addresses the single-bucket storage scale. In most service scenarios the service maintains a global index DB, which must itself be reorganized once it reaches a certain scale. From the perspective of the underlying storage, a single bucket is generally expanded by reorganizing the metadata groups in the bucket, which addresses the expansion of the metadata storage scale within the single bucket.

However, whether the problem is handled at the underlying single-bucket storage layer or at the business layer, these approaches cope well only with single-bucket metadata at the hundred-million scale; when facing buckets with billions of metadata entries, the following key problems need to be solved:

1) How to guarantee the stability of the underlying storage service during metadata expansion.

2) How to handle billion-scale metadata in a single bucket.

3) How to smoothly expand the storage cluster data plane after a single bucket of a single storage cluster reaches its upper limit of hundreds of millions of objects.

The embodiment of the invention provides a capacity expansion method for billion-level object storage buckets, which addresses the pain points that existing capacity expansion schemes cannot expand elastically under massive data and that capacity expansion is difficult.

FIG. 1 shows the basic architecture of the capacity expansion method for billion-level object storage buckets. A unified management platform comprising a virtual mapping layer, a metadata center, an entry component, and an operation and maintenance component is first constructed; the control plane and the data plane of the multiple heterogeneous object storage clusters are then managed uniformly through this platform, so that switching during capacity expansion is smooth. Wherein:

1) Uniformly managing the data plane of the multiple heterogeneous object storage clusters comprises: obtaining, through the metadata center and the virtual mapping layer, the object storage cluster entry corresponding to a given object of the current user; if no record exists, having the entry component query which object storage cluster holds the object, updating the metadata center with the result once the record is found, and forwarding the request to the corresponding underlying object storage cluster.

2) Uniformly managing the control plane of the multiple heterogeneous object storage clusters comprises: when the water level (that is, the usage level) of the object storage cluster written to by default exceeds a set value, smoothly switching the default write target to a new object storage cluster. The service is unaware of the capacity expansion, and the underlying storage performs no data balancing operation, which ensures service availability.

The parts of the unified management platform are introduced as follows:

The virtual mapping layer is responsible for the mapping relation between an object storage bucket (user bucket) and the multiple underlying heterogeneous object storage clusters;

The operation and maintenance component is responsible for initializing the platform's configuration information and for change operations on the entry component, the virtual mapping layer and the metadata center.

In addition, the unified management platform is provided with a data migration component, which is mainly responsible for rollback and migration operations across the multiple object storage clusters after capacity expansion or after a failed change.

Those skilled in the art will understand that a user may have one or more object storage buckets, each of which stores one or more storage objects, referred to herein simply as objects. "Object" is a term of art: objects are stored in a flat structure with no hierarchy and are characterized by extensible metadata.

For ease of understanding, the following detailed description is directed to various portions of the unified management platform.

First, the virtual mapping layer.

FIG. 2 is a logic diagram of the virtual mapping. An object storage bucket has a one-to-many relationship with the underlying heterogeneous object storage clusters, and the mapping is stored and maintained by the virtual mapping layer. When the water level of the object storage cluster written to by default (which can be chosen freely at the initial time) exceeds a set value, the default write cluster is changed and the change takes effect hot; the write traffic of the current object storage bucket is then forwarded to the new default write cluster. This operation is the unified control-plane management of the multiple heterogeneous object storage clusters described above.

Working together with the metadata center, the virtual mapping layer can support billion-level scale, and a single bucket can readily support EB-level storage. Bucket expansion takes effect hot and requires no data balancing, the stability and scale of each individual object storage cluster are kept within a controlled range, and data with a long history can be archived.
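The hot switch of the default write cluster can be illustrated with a minimal Python sketch. The class names, the object-count capacity model, and the 0.8 threshold are assumptions made for illustration; the patent does not specify how the water level is measured or what the set value is.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    """One underlying object storage cluster (hypothetical model)."""
    name: str
    capacity: int  # assumed capacity unit: number of objects
    used: int = 0

    @property
    def water_level(self) -> float:
        return self.used / self.capacity


@dataclass
class BucketMapping:
    """One user bucket mapped onto several clusters (one-to-many)."""
    clusters: List[Cluster]
    default_write: int = 0  # index of the default write cluster
    threshold: float = 0.8  # assumed "set value" for the water level

    def write_target(self) -> Cluster:
        """Return the default write cluster, switching it if it is too full."""
        if self.clusters[self.default_write].water_level > self.threshold:
            for i, cluster in enumerate(self.clusters):
                if cluster.water_level <= self.threshold:
                    self.default_write = i  # only the pointer changes
                    break
        return self.clusters[self.default_write]

    def record_write(self) -> str:
        """Account for one written object; return the cluster that took it."""
        target = self.write_target()
        target.used += 1
        return target.name


mapping = BucketMapping([Cluster("cluster-a", 10), Cluster("cluster-b", 10)])
targets = [mapping.record_write() for _ in range(12)]
print(targets)
```

In this sketch the switch changes only the `default_write` pointer: objects already written to `cluster-a` stay where they are, which mirrors the claim that expansion involves no data balancing.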

Second, the entry component.

FIG. 3 shows the structure and processing logic of the entry component. The service access protocol may be S3 or Swift, both of which are common in the object storage field. The working process of the entry component mainly comprises the following steps:

1) After intercepting the user request, determine the object the request refers to, and authenticate through the service layer whether the user key carried in the request is valid.

2) If the key is valid, the request enters the routing layer, which first determines the type of the user request (PUT/GET/DELETE/HEAD) and whether the file involved is fragmented.

As those skilled in the art will understand, these are HTTP request methods: PUT is a write or update operation, and GET, HEAD, and DELETE are a query operation, a read operation (reading the object's header information), and a delete operation, respectively. Each user request operates on one underlying object at a time; a user may create multiple object storage buckets, and multiple objects may be created inside an object storage bucket.

3) For a PUT request of a non-fragmented file, the latest record information is obtained through the metadata center. Specifically, to speed up queries, writes go directly to the default write object storage cluster and are recorded; as a result, the same object can easily exist in several object storage clusters. A metadata center query always returns the latest record. If the metadata center has no record, the copies are distinguished by reading the timestamp in the header information of the object (Object HEAD), and the record with the latest timestamp is returned to the entry component.

4) For a GET, HEAD, or DELETE request of a non-fragmented file, an RPC request (RPC is a network communication protocol used between the internal components) is sent to the metadata center to check whether it holds record information for the requested object. If so, the request is forwarded directly to the corresponding underlying object storage cluster. If not, each underlying object storage cluster is queried with a fallback Object HEAD request (which reads the object's header information), the result is written back to the metadata center, and the request is then forwarded to the corresponding underlying object storage cluster.

5) A fragmented file is organized from multiple files, and its fragments are kept in the same object storage cluster as far as possible when written. The location of the file's first fragment is therefore looked up first, with a double guarantee from the metadata center and Object HEAD: the metadata center is queried first and, if the record is found, it is returned directly; otherwise the lookup continues with an Object HEAD query. Other operations are handled as for ordinary (non-fragmented) files.

Meanwhile, all operation logs are written to the log retrieval engine during operation, and an APM system monitors the working state of the entry component, which ensures the reliability of the entry component's service.
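The routing steps for non-fragmented files can be sketched as follows. This is an illustrative in-memory model, not the patented implementation: the `EntryComponent` class, the dictionary-based clusters (object name mapped to a timestamp), and the choice to record every PUT against the default write cluster are assumptions. Fragmented files, logging, and APM monitoring are omitted.

```python
from typing import Dict, Optional, Set


class EntryComponent:
    """Toy request router; each cluster is a dict of object name -> timestamp."""

    def __init__(self, clusters: Dict[str, Dict[str, int]],
                 default_write: str, valid_keys: Set[str]):
        self.clusters = clusters
        self.default_write = default_write
        self.valid_keys = valid_keys
        self.metadata: Dict[str, str] = {}  # metadata center: object -> cluster

    def _head_all(self, obj: str) -> Optional[str]:
        """Fallback Object HEAD on every cluster; the latest timestamp wins."""
        hits = [(objs[obj], name) for name, objs in self.clusters.items()
                if obj in objs]
        return max(hits)[1] if hits else None

    def _locate(self, obj: str) -> Optional[str]:
        cluster = self.metadata.get(obj)
        if cluster is None:
            cluster = self._head_all(obj)
            if cluster is not None:
                self.metadata[obj] = cluster  # write back to the metadata center
        return cluster

    def route(self, key: str, method: str, obj: str,
              timestamp: int = 0) -> Optional[str]:
        """Return the name of the cluster the request is forwarded to."""
        if key not in self.valid_keys:  # step 1: authenticate the user key
            raise PermissionError("invalid user key")
        if method == "PUT":  # step 3 (assumed): write to the default cluster
            self.clusters[self.default_write][obj] = timestamp
            self.metadata[obj] = self.default_write
            return self.default_write
        cluster = self._locate(obj)  # step 4: GET / HEAD / DELETE
        if cluster is not None and method == "DELETE":
            del self.clusters[cluster][obj]
            del self.metadata[obj]
        return cluster


entry = EntryComponent(
    clusters={"c1": {"old.bin": 100, "dup.bin": 1}, "c2": {"dup.bin": 5}},
    default_write="c2",
    valid_keys={"user-key"},
)
print(entry.route("user-key", "GET", "old.bin"))   # found by fallback HEAD
print(entry.route("user-key", "HEAD", "dup.bin"))  # latest timestamp wins
```

The first lookup of `old.bin` misses the metadata center, falls back to querying every cluster, and writes the result back, so later requests for the same object are served from the metadata center alone.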

Third, the metadata center.

The metadata center is the core engine of the unified management platform. It is a self-developed object storage bucket index for the billion scale, built on the CEPH RADOS distributed storage gateway. FIG. 4 shows the relationship between the metadata center engine and CEPH: the bottom layer of CEPH uses the BlueStore distributed storage engine, and the metadata engine reads and writes key-value pairs through the librados interface. Externally the metadata center exposes an RPC interface (RPC service implementation); internally it implements a two-dimensional HASH algorithm and multi-version elastic capacity expansion (OSS Index Metadata implementation).

FIG. 5 shows the basic architecture of the metadata center. At the bottom layer the metadata center attaches to multiple storage resource pools, including a bucket virtual mapping resource pool (RADOS MD Cluster) and an object index resource pool (MD_index_pool). The object index resource pool can be expanded across clusters to meet the need for large-scale, smooth metadata expansion; it can also be initialized within a single object storage cluster from several groups of resource pools (distributed block storage resources formed from multiple disks).

FIG. 6 is a schematic diagram of the two-dimensional HASH algorithm of the metadata center. The algorithm computes a row hash and a column hash from the object's name to obtain a row modulus value and a column modulus value, and uses the two values to locate a position in the two-dimensional HASH matrix. Objects that hit the same position are inserted in sorted order using a B+ tree, where the key is the object name and the value is the information of the corresponding object storage cluster. The number of key-value pairs at each coordinate is kept within 60,000 to 100,000, which keeps sorted query and insertion efficient.
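A minimal Python sketch of the two-dimensional lookup follows. Several choices here are assumptions, since the patent does not name them: SHA-256 and BLAKE2b stand in for the row and column hash functions, and a sorted list maintained with `bisect` stands in for the B+ tree at each matrix position.

```python
import bisect
import hashlib
from typing import Optional, Tuple


class TwoDimHashIndex:
    """rows x cols matrix of sorted key/value cells (illustrative only)."""

    def __init__(self, rows: int, cols: int):
        self.rows, self.cols = rows, cols
        # Each cell holds two parallel lists: sorted object names, cluster info.
        self.cells = [[([], []) for _ in range(cols)] for _ in range(rows)]

    def position(self, name: str) -> Tuple[int, int]:
        """Row and column modulus values computed from the object name."""
        data = name.encode("utf-8")
        row_hash = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
        col_hash = int.from_bytes(hashlib.blake2b(data).digest()[:8], "big")
        return row_hash % self.rows, col_hash % self.cols

    def put(self, name: str, cluster: str) -> None:
        row, col = self.position(name)
        keys, vals = self.cells[row][col]
        i = bisect.bisect_left(keys, name)  # sorted insertion point
        if i < len(keys) and keys[i] == name:
            vals[i] = cluster  # update an existing record
        else:
            keys.insert(i, name)
            vals.insert(i, cluster)

    def get(self, name: str) -> Optional[str]:
        row, col = self.position(name)
        keys, vals = self.cells[row][col]
        i = bisect.bisect_left(keys, name)
        return vals[i] if i < len(keys) and keys[i] == name else None


index = TwoDimHashIndex(rows=4, cols=4)
for n in range(1000):
    index.put(f"obj-{n}", f"cluster-{n % 3}")
print(index.position("obj-42"), index.get("obj-42"))
```

Using two independent hashes spreads objects over `rows * cols` cells, so the per-cell key count can be bounded by choosing the matrix dimensions; the sorted structure in each cell keeps lookups logarithmic in the cell size.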

FIG. 7 is a basic structure diagram of the multi-version mechanism of the metadata center. A write request writes to the latest version by default, where the version refers to the header information of the object metadata; a query request searches step by step from the highest version to the lowest. Multi-version expansion additionally allows the HASH modulus value to be adjusted, which on the one hand copes with large-scale adjustment of a single-version bucket and on the other hand avoids the query slowdown caused by too many versions.

Those skilled in the art will appreciate that adjusting the HASH modulus value, that is, adjusting the modulus parameter of the HASH, can be achieved by conventional techniques.
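The write-latest, query-high-to-low behaviour can be sketched in Python as below. This is one possible reading, stated as an assumption: each version is modelled as a separate index generation with its own HASH modulus, CRC32 stands in for the unspecified hash, and all class and method names are hypothetical.

```python
import zlib
from typing import Dict, List, Optional


class IndexVersion:
    """One index generation with a fixed HASH modulus (illustrative)."""

    def __init__(self, modulus: int):
        self.modulus = modulus
        self.shards: List[Dict[str, str]] = [{} for _ in range(modulus)]

    def _shard(self, name: str) -> Dict[str, str]:
        return self.shards[zlib.crc32(name.encode("utf-8")) % self.modulus]

    def put(self, name: str, cluster: str) -> None:
        self._shard(name)[name] = cluster

    def get(self, name: str) -> Optional[str]:
        return self._shard(name).get(name)


class MultiVersionIndex:
    def __init__(self, modulus: int):
        self.versions = [IndexVersion(modulus)]

    def expand(self, new_modulus: int) -> None:
        """Add a new version with an adjusted modulus; old data is untouched."""
        self.versions.append(IndexVersion(new_modulus))

    def put(self, name: str, cluster: str) -> None:
        self.versions[-1].put(name, cluster)  # always write the latest version

    def get(self, name: str) -> Optional[str]:
        for version in reversed(self.versions):  # high version -> low version
            hit = version.get(name)
            if hit is not None:
                return hit
        return None


idx = MultiVersionIndex(modulus=4)
idx.put("a.bin", "cluster-1")
idx.expand(new_modulus=16)      # capacity expansion: no rehash of version 0
idx.put("b.bin", "cluster-2")
idx.put("a.bin", "cluster-3")   # newer record shadows the one in version 0
print(idx.get("a.bin"), idx.get("b.bin"))
```

Because a larger modulus can be chosen for each new version, fewer versions are needed to absorb growth, which is the trade-off the text describes between handling scale and keeping the high-to-low query chain short.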

Fourth, the operation and maintenance component and the data migration component.

The operation and maintenance component is mainly responsible for change operations on the entry component, the virtual mapping layer and the metadata center; the data migration component supports smooth data migration between the multiple object storage clusters.

The scheme of the embodiment of the invention mainly has the following beneficial effects:

1) The single-bucket capacity expansion method involves no data balancing and no single-cluster, single-bucket scale risk, and capacity can be expanded horizontally by design; during capacity expansion the service is unaware of the change, and the hot-switched configuration takes effect immediately.

2) There is no single-cluster, single-bucket performance bottleneck; the performance and reliability of the object storage service are ensured by attaching multiple heterogeneous object storage clusters through a virtual mapping method.

3) Few requirements are placed on the type of the underlying object storage cluster: underlying data can be stored across multiple heterogeneous object storage clusters, and an additional object storage cluster can join as an underlying cluster as long as it supports the Swift interface or the S3 interface.

4) The metadata center makes full use of the characteristics of the application scenario. Its design focuses on scale and on the difficulty of later capacity expansion, and it achieves a single-bucket metadata storage scale of billions through multiple versions and multiple storage resource pools.

Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.

It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for expanding a billion-level bucket of objects, comprising: constructing a unified management platform comprising a virtual mapping layer, a metadata center, an entry component and an operation and maintenance component; wherein:

2. The method of claim 1, wherein the working process of the entry component comprises:

after intercepting a user request, determining the object the user request refers to, and authenticating through a service layer whether the user key carried in the user request is valid;

if the key is valid, entering routing layer processing, which first determines the type of the user request and whether the file involved is fragmented;

for a PUT request of a non-fragmented file, acquiring the latest record information of the corresponding object through the metadata center;

for a GET request, HEAD request or DELETE request of a non-fragmented file, sending an RPC request to the metadata center and checking whether the metadata center holds record information of the corresponding object; if so, forwarding the request directly to the corresponding underlying object storage cluster; if not, querying each underlying object storage cluster through a fallback Object HEAD request, writing the queried data back to the metadata center, and then forwarding the request to the corresponding underlying object storage cluster;

for a fragmented file, first looking up the location information of the first fragment of the file by querying the metadata center; if the record is found, returning it directly, and otherwise continuing the query through an Object HEAD request; other operations being handled as for non-fragmented files;

meanwhile, all operation logs are written to a log retrieval engine during operation, and an APM system monitors the working state of the entry component;

wherein PUT is a write or update operation, and GET, HEAD, and DELETE are a query operation, a read operation, and a delete operation, respectively.

3. The method for expanding a billion-level object storage bucket according to claim 1, wherein

the object storage bucket has a one-to-many relationship with the underlying heterogeneous object storage clusters, and the mapping relationship is stored and maintained by the virtual mapping layer; and when the water level of the object storage cluster written to by default exceeds a set value, the default write object storage cluster is changed with hot effect, and the write traffic of the current object storage bucket is forwarded to the changed default write object storage cluster.

4. The method for expanding a billion-level object storage bucket according to claim 1, wherein

the metadata center is the core engine of the unified management platform; externally the metadata center exposes an RPC interface, and internally it implements a two-dimensional HASH algorithm and a multi-version elastic capacity expansion function; the bottom layer of the metadata center attaches to multiple storage resource pools, including a bucket virtual mapping resource pool and an object index resource pool; the object index resource pool can be expanded across clusters to meet the need for large-scale, smooth metadata expansion, and large-scale initialization can also be performed through multiple groups of resource pools within a single cluster.

5. The method of claim 4, wherein the two-dimensional HASH algorithm comprises:

computing a row hash and a column hash of the object from the name of the object to obtain a row modulus value and a column modulus value; locating position information in the two-dimensional HASH matrix from the row modulus value and the column modulus value; and, for objects that hit the same position, using a B+ tree insertion ordering method, wherein the key is the object name and the value is the corresponding cluster information.

6. The method of claim 4, wherein the metadata center, for a write data request, writes the latest version by default, the version referring to the header information of the object metadata; for a query request, queries step by step from the highest version to the lowest version; and, in multi-version expansion, additionally adjusts the HASH modulus value.

7. The method for expanding a billion-level object storage bucket according to any one of claims 1 through 6, wherein

uniformly managing the data plane of the multiple heterogeneous object storage clusters comprises: acquiring the actual object storage cluster location of the current object through the metadata center and the virtual mapping layer; if no record exists, having the entry component query the actual object storage cluster location of the object, updating the metadata center with that location once the record information is found, and forwarding the request to the corresponding underlying object storage cluster;

uniformly managing the control plane of the multiple heterogeneous object storage clusters comprises: when the water level of the object storage cluster written to by default exceeds a set value, smoothly switching to a new object storage cluster.