CN107967124B - Distributed persistent memory storage system and method - Google Patents


Info

Publication number: CN107967124B (application CN201711344457.4A)
Authority: CN (China)
Prior art keywords: data, file, key, storage, persistent memory
Legal status (assumed by Google; not a legal conclusion): Active
Application number: CN201711344457.4A
Other languages: Chinese (zh)
Other versions: CN107967124A
Inventors: 刘鹏, 张真, 王昌淦, 章亮, 王义飞, 王小聪
Current Assignee (listing may be inaccurate): Nanjing Innovative Data Technologies Inc
Original Assignee: Nanjing Innovative Data Technologies Inc
Application filed by Nanjing Innovative Data Technologies Inc
Priority: CN201711344457.4A
Published as CN107967124A; application granted and published as CN107967124B

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 — Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from the processing unit to the output unit, e.g. interface arrangements
    • G06F 3/06 — Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 — Interfaces specially adapted for storage systems
    • G06F 3/0602 — Interfaces specifically adapted to achieve a particular effect
    • G06F 3/0604 — Improving or facilitating administration, e.g. storage management
    • G06F 3/0607 — Facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • G06F 3/0628 — Interfaces making use of a particular technique
    • G06F 3/0629 — Configuration or reconfiguration of storage systems
    • G06F 3/0668 — Interfaces adopting a particular infrastructure
    • G06F 3/067 — Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Abstract

A distributed persistent memory storage system and method is designed around persistent memory. It adopts a distributed storage approach in which data is scattered across different nodes, performs I/O on bare devices by managing raw disks directly, and stores the associated metadata in a key/value database in key/value form. The user's operations on the key/value database are abstracted into a corresponding interface, to which a small file system is directly attached; all metadata is loaded into memory, and the file system's data and log files are stored on the bare device through the block device, so the file system and the system can either share one bare device or each be assigned a different device. By combining this distributed storage approach and abandoning the local file system, the invention further reduces file-system overhead and fully exploits the performance advantage of persistent memory.

Description

Distributed persistent memory storage system and method
Technical Field
The invention belongs to the field of distributed big-data cloud storage, and particularly relates to a distributed persistent memory storage system and method.
Background
Information data is growing explosively in modern society, and this explosive growth causes multiple problems, such as data being processed too slowly; storage systems based on traditional storage technology increasingly fail to meet computer systems' requirements for performance and power consumption. However, as memory prices fall and memory capacity grows, and because memory read/write speed exceeds that of magnetic disks by more than an order of magnitude, it has become feasible to develop a distributed persistent memory storage system to handle mass data processing and storage.
A traditional cloud storage system usually adopts a centralized storage node to store all data, so the storage node becomes a bottleneck for system performance and a focal point for reliability and security concerns, and cannot meet the demands of large-scale storage applications. In addition, although using mechanical disks as the storage medium enables mass data storage at low cost, a duplicate log must be written before the data itself, doubling write amplification; with mass data this wastes enormous system resources, consumes a great deal of time, and sharply reduces working efficiency.
Disclosure of Invention
The technical problems to be solved by the invention are as follows. First, a traditional cloud storage system usually locates data by table lookup, which easily becomes a system bottleneck. Second, to support transactions, the interface of a conventional cloud storage system introduces a log mechanism: every write must first be written to a log (in the manner of the journaling file system XFS) and then to the local file system, so each piece of data is written twice; with large-scale sequential IO the disk's throughput is therefore only half of its physical capability. Third, a single IO must pass through several modules such as the kernel module, the bare device, and the storage engine; queue and thread switches occur between modules, and some modules copy memory while handling the IO, so the IO path is too long and overall performance suffers. Fourth, the single request queue of the traditional IO interface standard at the traditional block layer suffers from request-queue lock contention, hardware interrupts, and remote memory access, which limits scalability. Fifth, most conventional cloud storage systems are designed around mechanical hard disks, so the physical performance of persistent memory, especially its latency and IOPS, cannot be fully exploited.
Aiming at the defects in the prior art, the invention provides a distributed persistent memory storage system and method.
In order to achieve the purpose, the invention adopts the following technical scheme:
a distributed persistent memory storage system, comprising: a key/value database, an operation abstract interface, a file system and a bottommost block device; the key/value database stores pre-written logs, data object metadata, addressing data information and distributor metadata; in order to interface an abstract interface of operation, a small file system is realized, and the distribution and management of metadata, file space and disk space are realized; the distributor employs computational addressing algorithms to decide where the real data should be stored; when the data are stored, the data are directly written into the bare device through the distributor, the metadata are stored into a key/value database in a key/value mode, the small file system is directly connected with the small file system through a related operation abstract interface, and the metadata are stored on the bare device through the block device.
In order to optimize the technical scheme, the specific measures adopted further comprise:
the key/value database takes a hash table as a data structure.
The related operations on the key/value database are encapsulated into an interface, and the small file system docks with this interface directly, providing the key/value database with an encapsulation of the underlying system.
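As an illustration only (this sketch is not part of the patent text), the encapsulation described above can be pictured as a thin interface over a hash-table-backed store; the class and method names here are assumptions, and a real implementation would persist to the bare device rather than a Python dict.

```python
class KVStore:
    """In-memory stand-in for the hash-table key/value database.

    The file system layer calls only these methods, never the table
    directly -- mirroring the 'encapsulation of the underlying system'.
    """
    def __init__(self):
        self._table = {}          # hash table as the data structure

    def put(self, key, value):
        self._table[key] = value

    def get(self, key):
        return self._table.get(key)

    def delete(self, key):
        self._table.pop(key, None)
```

A caller stores metadata keyed by object name and reads it back through the same interface, so the storage backend can change without touching the file system code.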
The system adopts nonvolatile memory devices as the storage medium, namely NVM main memory and storage-class memory.
The small file system comprises a client and a server. The server stores data; the client receives and processes file-operation commands intercepted from the virtual file system (VFS); and the connection between client and server is implemented with Remote Procedure Call (RPC) protocol packets.
The computational addressing algorithm is as follows: stripe the file according to preset or required data and number the resulting objects, so each object obtains a unique identifier; hash the objects by identifier to ensure they are uniformly distributed across different virtual nodes; distribute and store the virtual nodes' data uniformly according to device weight, real-time node computing resources, and node network resources. During the calculation, the storage position of a data object is determined by the obtained cluster state map and the distribution policy.
In addition, a storage method for the distributed persistent memory storage system is also provided, comprising the following steps:
when the system performs a write operation, it first judges whether the I/O is aligned according to the minimum allocation unit. An aligned write request generates a logical data block and a binary-file storage container according to the actual size of the metadata; the region spanned by the data block is an integer multiple of the minimum allocation unit. If the interval was written before, the previous data block is recorded to facilitate subsequent space reclamation, and the data is then written into the binary-file storage container;
a non-aligned write request first searches for a reusable binary-file storage container according to the offset of the starting position. If one is found, an align-and-zero-pad operation is performed according to the block size, and the system then judges whether the container's free space can be used directly, distinguishing between a direct write and an overwrite;
when a direct write is executed, the system directly generates a logical data block and places it in the binary-file storage container;
when an overwrite is executed, the system aligns to the block size according to the offset and length of the starting position. If the covered region is exactly aligned to the block size, no data needs to be read from the bare device; if it is not aligned, the unaligned part is read out to assemble an aligned buffer, a new logical data block is generated, the original logical data block is adjusted, the region to be reclaimed is recorded, and the result is finally written into the binary-file storage container. In the overwrite case, data is never written directly to the bare device; it is written to the key/value database through the write-ahead log system;
if no reusable binary-file storage container is found, the system performs zero-padding alignment according to whether the offset and length of the starting position are aligned to the block size, and finally places the aligned starting offset and length as data blocks into a binary-file storage container.
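The first decision of the method above — aligned versus non-aligned — can be sketched as a simple check (an illustrative sketch, not patent text; the 4 KB minimum allocation unit is taken from the default block size mentioned later in the description):

```python
MIN_ALLOC = 4096  # assumed minimum allocation unit (4 KB default block size)

def classify_write(offset, length, min_alloc=MIN_ALLOC):
    """Return 'aligned' if the request starts and ends on allocation-unit
    boundaries; otherwise 'unaligned', which triggers the reusable-container
    lookup and zero-padding path described in the method."""
    if offset % min_alloc == 0 and length % min_alloc == 0:
        return "aligned"
    return "unaligned"
```

For example, a write of 8192 bytes at offset 0 takes the aligned path, while a write of 4096 bytes at offset 100 takes the non-aligned path.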
The invention has the beneficial effects that:
[Table comparing cost and performance of storage media — patent image BDA0001507781650000031, not reproduced]
As the above table shows, enterprises generally weigh cost when selecting a storage system. A conventional cloud storage system usually uses mechanical media such as disks and tape libraries as its storage medium; its cost is low, but its performance efficiency is correspondingly reduced. The distributed persistent memory storage system designed by the invention uses nonvolatile memory devices such as flash memory and SSDs as the storage medium; the cost is acceptable to most enterprises, making it possible to improve the system's overall performance at a reasonable cost.
Drawings
FIG. 1 is a block diagram of the overall architecture of the distributed persistent memory storage system of the present invention.
FIG. 2 is a schematic diagram of the computational addressing algorithm of the distributed persistent memory storage system of the present invention.
FIG. 3 is a block diagram of the mini-file system of the distributed persistent memory storage system of the present invention.
FIG. 4 is a diagram of an application program interface multi-queue structure of the mini-file system of the distributed persistent memory storage system of the present invention.
FIG. 5 is a process diagram of a distributed persistent memory storage system implementing a read I/O operation in accordance with the present invention.
FIG. 6 is a flow chart of a write I/O operation implemented by the distributed persistent memory storage system of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
Aiming at the defects of the traditional cloud storage system — data write amplification, an overly long IO path, insufficient support for high-performance hardware, and so on — the distributed persistent memory storage system shown in FIG. 1 is provided. It comprises a key/value database, an operation abstraction interface, a file system, and the bottom-level block device. The key/value database stores write-ahead logs, data-object metadata, addressing data, and the distributor's metadata (the distributor is responsible for deciding where the actual data should be stored). The operation abstraction interface provides the key/value database with an encapsulation of the underlying system. A small file system is implemented to dock with the operation abstraction interface, handling the allocation and management of metadata, file space, and disk space. Because the local file system is abandoned, I/O operations on the block-device file can be performed directly through the Linux system.
The system uses computational addressing instead of the table-lookup approach of traditional distributed storage. During storage, data is written directly to the bare device through the distributor, metadata is stored in the key/value database in key/value form, and the small file system is then docked directly through the related operation abstraction interface and stores the metadata on the bare device through the block device. This improves mass-data write speed, reduces resource waste and time consumption, achieves automatic load balancing across storage nodes, and maximizes overall system performance.
By scattering data across different storage nodes, the storage load is shared among them; storage locations are found through the computational addressing algorithm, which improves the system's reliability, availability, and access efficiency and makes it easy to expand.
The system abandons local file systems such as ext4/xfs and manages the bare device directly. Direct management of the bare device necessarily requires managing its space, so a corresponding distributor was developed; it uses the computational addressing algorithm to determine where the real data should be stored.
Metadata is needed to locate the mapping position of data, so metadata storage is critical. The system stores metadata in key/value form in a key/value database whose data structure is a hash table, improving metadata read speed. To make the key/value database convenient for users, the system encapsulates its related operations into an interface that docks directly with the small file system, providing the key/value database with an encapsulation of the underlying system. The system also introduces the small file system, which ensures that metadata can be better stored on the block device, and adopts a brand-new application program interface standard.
The system adopts nonvolatile memory devices, namely NVM main memory and storage-class memory, as the storage medium; their characteristic is that data is retained even after power loss.
In the computational addressing algorithm shown in FIG. 2, a file is first fragmented, i.e. striped, according to preset or required data; the resulting objects are numbered, and each object obtains a unique id. Next, the objects are hashed by id, ensuring they are uniformly distributed across different virtual nodes. Finally, the data of the virtual nodes containing the objects is uniformly distributed and stored according to the computational addressing algorithm together with device weights, real-time node computing resources, node network resources, and so on. During the calculation, the storage position of a data object is determined by the obtained cluster state map and the distribution policy.
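The three steps above — stripe, hash to a virtual node, place by weight — can be sketched as follows. This is an illustrative sketch, not patent text: the stripe size, virtual-node count, hash function (MD5), and the straw-style weighted draw are all assumptions standing in for the unspecified details of the algorithm.

```python
import hashlib

STRIPE = 4 * 1024 * 1024  # assumed stripe (object) size
VNODES = 128              # assumed number of virtual nodes

def stripe(file_id, size, stripe_size=STRIPE):
    """Step 1: split a file into numbered objects, each with a unique id."""
    count = (size + stripe_size - 1) // stripe_size
    return [f"{file_id}.{i}" for i in range(count)]

def vnode_of(obj_id, vnodes=VNODES):
    """Step 2: hash the object identifier onto a virtual node."""
    h = int(hashlib.md5(obj_id.encode()).hexdigest(), 16)
    return h % vnodes

def place(vnode, devices):
    """Step 3: weighted placement -- pick a device by a deterministic
    weighted hash draw. `devices` maps device name -> weight; a higher
    weight receives proportionally more data."""
    best, best_score = None, -1.0
    for name, weight in devices.items():
        h = int(hashlib.md5(f"{vnode}:{name}".encode()).hexdigest(), 16)
        score = weight * (h % 10**6)   # straw-style draw scaled by weight
        if score > best_score:
            best, best_score = name, score
    return best
```

Because every step is a pure function of the object id and the device map, any node can compute the same placement without consulting a lookup table, which is the point of computational addressing.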
As shown in FIG. 3, the mini file system developed for the system consists of a client and a server. The server stores data; the client receives and processes file-operation commands intercepted from the virtual file system (VFS); and the connection between client and server is implemented with Remote Procedure Call (RPC) protocol packets. The workflow of the mini file system is as follows:
First, the application calls the file-storage function of the VFS, and the VFS stores the file under the mini file system's folder. The VFS operation is passed to the mini file system's kernel module, which receives the VFS storage command and passes it to the client; the client is responsible for storing the data handed over by the kernel. The client adds the data-storage command to the write queue of the corresponding device (which may include the server) and performs the write at an appropriate time.
When the system reads data, the application only issues a VFS read command. The VFS tells the mini file system's kernel module that a file must be read; the kernel module, following its own scheme, tells the client which file to read; and the client fetches the file from some device (possibly the server, communicating via RPC packets) and sends it back to the application.
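The read path above — VFS, kernel module, client, server — can be pictured with an in-process stand-in for the RPC boundary. This is an illustrative sketch, not patent text; the class names and the dict-based request format are assumptions, and the direct method call stands in for a real RPC packet exchange.

```python
class Server:
    """Holds stored file data; in the real system this sits behind RPC."""
    def __init__(self):
        self.files = {}

    def handle(self, request):
        # stand-in for an RPC endpoint decoding a protocol packet
        op, name = request["op"], request["name"]
        if op == "read":
            return self.files.get(name)

class Client:
    """Receives commands the kernel module intercepted from the VFS."""
    def __init__(self, server):
        self.server = server           # real system: an RPC connection

    def read(self, name):
        # the client fetches the file from the (possibly remote) server
        # and hands it back up toward the application
        return self.server.handle({"op": "read", "name": name})
```

A read of a file the server does not hold simply returns nothing, which in the full system corresponds to the zero-fill behavior described later for unwritten data.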
As shown in FIG. 4, the system adopts a completely new interface standard for the mini file system's application program interface. In the IO stack, after submit_bio submits a bio request to the block layer, the generic_make_request and __generic_make_request functions provided by the block layer perform simple checks and processing, and the bio request is then submitted to the driver layer. There the make_request function registered by the interface-standard driver handles it: it adds the request to the request queue of the corresponding CPU, generates the command format required by the interface-standard protocol for the request, and prepares a scatter list for DMA for the command. Finally, the device is notified, in PIO mode, to fetch the newly submitted command from the request queue. After the device finishes the command, it first posts the completion status to the completion queue of the corresponding host CPU, then notifies the host of IO completion via an interrupt, and the host responds according to the command's completion state.
The distributed persistent memory storage system is specifically designed as follows:
and configuring a weight for each device when the cluster is constructed, and calculating the distribution of the data objects by adopting a calculation addressing algorithm through the weight of each device and real-time storage node load resources. The distribution of the objects is mainly determined by a cluster load state diagram and a distribution strategy, wherein the cluster load state diagram describes available resources, real-time load and a hierarchical structure of the system, the cluster load state diagram comprises the number of racks and storage node hard disks, and the distribution strategy specifies the storage strategy of the storage pool, and the storage strategy comprises copies, erasure codes and storage limits of the copies and the erasure codes, including whether the copies are distributed in different racks, different servers and the like. The set of data x to storage devices is calculated by these composite factors: ala (x) ═ e (devicel, device2, …, deviceN).
Since local file systems such as ext4/xfs are abandoned in favor of directly managing bare devices, the system necessarily has to manage the bare device's space; it uses a distributor whose computational addressing algorithm determines where real data should be stored, achieving finer-grained writes.
For metadata storage the system adopts a key/value database, so metadata is stored in the database in key/value form. A piece of metadata contains multiple logical data blocks, recorded with byte addressing. A logical data block is mapped to a binary-file storage container by the container's id, and the container is mapped to a region on the actual physical disk by the offset and length of its starting position.
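The two-level mapping just described — logical block → container id → physical disk region — can be sketched with two small records (illustrative only, not patent text; field names and the `container_offset` field are assumptions added to make the address translation concrete):

```python
from dataclasses import dataclass

@dataclass
class Container:
    """Binary-file storage container: a region on the physical disk."""
    cid: int
    disk_offset: int   # where the container's region starts on disk
    length: int        # size of the region

@dataclass
class LogicalBlock:
    """Logical data block inside a piece of metadata (byte-addressed)."""
    file_offset: int       # byte address within the file
    length: int
    cid: int               # id of the container holding the bytes
    container_offset: int  # position of the bytes inside that container

def resolve(block, containers):
    """Translate a logical data block to its physical disk offset."""
    c = containers[block.cid]
    assert block.container_offset + block.length <= c.length
    return c.disk_offset + block.container_offset
```

Looking up a block is therefore one key/value fetch (block → container id) plus one addition, which keeps the metadata path short.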
The system encapsulates the user's operations on the key/value database into an interface that connects the key/value database and the file system, and persists metadata in the log. When the system mounts the file system, replaying the log is enough to store all metadata on the bare device through the block device.
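The replay-on-mount behavior can be sketched as a minimal write-ahead log (illustrative only, not patent text; the list-backed log and the last-write-wins replay rule are assumptions standing in for the real persistent log format):

```python
class WAL:
    """Minimal write-ahead-log sketch: metadata updates are appended to
    the log first; on mount, replaying the log rebuilds the key/value
    state, so nothing else needs to be read back."""
    def __init__(self):
        self.entries = []

    def append(self, key, value):
        self.entries.append((key, value))   # log write precedes any apply

    def replay(self):
        state = {}
        for key, value in self.entries:     # later entries win
            state[key] = value
        return state
```

Because every metadata update reaches the log before anything else, the state reconstructed by `replay()` is exactly the state at the last append, which is what makes the mount-time rebuild safe.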
To fully exploit the performance of current and future persistent memory, the system adopts a brand-new interface standard for the mini file system's application program interface: it defines a dedicated command set and optimized interface registers for persistent memory, and uses multiple submission queues and completion queues on the host side to provide parallelism and scalability.
First, for persistent memory with high random-access performance, the IO stack based on the new interface bypasses the IO scheduler at the block layer, eliminating the latency the scheduler adds through request merging, reordering, and similar operations. Second, in the interface driver, the client implements one pair of admin submission/completion queues and can support multiple pairs of IO submission and IO completion queues; the queues scale with the number of CPU cores, reducing queue-lock contention, improving cache hit rate and parallelism, fully utilizing hardware bandwidth, and providing good scalability.
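The per-core queue-pair idea can be sketched as follows (illustrative only, not patent text; the class shape and the synchronous `device_poll` stand-in for doorbell/interrupt handling are assumptions):

```python
from collections import deque

class MultiQueue:
    """One submission/completion queue pair per CPU core, so cores can
    submit commands without sharing a queue lock -- a simplified model of
    the multi-queue interface standard described above."""
    def __init__(self, cores):
        self.sq = [deque() for _ in range(cores)]   # submission queues
        self.cq = [deque() for _ in range(cores)]   # completion queues

    def submit(self, core, cmd):
        # each core appends only to its own queue: no cross-core lock
        self.sq[core].append(cmd)

    def device_poll(self):
        # the device drains each submission queue and posts the completion
        # status to the matching completion queue (interrupt omitted)
        for core, q in enumerate(self.sq):
            while q:
                cmd = q.popleft()
                self.cq[core].append(("done", cmd))
```

Since submissions from different cores never touch the same queue, adding cores adds queue pairs rather than contention, which is the scalability argument made in the text.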
The system exposes the persistent memory chip directly to the CPU, i.e. places it on the memory bus, so the CPU can address the memory storage chip with ordinary loads and stores. This lets the mini file system operate atomically and guarantees program execution order and file-system consistency. The architecture reduces access latency and exploits the byte-addressing characteristics of the memory storage chip.
In addition, when DRAM and the persistent memory have similar performance, random reads and writes combined with a large cache would inflate the cache and cause repeated reads and writes, hurting file-system performance; the mini file system therefore eliminates the DRAM-based cache and relies on the faster CPU cache, greatly improving file-system performance.
Persistent memory is chosen for the bare device. Compared with volatile memory, persistent memory offers high speed, high density, miniaturization, low power consumption, radiation resistance, and retention of data after power loss; it effectively improves the system's throughput and overall performance and ensures its reliability. The nonvolatile memory used in the system falls into two main categories:
The first category is NVM main memory, which can directly replace traditional DRAM or be combined with it into a hybrid main memory. Its advantage is that it is controlled by hardware: only the main-memory controller needs modest improvements tailored to NVM's characteristics, upper-layer applications see no change, and no system-level changes (operating system, file system, etc.) are required. The NVM main-memory structure chiefly exploits persistent memory's low static power consumption while meeting traditional DRAM's performance requirements and raising storage density as much as possible.
The second category is storage-class memory, a general term for storage devices positioned in the hierarchy between traditional DRAM main memory and HDD external storage. Compared with HDD external storage, storage-class memory has no moving parts, low latency, and high throughput; compared with DRAM main memory, it offers nonvolatility, low cost, low power consumption, and other advantages.
By combining the two categories of persistent memory, users can select different types for different usage scenarios to bring out the system's overall performance.
Because the system stores data on persistent memory according to computational addressing, a memory file system is used to make full use of the hardware; internal data transfers use RDMA over an InfiniBand network, and the network model is asynchronous, raising link concurrency and maximizing the speed of the memory chips.
As shown in FIG. 5, when the system performs a read operation, it searches for the relevant logical data blocks; if unwritten data is encountered during the search, that data is simply zero-filled.
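The zero-fill behavior on reads can be sketched directly (illustrative only, not patent text; representing written data as an offset→bytes map is an assumption):

```python
def read_with_zero_fill(blocks, offset, length):
    """Read `length` bytes starting at `offset`; any region with no
    written logical block comes back zero-filled, matching the read
    path in FIG. 5.
    `blocks` maps byte offset -> bytes actually written there."""
    out = bytearray(length)            # starts out all zeros
    for start, data in blocks.items():
        for i, byte in enumerate(data):
            pos = start + i - offset
            if 0 <= pos < length:      # copy only overlapping bytes
                out[pos] = byte
    return bytes(out)
```

A read over a completely unwritten range therefore returns all zeros without touching the bare device at all.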
As shown in FIG. 6, when the system performs a write operation, it is determined whether the I/O is aligned according to the minimum allocation unit.
An aligned write request generates a logical data block and a binary-file storage container according to the actual size of the metadata; the region spanned by the data block is an integer multiple of the minimum allocation unit. If the interval was written before, the previous data block is recorded to facilitate subsequent space reclamation, and the data is then written into the binary-file storage container.
A non-aligned write request first looks for a reusable binary-file storage container according to the offset of the starting position.
If a reusable binary-file storage container is found, zero padding to the block size is performed first, and the system then judges whether the container's free space can be used directly, distinguishing between a direct write and an overwrite.
When a direct write is executed, the system directly generates a logical data block and places it in the binary-file storage container. When an overwrite is executed, the system aligns to the block size (4 KB by default) according to the offset and length of the starting position. If the covered region is exactly aligned to the block size, no data needs to be read from the bare device; if it is not aligned, the unaligned part is read out and spliced into an aligned buffer, a new logical data block is generated, the original logical data block is adjusted, the portion to be reclaimed is recorded, and the result is finally written into the binary-file storage container. In the overwrite case, data is never written directly to the bare device; it is written to the key/value database through the write-ahead log system.
If no reusable binary-file storage container is found, the system performs zero-padding alignment according to whether the offset and length of the starting position are aligned to the block size (4 KB by default). The padding is therefore tied to the subsequent disk write, which can proceed in two ways: direct I/O, which requires the offset and buffer to be aligned; or standard I/O, which does not require alignment but writes to a buffer first and then synchronizes to disk, reducing write efficiency. Finally, the aligned starting offset and length are used as data blocks and placed into a binary-file storage container.
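The overwrite alignment rule in FIG. 6 — expand the covered region to block boundaries, and read back from the device only when the edges are unaligned — can be sketched as a small planning function (illustrative only, not patent text; the function name and return shape are assumptions, the 4 KB default is from the description):

```python
BLOCK = 4096  # default block size stated in the description (4 KB)

def overwrite_plan(offset, length, block=BLOCK):
    """Decide whether an overwrite needs a read-modify-write.
    Returns (aligned_start, aligned_end, needs_read): the block-aligned
    region to rewrite, and whether the unaligned head/tail must first be
    read back from the bare device to splice an aligned buffer."""
    head = offset % block
    tail = (offset + length) % block
    aligned_start = offset - head
    aligned_end = offset + length + ((block - tail) % block)
    needs_read = head != 0 or tail != 0
    return aligned_start, aligned_end, needs_read
```

An overwrite that lands exactly on block boundaries skips the read-back entirely, while one starting at, say, offset 100 forces both edges to be read and spliced first.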
The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to the embodiment described; all technical solutions falling within the idea of the present invention belong to its scope of protection. It should be noted that those skilled in the art may make modifications and refinements without departing from the principle of the invention, and these likewise fall within the scope of protection.

Claims (6)

1. A distributed persistent memory storage system, comprising: a key/value database, an operation abstraction interface, a file system, and an underlying block device; wherein the key/value database stores write-ahead logs, data object metadata, addressing information, and allocator metadata; the file system is implemented against the operation abstraction interface and handles the allocation and management of metadata, file space, and disk space; the allocator employs a computational addressing algorithm to decide where data should be stored; when data is stored, the data is written directly to the raw device through the allocator, the metadata is stored in the key/value database in key/value form, and the system then connects directly to the file system through the relevant operation abstraction interface and stores data on the raw device through the block device; wherein the computational addressing algorithm specifically comprises: striping a file according to preset or required data to obtain separate objects and numbering them, so that each object obtains a unique identifier; hashing the objects according to their identifiers so that the objects are evenly distributed across different virtual nodes; distributing and storing the data of the virtual nodes containing the objects evenly according to device weights, real-time node computing resources, and node network resources; and, during the calculation, determining the storage location of a data object from the acquired cluster state map and the distribution policy.
2. A distributed persistent memory storage system as claimed in claim 1, wherein: the key/value database uses a hash table as its data structure.
3. A distributed persistent memory storage system as claimed in claim 1, wherein: the operations of the key/value database are encapsulated into an interface, and the interface connects directly to the file system, providing the key/value database with an encapsulation of the underlying system.
4. A distributed persistent memory storage system as claimed in claim 1, wherein: the system uses non-volatile memory devices as its storage medium, namely NVM main memory and storage-class memory.
5. A distributed persistent memory storage system as claimed in claim 1, wherein: the file system comprises a client and a server, the server is used to store data, the client receives and processes file operation commands intercepted from the virtual file system (VFS), and the connection between the client and the server is implemented by Remote Procedure Call (RPC) protocol packets.
6. A method of storing in a distributed persistent memory storage system according to any one of claims 1 to 5, comprising the steps of:
when the system performs a write operation, whether the I/O is aligned is judged against the minimum allocation unit; an aligned write request is processed by generating a logical data block and a binary file storage container according to the actual size of the metadata, the region spanned by the data block being an integer multiple of the minimum allocation unit; if the region has been written previously, the previous data block is recorded to facilitate subsequent space reclamation, and the data block is then written into the binary file storage container;
a non-aligned write request first searches, according to the offset of the starting position in the binary file, for a reusable binary file storage container; if one is found, an alignment and zero-padding operation is first performed according to the block size, and the free space of the binary file storage container is then examined to decide whether a direct write or an overwrite operation should be carried out;
when the direct write operation is executed, the system directly generates a logical data block and places it in the binary file storage container;
when the overwrite operation is executed, the system aligns to the block size according to the offset and length of the starting position; if the overwritten region is exactly aligned to the block size, the data does not need to be read from the raw device; if it is not aligned, the unaligned portion is read out to form an aligned buffer, a new logical data block is then generated, the original logical data block is adjusted, the region to be reclaimed is recorded, and the result is finally written into the binary file storage container; in the overwrite case, the data is not written directly to the raw device but is written to the key/value database through the write-ahead log system;
if no reusable binary file storage container is found, the system performs a zero-padding alignment operation depending on whether the offset and length of the starting position are aligned to the block size, and finally the aligned starting offset and length are taken as a data block, which is then placed into the binary file storage container.
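The computational addressing algorithm of claim 1 (striping into uniquely identified objects, hashing identifiers onto virtual nodes, and weight-based placement) can be sketched as follows. The function names, the SHA-256 hash, and the weight-only placement are illustrative assumptions; the claim's real-time compute and network criteria and the cluster state map are simplified to a static device-weight list.

```python
import hashlib

def stripe(file_id: str, data: bytes, stripe_size: int):
    """Split a file into numbered objects; each object gets a unique
    identifier derived from the file id and its stripe number."""
    return [(f"{file_id}.{i}", data[off:off + stripe_size])
            for i, off in enumerate(range(0, len(data), stripe_size))]

def virtual_node(object_id: str, vnode_count: int) -> int:
    """Hash the object identifier to spread objects evenly over
    virtual nodes."""
    digest = hashlib.sha256(object_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % vnode_count

def place(vnode: int, device_weights) -> int:
    """Map a virtual node to a device index in proportion to device
    weight (a stand-in for the weight/compute/network criteria)."""
    total = sum(device_weights)
    slot = vnode % total
    for dev, weight in enumerate(device_weights):
        if slot < weight:
            return dev
        slot -= weight
    raise AssertionError("unreachable: slot < total by construction")
```

Because placement is computed from the identifier alone, any node holding the same device-weight view can locate an object without consulting a central metadata server, which is the point of computational addressing.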
CN201711344457.4A 2017-12-14 2017-12-14 Distributed persistent memory storage system and method Active CN107967124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711344457.4A CN107967124B (en) 2017-12-14 2017-12-14 Distributed persistent memory storage system and method

Publications (2)

Publication Number Publication Date
CN107967124A CN107967124A (en) 2018-04-27
CN107967124B true CN107967124B (en) 2021-02-05

Family

ID=61995432

Country Status (1)

Country Link
CN (1) CN107967124B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012522321A (en) * 2009-03-30 2012-09-20 オラクル・アメリカ・インコーポレイテッド Data storage system and method for processing data access requests
CN104484130A (en) * 2014-12-04 2015-04-01 北京同有飞骥科技股份有限公司 Construction method of horizontal expansion storage system
CN105338118A (en) * 2015-11-30 2016-02-17 上海斐讯数据通信技术有限公司 Distributed storage system
WO2016053198A1 (en) * 2014-10-03 2016-04-07 Agency For Science, Technology And Research Distributed active hybrid storage system
CN106708425A (en) * 2015-11-13 2017-05-24 三星电子株式会社 Distributed multimode storage management
CN107239569A (en) * 2017-06-27 2017-10-10 郑州云海信息技术有限公司 A kind of distributed file system subtree storage method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant