CN112148225B - NVMe SSD-based block storage caching system and method thereof - Google Patents


Info

Publication number: CN112148225B
Application number: CN202011010192.6A
Authority: CN (China)
Original language: Chinese (zh)
Other versions: CN112148225A
Prior art keywords: ssd, data, block, module, blocks
Legal status: Active (granted; the status is an assumption, not a legal conclusion)
Inventor: 鲍苏宁
Assignee (current and original): Shanghai Eisoo Information Technology Co Ltd

Classifications

    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers (parent class)
    • G06F3/064 Management of blocks
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G06F3/0614 Improving the reliability of storage systems
    • G06F3/0656 Data buffering arrangements
    • G06F3/0676 Magnetic disk device
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to an NVMe SSD-based block storage caching system and a method thereof. The system comprises a cache pool and a block storage. The cache pool comprises a control module, a cache pool allocation module, an SSD block management module, an elimination module, a read-write module and an NVMe SSD cache module; the NVMe SSD cache module comprises a plurality of SSD blocks of identical capacity. The underlying physical space of the block storage is composed of mechanical hard disks, and a plurality of block storage data blocks in the block storage are logically integrated into corresponding LUNs, each LUN corresponding to its own SSD block management module and its own SSD block set. The cache pool allocation module is used for allocating SSD blocks to the LUNs; the SSD block management module is used for executing the LUNs' applications for SSD blocks and scheduling the SSD block sets corresponding to the LUNs; the elimination module is used for screening out SSD blocks whose read-write heat is below a preset threshold from a LUN's SSD block set and recycling them to the NVMe SSD cache module. Compared with the prior art, the invention can effectively improve the read-write performance of block storage while preserving low cost and large capacity.

Description

NVMe SSD-based block storage caching system and method thereof
Technical Field
The invention relates to the technical field of block storage caching, in particular to an NVMe SSD-based block storage caching system and a method thereof.
Background
With the rapid development of computer technology, most enterprises now run their core services on computers, so service data grows explosively. To keep service data traceable, many enterprises store it in a storage system, and block storage is a very widely used storage mode.
HDD mechanical hard disks have long been the most popular data storage medium because of their large capacity and low cost, and many block storage systems use mechanical hard disks as their data storage medium. Owing to the inherent limitations of mechanical hard disks, however, they can no longer meet the read-write performance requirements of the business. In the prior art, mechanical hard disks are replaced by all-flash systems to improve read-write performance, but all-flash systems are costly and limited in capacity. It is therefore necessary to design a cache system that combines the low cost and large capacity of mechanical hard disks with the high performance of flash memory, so as to improve the read-write performance of conventional block storage at lower cost.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an NVMe SSD-based block storage caching system and a method thereof, so as to improve the read-write performance of block storage while guaranteeing its capacity at lower cost.
The aim of the invention can be achieved by the following technical scheme: an NVMe SSD-based block storage caching system comprises a cache pool and a block storage. The cache pool comprises a control module, a cache pool allocation module, an SSD block management module, an elimination module, a read-write module and an NVMe SSD cache module. The control module is connected to the cache pool allocation module, the SSD block management module and the elimination module respectively; the SSD block management module is bidirectionally connected with the cache pool allocation module and the read-write module respectively; the cache pool allocation module is bidirectionally connected with the NVMe SSD cache module; and the NVMe SSD cache module is bidirectionally connected with the read-write module. The NVMe SSD cache module comprises a plurality of SSD blocks of consistent capacity. The underlying physical space of the block storage is composed of mechanical hard disks, and a plurality of block storage data blocks in the block storage are logically integrated into corresponding LUNs (Logical Unit Numbers) so as to provide a virtual disk function externally. The LUNs are bidirectionally connected with the read-write module, and each LUN corresponds to its own SSD block management module and its own SSD block set. The control module is used for creating the cache pool and initializing the NVMe SSD cache module to obtain the plurality of SSD blocks of consistent capacity;
The buffer pool allocation module is used for allocating SSD blocks in the NVMe SSD buffer module to LUNs;
the SSD block management module is used for executing the operation of applying for SSD blocks by the LUNs and scheduling SSD block sets corresponding to the LUNs;
the elimination module is used for executing an elimination algorithm so as to screen and recycle SSD blocks with the read-write heat lower than a preset threshold value in the SSD block set corresponding to the LUN to the NVMe SSD cache module;
the read-write module is used for executing the operation of reading or writing IO data.
Further, the elimination algorithm is specifically the ARC (Adaptive Replacement Cache) elimination algorithm.
Further, the NVMe SSD cache module is specifically a Raid1 disk array formed by a primary NVMe SSD and a backup NVMe SSD.
The block storage caching method based on the NVMe SSD comprises the following steps:
s1, constructing a Raid1 disk array based on a main NVMe SSD and a standby NVMe SSD to serve as a cache pool device;
s2, initializing a cache pool and establishing a global cache pool allocation module;
s3, logically integrating a plurality of block storage data blocks in the block storage to create a corresponding LUN, and establishing an SSD block management module corresponding to the LUN, wherein each LUN is distributed with a plurality of SSD blocks, the SSD blocks jointly form an SSD block set of the LUN, the LUN can be mapped out through iSCSI or FC and used as block equipment, and data reading and writing of each LUN are mutually independent;
s4, the read-write module receives an IO data writing or IO data reading instruction, if the received IO data writing instruction is the IO data writing instruction, the step S5 is executed, and if the received IO data reading instruction is the IO data reading instruction, the step S6 is executed;
s5, searching whether corresponding data fragments or data fragments which can be combined exist in an SSD block set of the LUN to which the IO data belong according to the offset and the length, if so, judging whether the IO data to be written is larger than a preset penetration threshold value, if so, updating the corresponding data fragments, marking the dirty bit of the data fragments as 1, and waiting for a background refreshing thread to refresh the data from a cache pool to block storage; if the data is larger than the penetration threshold, merging the IO data, writing the IO data into a block memory, and marking the dirty bit of the original merged data fragment as 0;
if the IO data length to be written is not found, judging whether the IO data length to be written is larger than a preset penetration threshold, if the IO data length is larger than the penetration threshold, directly writing the IO data length into a block storage, if the IO data length is smaller than the penetration threshold, applying a new SSD block to a cache pool allocation module by an SSD block management module, writing the IO data into the newly applied SSD block to form a data segment therein, and marking the dirty bit of the data segment as 1, wherein if no allocable SSD block exists in a current NVMe SSD cache module, the eliminating module eliminates and screens the SSD block which can be reallocated from an SSD block set of the LUN with the largest number of the SSD blocks currently occupied;
s6, firstly searching whether corresponding data fragments exist in an SSD block set of the LUN to which the IO data belong according to the offset and the length, and if so, reading corresponding data from the SSD block;
if the current NVMe SSD cache module does not have the allocable SSD blocks, the eliminating module eliminates and screens the SSD blocks which can be allocated again from the SSD block set of the LUN which currently occupies the SSD blocks in the maximum number.
Further, the specific process of initializing the buffer pool in step S2 is as follows:
s21, logically dividing the Raid1 disk array into a plurality of SSD blocks according to a preset capacity size, and writing mark information and detailed information of the SSD cache device in a first sector of a first SSD block, wherein the steps include: the total size, block size, and number of blocks of the Raid1 disk array;
s22, initializing a bitmap according to the size of the Raid1 disk array device from 4096 byte offset of the first SSD block, wherein each SSD block corresponds to one bit, setting the bit corresponding to the SSD block occupied by the bitmap space to be 1, and setting the bit of other unused SSD blocks to be 0.
Further, the preset capacity is specifically 1MB.
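The on-device layout of steps S21 and S22 (1MB SSD blocks, a superblock in the first sector, a bitmap starting at a 4096-byte offset) can be sketched as follows. This is a minimal illustrative model, not the patent's implementation; the function name, the `SSDCACHE` magic value and the little-endian field order are assumptions.

```python
import math
import struct

BLOCK_SIZE = 1 << 20          # 1 MB per SSD block (step S21)
MAGIC = b"SSDCACHE"           # assumed marker information in the first sector
BITMAP_OFFSET = 4096          # bitmap starts at a 4096-byte offset (step S22)

def init_cache_pool(device_bytes: bytearray) -> int:
    """Format a raw RAID1 device buffer as an NVMe SSD cache pool.

    Returns the number of SSD blocks reserved for metadata
    (superblock + bitmap), whose bitmap bits are set to 1."""
    total = len(device_bytes)
    nblocks = total // BLOCK_SIZE
    # Superblock in the first sector: magic, total size, block size, count.
    struct.pack_into("<8sQQQ", device_bytes, 0, MAGIC, total, BLOCK_SIZE, nblocks)
    # One bit per SSD block; the bitmap itself lives at the start of the device.
    bitmap_bytes = math.ceil(nblocks / 8)
    meta_blocks = math.ceil((BITMAP_OFFSET + bitmap_bytes) / BLOCK_SIZE)
    for blk in range(meta_blocks):    # blocks holding metadata count as used
        byte, bit = divmod(blk, 8)
        device_bytes[BITMAP_OFFSET + byte] |= 1 << bit
    return meta_blocks
```

For a small device (e.g. 64MB) the superblock and bitmap fit in the first 1MB block, so only that one block is marked used and the remaining 63 are free for allocation.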
Further, the background refresh thread in step S5 is specifically a plurality of data synchronization threads that synchronize cache pool data to the block storage, and it works as follows:
when data are written, the dirty bit of the corresponding data fragment is marked as 1; a suitable data synchronization thread is then selected according to the current load of each thread, and the relevant information of the SSD block containing the fragment is added to that thread's data synchronization queue; when the thread later processes the SSD block, it synchronizes the data of all fragments that need synchronizing into the block storage;
when the data in all the data fragments in the SSD block are synchronized into the block store, the dirty bit for each data fragment is marked as 0.
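The refresh mechanism above can be sketched as a pool of queue-driven worker threads: writers enqueue an SSD block on the least-loaded thread, and the thread flushes every dirty fragment to the block store and clears the dirty bits. The class and field names, and the dict-based model of SSD blocks and the block store, are illustrative assumptions.

```python
import queue
import threading

class FlushThread(threading.Thread):
    """One background data synchronization thread (step S5)."""
    def __init__(self, block_store: dict):
        super().__init__(daemon=True)
        self.sync_queue = queue.Queue()
        self.block_store = block_store

    def run(self):
        while True:
            ssd_block = self.sync_queue.get()
            if ssd_block is None:          # shutdown sentinel
                break
            for frag in ssd_block["fragments"]:
                if frag["dirty"]:          # synchronize only dirty fragments
                    self.block_store[frag["offset"]] = frag["data"]
                    frag["dirty"] = 0      # reset dirty bit after sync

def pick_thread(threads):
    """Choose the flush thread with the lightest current load."""
    return min(threads, key=lambda t: t.sync_queue.qsize())
```

In a real system the queue entries would carry block identifiers rather than the blocks themselves, and flushing would go through the read-write module to the HDD-backed storage pool.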
Further, the step S5 specifically includes the following steps:
s51, inquiring a data segment corresponding to IO data from an SSD block set corresponding to the LUN according to the offset and the length of the IO data to be written, if so, executing a step S52, and if not, executing a step S55;
s52, judging whether the IO data is larger than a penetration threshold value, if so, executing a step S53, otherwise, executing a step S54;
s53, reading a data segment cached in an SSD block, merging the data segment with current IO data, writing the merged IO data into a block storage, updating the cached data of the SSD block, and marking the dirty bit of the merged data segment cached in the SSD block as 0;
s54, directly updating data cached by the SSD block, marking the dirty bit of the corresponding data fragment in the SSD block as 1, and then adding the SSD block into a data synchronization queue of a synchronization thread;
s55, judging whether the IO data is larger than a penetration threshold value, if so, directly writing the IO data into a block memory, otherwise, executing step S56;
s56, the SSD block management module applies for new SSD blocks from the cache pool allocation module, the cache pool allocation module searches whether the allocatable SSD blocks exist in the NVMe SSD cache module, if not, the step S57 is executed, and if so, the step S58 is executed;
s57, eliminating and screening SSD blocks with read-write heat lower than a preset threshold value from an SSD block set of the LUN with the largest number of SSD blocks currently occupied by the eliminating module, recovering the SSD blocks by the cache pool allocation module, putting the eliminated and screened SSD blocks into the NVMe SSD cache module again, and then executing step S58;
s58, the buffer pool allocation module allocates a new SSD block to the LUN, writes IO data into the new SSD block, marks the dirty bit of the corresponding data segment as 1, and then adds the new SSD block into the data synchronization queue of the synchronization thread.
Further, the step S6 specifically includes the following steps:
s61, inquiring data fragments corresponding to the IO data from an SSD block set corresponding to the LUN according to the offset and the length of the IO data to be read, if so, executing a step S62, and if not, executing a step S65;
s62, judging whether the data cached in the SSD block can be filled with the current IO request, if so, directly reading the data from the SSD block, otherwise, executing the step S63;
s63, reading out uncached data in the SSD block from the block storage, updating the SSD block, and filling the read data into a memory of a read IO;
s64, updating the read-write heat of the SSD block;
s65, reading data from the block storage, filling the read data into a memory of a read IO, applying a new SSD block to a cache pool allocation module by an SSD block management module, caching the read data to the newly applied SSD block, updating an SSD block set of the LUN, and adjusting the read-write heat of the SSD block.
Further, in step S65, the SSD block management module applies to the cache pool allocation module for a new SSD block, which specifically comprises the following steps:
s651, the SSD block management module applies for new SSD blocks from the cache pool allocation module, the cache pool allocation module searches whether the allocable SSD blocks exist in the NVMe SSD cache module, if not, the step S652 is executed, and if so, the step S653 is executed;
s652, eliminating and screening SSD blocks with read-write heat lower than a preset threshold value from an SSD block set of the LUN with the largest number of SSD blocks currently occupied by the eliminating module, recycling the SSD blocks by the cache pool distribution module, putting the eliminated and screened SSD blocks into the NVMe SSD cache module again, and then executing step S653;
s653, the buffer pool allocation module allocates a new SSD block to the LUN.
Compared with the prior art, the invention has the following advantages:
1. The invention combines the low cost and large capacity of mechanical hard disks with the high performance of NVMe SSDs: mechanical hard disks form the underlying physical space of the block storage, a cache pool is built on NVMe SSDs, and the data blocks of the block storage are logically integrated into LUNs, so that the block storage can provide a virtual disk function externally while data reading and writing remain independent between LUNs. Write performance is improved by writing IO data that do not exceed the penetration threshold to the NVMe SSD, and read speed is improved by serving reads from data already cached in the NVMe SSD, thereby effectively improving the read-write performance of the block storage.
2. The invention adopts background refresh threads that flush data from the NVMe SSD cache pool into the block storage in real time. In addition, based on the elimination-algorithm mechanism, SSD blocks with lower read-write heat are screened out so that new SSD blocks can later be reassigned to LUNs that need them. This ensures the cache pool is continuously updated and adjusted to the caching needs of new data, making full and reasonable use of system resources and continuously optimizing the performance of the whole block storage caching system.
3. The invention sets a penetration threshold to pre-filter larger IO data during writes, preventing large IOs from quickly filling the cache pool and ensuring its normal and reliable operation.
Drawings
FIG. 1 is a schematic diagram of a system architecture of the present invention;
FIG. 2 is a schematic flow chart of the method of the present invention;
FIG. 3 is a flow chart of data writing in an embodiment;
FIG. 4 is a schematic flow chart of data reading in the embodiment;
The reference numerals denote: 1. cache pool; 11. control module; 12. cache pool allocation module; 13. SSD block management module; 14. elimination module; 15. read-write module; 16. NVMe SSD cache module; 2. block storage; 21. mechanical hard disk.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples.
Embodiment
As shown in fig. 1, the NVMe SSD-based block storage caching system is composed of a cache pool 1 and a block storage 2. The cache pool 1 includes a control module 11, a cache pool allocation module 12, an SSD block management module 13, an elimination module 14, a read-write module 15 and an NVMe SSD cache module 16; the NVMe SSD cache module 16 is specifically a Raid1 disk array composed of a primary NVMe SSD and a standby NVMe SSD.
The cache pool 1 exchanges data with the block storage 2 through the read-write module 15, and the underlying physical space of the block storage 2 is formed by mechanical hard disks 21 (HDD in fig. 1). Specifically, the control module 11 is connected to the cache pool allocation module 12, the SSD block management module 13 and the elimination module 14 respectively; the SSD block management module 13 is bidirectionally connected with the cache pool allocation module 12 and the read-write module 15 respectively; the cache pool allocation module 12 is bidirectionally connected with the NVMe SSD cache module 16; and the NVMe SSD cache module 16 is bidirectionally connected with the read-write module 15. The NVMe SSD cache module 16 comprises a plurality of SSD blocks of identical capacity. The block storage data blocks in the block storage 2 are logically integrated into corresponding LUNs, which provide a virtual disk function externally; the LUNs are bidirectionally connected with the read-write module 15, and each LUN corresponds to its own SSD block management module 13 and its own SSD block set. The control module 11 is used for creating the cache pool and initializing the NVMe SSD cache module 16 to obtain the plurality of SSD blocks of identical capacity;
the buffer pool allocation module 12 is configured to allocate SSD blocks in the NVMe SSD buffer module 16 to LUNs;
the SSD block management module 13 is used for executing the operation of applying for SSD blocks by the LUNs and scheduling SSD block sets corresponding to the LUNs;
the elimination module 14 is configured to execute an elimination algorithm so as to screen out SSD blocks whose read-write heat is below a preset threshold from a LUN's SSD block set and recycle them to the NVMe SSD cache module 16. In this embodiment, the elimination algorithm adopted by the elimination module 14 is the ARC (Adaptive Replacement Cache) elimination algorithm. ARC is a common page replacement algorithm that can manage cache blocks (i.e. the SSD blocks of the present invention) by recency or by frequency and automatically adjust its optimization strategy between the two; the present invention uses the read-write accesses of an SSD block as its heat when performing elimination screening;
the read-write module 15 is used for performing an operation of reading or writing IO data.
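The heat-threshold screening performed by the elimination module 14 can be sketched as below. This is deliberately NOT a full ARC implementation (ARC additionally maintains recency and frequency lists with ghost extensions); it shows only the threshold filter the text describes, and the function and field names are illustrative.

```python
def evict_cold_blocks(ssd_blocks, heat_threshold):
    """Split a LUN's SSD block set into kept and reclaimed blocks.

    Blocks whose read-write heat is below the threshold are returned
    to the NVMe SSD cache module for reallocation."""
    kept, reclaimed = [], []
    for blk in ssd_blocks:
        (reclaimed if blk["heat"] < heat_threshold else kept).append(blk)
    return kept, reclaimed
```

The reclaimed list is what the cache pool allocation module 12 puts back into the NVMe SSD cache module 16 for other LUNs to apply for.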
The specific working process of the system applied to practice is shown in fig. 2, and the system comprises the following steps:
s1, constructing a Raid1 disk array based on a main NVMe SSD and a standby NVMe SSD to serve as a cache pool device;
s2, initializing a cache pool and establishing a global cache pool allocation module;
s3, logically integrating a plurality of block storage data blocks in the block storage to create a corresponding LUN, and establishing an SSD block management module corresponding to the LUN, wherein each LUN is distributed with a plurality of SSD blocks, the SSD blocks jointly form an SSD block set of the LUN, the LUN can be mapped out through iSCSI or FC and used as block equipment, and data reading and writing of each LUN are mutually independent;
s4, the read-write module receives an IO data writing or IO data reading instruction, if the received IO data writing instruction is the IO data writing instruction, the step S5 is executed, and if the received IO data reading instruction is the IO data reading instruction, the step S6 is executed;
s5, searching whether corresponding data fragments or data fragments which can be combined exist in an SSD block set of the LUN to which the IO data belong according to the offset and the length, if so, judging whether the IO data to be written is larger than a preset penetration threshold value, if so, updating the corresponding data fragments, marking the dirty bit of the data fragments as 1, and waiting for a background refreshing thread to refresh the data from a cache pool to block storage; if the data is larger than the penetration threshold, merging the IO data, writing the IO data into a block memory, and marking the dirty bit of the original merged data fragment as 0;
if the IO data length to be written is not found, judging whether the IO data length to be written is larger than a preset penetration threshold, if the IO data length is larger than the penetration threshold, directly writing the IO data length into a block storage, if the IO data length is smaller than the penetration threshold, applying a new SSD block to a cache pool allocation module by an SSD block management module, writing the IO data into the newly applied SSD block to form a data segment therein, and marking the dirty bit of the data segment as 1, wherein if no allocable SSD block exists in a current NVMe SSD cache module, the eliminating module eliminates and screens the SSD block which can be reallocated from an SSD block set of the LUN with the largest number of the SSD blocks currently occupied;
s6, firstly searching whether corresponding data fragments exist in an SSD block set of the LUN to which the IO data belong according to the offset and the length, and if so, reading corresponding data from the SSD block;
if the current NVMe SSD cache module does not have the allocable SSD blocks, the eliminating module eliminates and screens the SSD blocks which can be allocated again from the SSD block set of the LUN which currently occupies the SSD blocks in the maximum number.
In application, two NVMe SSDs are used as primary and standby respectively to form the RAID1 device /dev/dm0. /dev/dm0 is then logically divided into a plurality of blocks of 1MB, and marker information and detailed information of the SSD cache device are written into the first sector of the first block, including the total size, block size and number of blocks of the device. Starting from a 4096-byte offset in the first block, a bitmap is initialized according to the size of the /dev/dm0 device; the bits corresponding to the blocks occupied by the bitmap space are set to 1 and the bits of the other, unused blocks are set to 0, which completes the initialization of the cache pool.
In the execution of the read-write workflow, as shown in fig. 3 and 4, the embodiment mainly includes the following steps:
1. NVMe SSD cache pool creation phase: two NVMe SSD disks on a storage node are configured to form the RAID1 device /dev/dm0. After the cache device is selected, /dev/dm0 is initialized and the corresponding information is filled into the SSDCacheManager (i.e. the cache pool allocation module 12) for global cache block management; once initialized, the cache pool is ready for use.
2. Creating a LUN stage: the block storage back end creates a virtual LUN, the LUN can be mapped out through iSCSI or FC and used as block equipment, and the data reading and writing of each LUN are mutually independent. At the same time as the block storage backend creates a LUN, a corresponding SSD block management module 13 is created.
3. Write data phase (as shown in fig. 3): when IO data are written, the SSD block set of the LUN to which the IO belongs is first searched, according to the offset and length, for a corresponding data fragment or one that can be merged. If such a fragment exists, the corresponding data fragment is updated, its dirty bit is marked as 1, and the background flush thread later flushes it from the cache pool to the storage pool of the block storage. If no fragment is found, the IO length is compared with the set penetration threshold: if larger, the data are written directly into the storage pool of the block storage; if smaller, a new SSD block is applied for from the SSDCacheManager, the IO data are written into it to form a data fragment whose dirty bit is marked as 1, and the corresponding cache block information is placed in the synchronization queue of a data synchronization thread.
4. Read data phase (as shown in fig. 4): when data is read from the block storage system, the SSD block set of the LUN to which the IO belongs is first searched, according to the offset and length, for a corresponding data segment. If found, the data is read from the SSD block; if not, the data is read from the storage pool of the block storage, a new SSD block is requested from the SSDCacheManager, and the read data is cached into that SSD block.
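The write-path branching of step 3 can be condensed into a small sketch. All class and function names here are illustrative, and the penetration threshold value is an assumption (the patent only says it is preset); only the hit/merge, pass-through, and cache-fill branches mirror the description above.

```python
PENETRATION_THRESHOLD = 512 * 1024  # assumed value; the patent leaves it configurable

class Segment:
    """A contiguous cached data area within a 1 MB SSD block (illustrative)."""
    def __init__(self, offset, length):
        self.offset, self.length, self.dirty = offset, length, False

    def overlaps_or_adjacent(self, offset, length):
        return offset <= self.offset + self.length and self.offset <= offset + length

def write_io(segments, offset, length, sync_queue):
    """Return how the write was handled and update the segment list."""
    for seg in segments:
        if seg.overlaps_or_adjacent(offset, length):
            # hit: merge into the existing fragment and mark it dirty
            end = max(seg.offset + seg.length, offset + length)
            seg.offset = min(seg.offset, offset)
            seg.length = end - seg.offset
            seg.dirty = True
            return "cache-hit"
    if length > PENETRATION_THRESHOLD:
        return "pass-through"          # large IO goes straight to the storage pool
    seg = Segment(offset, length)      # new SSD block / fragment from the cache pool
    seg.dirty = True
    segments.append(seg)
    sync_queue.append(seg)             # queued for a background data-sync thread
    return "cache-fill"
```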
In the above steps, a data segment is a contiguous data area within an SSD block: SSD blocks are allocated in units of 1 MB, but the length of a write IO may be less than 1 MB, which is how data segments arise. Merging data segments means that if two data segments within the same SSD block are contiguous or overlap, the two segments can be merged into a single new data segment.
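The merge rule just defined can be expressed as a short helper over (offset, length) pairs within one SSD block; the function name is hypothetical.

```python
def merge_fragments(a, b):
    """Merge two fragments if contiguous or overlapping, else return None.

    Each fragment is an (offset, length) pair within the same SSD block.
    """
    (ao, al), (bo, bl) = a, b
    if ao > bo:                      # order so that fragment a starts first
        (ao, al), (bo, bl) = (bo, bl), (ao, al)
    if bo > ao + al:                 # a gap between them: cannot merge
        return None
    end = max(ao + al, bo + bl)
    return (ao, end - ao)
```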
The dirty bit is a flag that marks whether new data has been written into a data segment of an SSD block. If new data is written into the SSD block, the segment's dirty bit is marked as 1, indicating that the data in the segment needs to be synchronized into the storage pool of the block storage; after synchronization completes, the dirty bit is reset to 0.
It should be noted that: 1. Because the SSD cache pool is much smaller than the storage pool of the block storage, a data synchronization policy is needed to synchronize data from the SSD cache pool into the storage pool. The SSD cache pool therefore adopts a real-time synchronization policy, synchronizing its data into the block storage pool in real time so as to make full use of both the SSD cache pool and the block storage pool. Specifically, when data is written, the background flushing of cache pool data into the storage pool of the block storage mainly involves the following processes:
1) After the NVMe SSD cache pool starts working, several data synchronization threads are started, waiting for data to synchronize.
2) When data is written, the dirty bit of the corresponding data segment is marked as 1. A suitable data synchronization thread is then selected according to the current load of each thread, and the relevant information of the SSD block containing the data segment is added to that thread's data synchronization queue. When the thread later processes the SSD block, it synchronizes the data of all segments needing synchronization into the storage pool of the block storage.
3) Once the data of all data segments in an SSD block has been synchronized into the storage pool of the block storage, the dirty bit of each segment is marked as 0.
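The three synchronization steps above can be sketched as a pool whose queues stand in for the data synchronization threads. "Load" is modeled here simply as queue length, which is an assumption — the patent does not define the load metric — and the dict-based data model is illustrative.

```python
from collections import deque

class SyncThreadPool:
    """Stand-in for the data synchronization threads of steps 1)-3)."""
    def __init__(self, n_threads):
        self.queues = [deque() for _ in range(n_threads)]

    def submit(self, ssd_block):
        """Enqueue an SSD block on the least-loaded sync queue (step 2)."""
        i = min(range(len(self.queues)), key=lambda j: len(self.queues[j]))
        self.queues[i].append(ssd_block)
        return i

    def drain(self, i, storage_pool):
        """Simulate one sync thread: flush dirty segments, then clear dirty bits (step 3)."""
        while self.queues[i]:
            block = self.queues[i].popleft()
            for seg in block["segments"]:
                if seg["dirty"]:
                    storage_pool.append((seg["offset"], seg["length"]))
                    seg["dirty"] = False
```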
2. When the request for a new SSD block fails, a corresponding eviction mechanism is needed to evict infrequently accessed SSD blocks from the SSD cache pool into the storage pool of the block storage, making room to cache new data. The invention therefore lets each LUN independently manage the adjustment and eviction of its SSD blocks using the ARC eviction algorithm, evicting infrequently accessed SSD blocks according to the read-write heat of each SSD block in the SSD cache pool. ARC (Adaptive Replacement Cache) is a common page replacement algorithm that can manage cache blocks by recency, manage them by frequency, and automatically adjust its strategy between the two. The invention uses read-write accesses to an SSD cache pool block as its heat and evicts blocks accordingly. Concretely, the SSDCacheManager is responsible for the allocation and reclamation of all SSD blocks in the SSD cache pool; after a LUN applies for an SSD block, the block is added to that LUN's SSD block set, and subsequent adjustment and eviction are scheduled by the LUN.
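A full ARC implementation is lengthy; the sketch below is a deliberately simplified stand-in that captures only the eviction behavior stated above: when allocation fails, take the LUN holding the most SSD blocks (as in steps S57/S652 of the claims) and evict its block with the lowest read-write heat. The dict-based data model is illustrative, not the patent's.

```python
def pick_victim(block_set):
    """Return the SSD block with the lowest read-write heat."""
    return min(block_set, key=lambda b: b["heat"])

def evict_one(luns):
    """On allocation failure, reclaim one block for reallocation.

    The block comes from the LUN currently occupying the largest number
    of SSD blocks, and is the least-hot block in that LUN's block set.
    """
    lun = max(luns, key=lambda l: len(l["blocks"]))
    victim = pick_victim(lun["blocks"])
    lun["blocks"].remove(victim)
    return victim
```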
3. Because the performance gap between an HDD mechanical hard disk and an NVMe SSD is small when writing large IOs, a penetration threshold must be set to filter out large IOs and prevent them from quickly filling the SSD cache pool.
In summary, the invention combines the low cost and large capacity of HDD mechanical hard disks with the high performance of NVMe SSDs to design a caching system suited to block storage. IOs that do not exceed the penetration threshold are written into the NVMe SSD to improve write performance, while the data already written into the NVMe SSD is used to accelerate read IOs, effectively improving the read and write performance of the block storage. Meanwhile, by flushing data in the NVMe SSD cache pool to the storage pool of the block storage in real time and combining this with the ARC eviction mechanism, system resources are used fully and reasonably, and sustained performance improvement is ensured.

Claims (9)

1. An NVMe SSD-based block storage caching method, applied to a block storage caching system, characterized in that the block storage caching system comprises a cache pool (1) and a block storage (2); the cache pool (1) comprises a control module (11), a cache pool allocation module (12), an SSD block management module (13), an elimination module (14), a read-write module (15) and an NVMe SSD cache module (16); the control module (11) is respectively connected to the cache pool allocation module (12), the SSD block management module (13) and the elimination module (14); the SSD block management module (13) is respectively connected with the cache pool allocation module (12) and the read-write module (15) in a bidirectional manner; the cache pool allocation module (12) is connected with the NVMe SSD cache module (16) in a bidirectional manner; the NVMe SSD cache module (16) comprises a plurality of SSD blocks of equal capacity; the underlying physical space of the block storage (2) is composed of mechanical hard disks (21); the read-write module (15) is connected with the block storage (2) in a bidirectional manner, the data blocks of the block storage (2) being logically integrated into LUNs, each LUN corresponding to an SSD block management module (13); the control module (11) is used for creating the cache pool (1) and initializing the NVMe SSD cache module (16) to obtain a plurality of SSD blocks of consistent capacity;
the buffer pool allocation module (12) is used for allocating SSD blocks in the NVMe SSD buffer module (16) to LUNs;
the SSD block management module (13) is used for executing the operation of applying for SSD blocks by the LUNs and scheduling SSD block sets corresponding to the LUNs;
the elimination module (14) is used for executing an elimination algorithm so as to screen and recycle SSD blocks with the read-write heat lower than a preset threshold value in the SSD block set corresponding to the LUN to the NVMe SSD cache module (16);
the read-write module (15) is used for executing the operation of reading or writing IO data;
the block storage caching method comprises the following steps:
s1, constructing a Raid1 disk array based on a main NVMe SSD and a standby NVMe SSD to serve as a cache pool device;
s2, initializing a cache pool and establishing a global cache pool allocation module;
s3, logically integrating a plurality of block storage data blocks in the block storage to create a corresponding LUN, and establishing an SSD block management module corresponding to the LUN, wherein each LUN is distributed with a plurality of SSD blocks, the SSD blocks jointly form an SSD block set of the LUN, the LUN can be mapped out through iSCSI or FC and used as block equipment, and data reading and writing of each LUN are mutually independent;
s4, the read-write module receives an IO data writing or IO data reading instruction, if the received IO data writing instruction is the IO data writing instruction, the step S5 is executed, and if the received IO data reading instruction is the IO data reading instruction, the step S6 is executed;
s5, searching whether corresponding data fragments or data fragments which can be combined exist in an SSD block set of the LUN to which the IO data belong according to the offset and the length, if so, judging whether the IO data to be written is larger than a preset penetration threshold value, if so, updating the corresponding data fragments, marking the dirty bit of the data fragments as 1, and waiting for a background refreshing thread to refresh the data from a cache pool to block storage; if the data is larger than the penetration threshold, merging the IO data, writing the IO data into a block memory, and marking the dirty bit of the original merged data fragment as 0;
if no corresponding data segment is found, judging whether the length of the IO data to be written is larger than the preset penetration threshold; if larger than the penetration threshold, writing the IO data directly into the block storage; if smaller than the penetration threshold, the SSD block management module applies to the cache pool allocation module for a new SSD block, the IO data is written into the newly applied SSD block to form a data segment therein, and the dirty bit of the data segment is marked as 1; wherein, if no allocatable SSD block exists in the current NVMe SSD cache module, the elimination module screens out a reallocatable SSD block by eviction from the SSD block set of the LUN currently occupying the largest number of SSD blocks;
s6, firstly searching whether corresponding data fragments exist in an SSD block set of the LUN to which the IO data belong according to the offset and the length, and if so, reading corresponding data from the SSD block;
if the current NVMe SSD cache module does not have the allocable SSD blocks, the eliminating module eliminates and screens the SSD blocks which can be allocated again from the SSD block set of the LUN which currently occupies the SSD blocks in the maximum number.
2. The block storage caching method based on NVMe SSD of claim 1, wherein the elimination algorithm is specifically the ARC (Adaptive Replacement Cache) eviction algorithm.
3. The block storage caching method based on the NVMe SSD according to claim 1, wherein the NVMe SSD caching module (16) is specifically a Raid1 disk array formed by a primary NVMe SSD and a backup NVMe SSD.
4. The block storage caching method based on NVMe SSD of claim 1, wherein the specific process of initializing the cache pool in step S2 is as follows:
s21, logically dividing the Raid1 disk array into a plurality of SSD blocks according to a preset capacity size, and writing mark information and detailed information of the SSD cache device in a first sector of a first SSD block, wherein the steps include: the total size, block size, and number of blocks of the Raid1 disk array;
s22, initializing a bitmap according to the size of the Raid1 disk array device from 4096 byte offset of the first SSD block, wherein each SSD block corresponds to one bit, setting the bit corresponding to the SSD block occupied by the bitmap space to be 1, and setting the bit of other unused SSD blocks to be 0.
5. The block storage caching method based on NVMe SSD of claim 4, wherein the preset capacity is specifically 1MB.
6. The block storage caching method based on NVMe SSD of claim 1, wherein the background flush thread in step S5 specifically comprises a plurality of data synchronization threads for synchronizing cache pool data to the block storage, and the background flush thread works as follows:
when data is written, the dirty bit of the corresponding data segment is marked as 1, then a proper data synchronization thread is selected according to the current load condition of each data synchronization thread, relevant information of an SSD block containing the data segment is added into a data synchronization queue of the data synchronization thread, and when the subsequent data synchronization thread processes the SSD block, the data of all the data segments needing to be synchronized are synchronized into a block storage;
when the data in all the data fragments in the SSD block are synchronized into the block store, the dirty bit for each data fragment is marked as 0.
7. The block storage caching method based on NVMe SSD of claim 1, wherein the step S5 specifically includes the steps of:
s51, inquiring a data segment corresponding to IO data from an SSD block set corresponding to the LUN according to the offset and the length of the IO data to be written, if so, executing a step S52, and if not, executing a step S55;
s52, judging whether the IO data is larger than a penetration threshold value, if so, executing a step S53, otherwise, executing a step S54;
s53, reading a data segment cached in an SSD block, merging the data segment with current IO data, writing the merged IO data into a block storage, updating the cached data of the SSD block, and marking the dirty bit of the merged data segment cached in the SSD block as 0;
s54, directly updating data cached by the SSD block, marking the dirty bit of the corresponding data fragment in the SSD block as 1, and then adding the SSD block into a data synchronization queue of a synchronization thread;
s55, judging whether the IO data is larger than a penetration threshold value, if so, directly writing the IO data into a block memory, otherwise, executing step S56;
s56, the SSD block management module applies for new SSD blocks from the cache pool allocation module, the cache pool allocation module searches whether the allocatable SSD blocks exist in the NVMe SSD cache module, if not, the step S57 is executed, and if so, the step S58 is executed;
s57, eliminating and screening SSD blocks with read-write heat lower than a preset threshold value from an SSD block set of the LUN with the largest number of SSD blocks currently occupied by the eliminating module, recovering the SSD blocks by the cache pool allocation module, putting the eliminated and screened SSD blocks into the NVMe SSD cache module again, and then executing step S58;
s58, the buffer pool allocation module allocates a new SSD block to the LUN, writes IO data into the new SSD block, marks the dirty bit of the corresponding data segment as 1, and then adds the new SSD block into the data synchronization queue of the synchronization thread.
8. The block storage caching method based on NVMe SSD of claim 1, wherein the step S6 specifically includes the steps of:
s61, inquiring data fragments corresponding to the IO data from an SSD block set corresponding to the LUN according to the offset and the length of the IO data to be read, if so, executing a step S62, and if not, executing a step S65;
s62, judging whether the data cached in the SSD block can be filled with the current IO request, if so, directly reading the data from the SSD block, otherwise, executing the step S63;
s63, reading out uncached data in the SSD block from the block storage, updating the SSD block, and filling the read data into a memory of a read IO;
s64, updating the read-write heat of the SSD block;
s65, reading data from the block storage, filling the read data into a memory of a read IO, applying a new SSD block to a cache pool allocation module by an SSD block management module, caching the read data to the newly applied SSD block, updating an SSD block set of the LUN, and adjusting the read-write heat of the SSD block.
9. The block storage caching method based on NVMe SSD of claim 8, wherein the SSD block management module applies for a new SSD block to the cache pool allocation module in step S65, specifically comprising the steps of:
s651, the SSD block management module applies for new SSD blocks from the cache pool allocation module, the cache pool allocation module searches whether the allocable SSD blocks exist in the NVMe SSD cache module, if not, the step S652 is executed, and if so, the step S653 is executed;
s652, eliminating and screening SSD blocks with read-write heat lower than a preset threshold value from an SSD block set of the LUN with the largest number of SSD blocks currently occupied by the eliminating module, recycling the SSD blocks by the cache pool distribution module, putting the eliminated and screened SSD blocks into the NVMe SSD cache module again, and then executing step S653;
s653, the buffer pool allocation module allocates a new SSD block to the LUN.
CN202011010192.6A 2020-09-23 2020-09-23 NVMe SSD-based block storage caching system and method thereof Active CN112148225B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011010192.6A CN112148225B (en) 2020-09-23 2020-09-23 NVMe SSD-based block storage caching system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011010192.6A CN112148225B (en) 2020-09-23 2020-09-23 NVMe SSD-based block storage caching system and method thereof

Publications (2)

Publication Number Publication Date
CN112148225A CN112148225A (en) 2020-12-29
CN112148225B true CN112148225B (en) 2023-04-25

Family

ID=73896166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011010192.6A Active CN112148225B (en) 2020-09-23 2020-09-23 NVMe SSD-based block storage caching system and method thereof

Country Status (1)

Country Link
CN (1) CN112148225B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559138A (en) * 2013-10-09 2014-02-05 华为技术有限公司 Solid state disk (SSD) and space management method thereof
CN103631536A (en) * 2013-11-26 2014-03-12 华中科技大学 Method for optimizing RAID5/6 writing performance by means of invalid data of SSD

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9009391B2 (en) * 2010-10-25 2015-04-14 Fastor Systems, Inc. Solid state drive architecture
CN103038755B (en) * 2011-08-04 2015-11-25 华为技术有限公司 Method, the Apparatus and system of data buffer storage in multi-node system
US20150095555A1 (en) * 2013-09-27 2015-04-02 Avalanche Technology, Inc. Method of thin provisioning in a solid state disk array

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559138A (en) * 2013-10-09 2014-02-05 华为技术有限公司 Solid state disk (SSD) and space management method thereof
CN103631536A (en) * 2013-11-26 2014-03-12 华中科技大学 Method for optimizing RAID5/6 writing performance by means of invalid data of SSD

Also Published As

Publication number Publication date
CN112148225A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
US20230066084A1 (en) Distributed storage system
JP5593577B2 (en) Storage system and control method thereof
TWI710900B (en) Storage device and method
US10203876B2 (en) Storage medium apparatus, method, and program for storing non-contiguous regions
US6988165B2 (en) System and method for intelligent write management of disk pages in cache checkpoint operations
JP2783748B2 (en) Method and apparatus for data transfer to auxiliary storage in a dynamically mapped data storage system
US10884630B2 (en) Storage system
WO2017000658A1 (en) Storage system, storage management device, storage device, hybrid storage device, and storage management method
EP2685384B1 (en) Elastic cache of redundant cache data
US8615640B2 (en) System and method to efficiently schedule and/or commit write data to flash based SSDs attached to an array controller
JP5583227B1 (en) Disk array device, disk array controller and method for copying data between physical blocks
US20060010290A1 (en) Logical disk management method and apparatus
US20160253123A1 (en) NVMM: An Extremely Large, Logically Unified, Sequentially Consistent Main-Memory System
US20020144076A1 (en) Information processing system
JP2008015769A (en) Storage system and writing distribution method
JP2007041904A (en) Storage device, disk cache control method and capacity allocating method of disk cache
US7085907B2 (en) Dynamic reconfiguration of memory in a multi-cluster storage control unit
CN105988727B (en) Storage method and storage device based on RAID
WO2014142337A1 (en) Storage device and method, and program
CN114600074A (en) Construction of block device
CN117453152B (en) ZNS solid state disk Zone LBA management method and algorithm of block management command
CN112148225B (en) NVMe SSD-based block storage caching system and method thereof
US20210117320A1 (en) Construction of a block device
US10713163B2 (en) Set aware system data and mapping tables
JP2015052853A (en) Storage controller, storage control method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant