WO2016032955A2 - Nvram enabled storage systems - Google Patents

Nvram enabled storage systems

Info

Publication number
WO2016032955A2
Authority
WO
WIPO (PCT)
Prior art keywords
nvram
storage
data
persistent
raid
Prior art date
Application number
PCT/US2015/046534
Other languages
French (fr)
Other versions
WO2016032955A3 (en)
Inventor
Bruce Eric MANN
Matthew Edward Cross
Arthur James BEAVERSON
Bang Chang
Original Assignee
Cacheio Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cacheio Llc filed Critical Cacheio Llc
Publication of WO2016032955A2 publication Critical patent/WO2016032955A2/en
Publication of WO2016032955A3 publication Critical patent/WO2016032955A3/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076 Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C8/00 Arrangements for selecting an address in a digital store
    • G11C8/06 Address interface arrangements, e.g. address buffers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1004 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1441 Resetting or repowering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Computer Security & Cryptography (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

A method of preventing data loss is disclosed that includes writing data first to an NVRAM device, then to a storage device having a volatile memory and a persistent medium, and checking whether the data is on the persistent medium before deleting it from the NVRAM. A method of reducing read-modify-writes and write amplification is also disclosed that includes writing data first to an NVRAM device, accumulating the data into full RAID stripes, and writing the full stripes to a RAID system and to each storage device in contiguous, sequential chunks.

Description

NVRAM ENABLED STORAGE SYSTEMS
PRIORITY CLAIM
The present application claims priority to US Provisional Application No. 62/041,318, filed on August 25, 2014.
TECHNICAL FIELD
The present disclosure relates generally to storage systems, and, more specifically, to non-volatile random access memory (NVRAM) enabled storage systems.
BACKGROUND
Solid-state devices (SSDs), including solid-state drives and flash memory cards, are becoming increasingly popular for deployment in Redundant Array of Independent Disks (RAID) systems to support transactional applications such as databases. The RAID system may take the form of a hardware RAID card, RAID on a Chip, software RAID, Erasure Coding, or JBOD (Just a Bunch of Disks). Transactional applications typically issue read and write requests (I/O requests) that have small transfer sizes and are not in sequential block address order (collectively referred to as "random" I/O requests). SSDs typically service random read requests many times faster than traditional hard disk drives (HDDs). However, when SSDs are deployed in traditional RAID-5 and RAID-6 configurations, the random write performance is limited by the need to perform read-modify-write operations. The RAID system handles each random write request by reading existing data and parity, calculating a new parity, and then writing the new data and the new parity. These read-modify-write operations significantly reduce write performance and SSD endurance.
SSD write amplification also reduces write performance and SSD endurance. An SSD comprises a plurality of flash pages. An entire flash page must be "erased" before it can be rewritten, and there is a write cycle limit on how many times a flash page can be erased and rewritten. When a transactional application writes to an SSD in a RAID system, its write request size will likely be much smaller than the SSD's flash page size, resulting in partially written flash pages. Consequently, the SSD has to perform garbage collection by moving user data from one partially written flash page to another until an entire flash page contains no more user data and can be erased.
Garbage collection turns each application write into multiple SSD writes, also known as write amplification. Given the write cycle limit on each flash page, write amplification significantly reduces SSD endurance and write performance. When an application or a storage system writes to an SSD in multiple small transfer sizes in sequential block address order (sequential writes), the SSD typically can fill entire flash pages with fewer partially written pages, reducing the write amplification during garbage collection.
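As a rough illustration of this effect, the short calculation below applies a simple greedy garbage-collection model; the write size and valid-data fraction are assumed example values, not figures from this disclosure.

```python
# Illustrative arithmetic only: the numbers below are assumptions for the sake
# of the example, not values taken from the disclosure.
app_write_kib = 4        # assumed transactional write size
valid_fraction = 0.75    # assumed share of still-valid data in each page picked for GC

# In a simple greedy-GC model, each erased page frees (1 - valid_fraction) of its
# capacity for new data while valid_fraction of it must be copied elsewhere, so
# write amplification ~ total flash writes / user writes = 1 / (1 - valid_fraction).
write_amplification = 1 / (1 - valid_fraction)
relocated_kib = app_write_kib * valid_fraction / (1 - valid_fraction)

print(f"write amplification ~{write_amplification:.1f}x")                      # ~4.0x
print(f"~{relocated_kib:.0f} KiB of valid data relocated per {app_write_kib} KiB user write")
```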
An SSD typically comprises a persistent flash medium for storing data and a volatile memory to hold data temporarily before the data is committed to the persistent flash medium. In the event of a power failure, the data stored in the volatile memory will be lost. To prevent data loss during power failure, some of the SSDs are equipped with a capacitor or battery, which provides enough power for flushing the data stored in the volatile memory to the persistent flash medium. However, the additional capacitor can significantly increase the cost of the SSDs.
There is a need to design a storage system that reduces read-modify-write operations, minimizes SSD write amplification, and prevents data loss without the expensive capacitors.
SUMMARY
Methods and apparatus for improving data storage systems are disclosed.
In some embodiments, a storage system is configured to prevent data loss in the event of a power failure. The storage system comprises a processor, one or more storage devices, and a non-volatile random-access memory (NVRAM) device. Each of the one or more storage devices comprises a persistent medium. The NVRAM device is configured to store one or more data blocks to be sent to a storage device for persistent storage. The processor is configured to check whether a data block stored on the NVRAM is also stored on the storage device's persistent medium before deleting the data block from the NVRAM.
In some embodiments, a storage system is configured to reduce read-modify-write operations and write amplification. The storage system comprises a processor, a RAID system with one or more storage devices, an NVRAM device, and a memory. The NVRAM device stores one or more data blocks that are to be sent to the RAID system for persistent storage. The memory stores metadata that maps every data block's logical block address (LBA) to its physical block address (PBA). The processor is configured to handle random write requests from an application. When handling random write requests, the processor first stores a data block in the NVRAM and then acknowledges to the application that the write request has been committed. Only after one or more data blocks in the NVRAM have been accumulated into a full RAID stripe is the full RAID stripe written to the RAID system, which reduces read-modify-write operations. After the full RAID stripe has been written to the RAID system, the metadata is updated to map the LBAs of the one or more data blocks to their PBAs. The one or more data blocks are deleted from the NVRAM after the metadata has been updated.
Of course, the present invention is not limited to the features, advantages, and contexts summarized above, and those familiar with storage technologies will recognize additional features and advantages upon reading the following detailed description and upon viewing the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates a block diagram of a storage system with NVRAM devices.
Figure 2 illustrates a block diagram of a storage device with volatile memory and persistent medium.
Figure 3 illustrates the deferred write process at the NVRAM.
Figure 4 illustrates the deferred write process at the storage device.
Figure 5 illustrates a flow diagram of the check block persistent process.
Figure 6 illustrates a flow diagram of the make block persistent process.
Figure 7 illustrates a block diagram of a storage system with a RAID system.
Figure 8 illustrates a block diagram of the RAID system data layout.
Figure 9 illustrates a flow diagram of NVRAM enabled writes to a RAID system.
Figure 10 illustrates a flow diagram of NVRAM enabled metadata updates.
DETAILED DESCRIPTION
A storage system is disclosed that has at least one NVRAM device to accomplish (1) preventing data loss in the event of a power failure; (2) reducing read-modify-write operations; and (3) reducing solid-state device write amplification.
Fig. 1 illustrates one embodiment of a storage system 100 that includes a processor 110 and one or more storage devices 120. Examples of storage devices include solid-state devices (SSDs), hard disk drives (HDDs), and combinations of SSDs and HDDs (hybrid). The storage system 100 provides persistent storage to one or more user applications 140. In some embodiments, there may be multiple storage systems 100 implemented as a cluster for redundancy and performance. In some embodiments the storage device 120 may be accessible by multiple storage systems 100 as a shared storage device. In some embodiments the application 140 and the storage system 100 may be running on the same physical system. In other embodiments the application 140 may access the storage system through a storage network such as Fibre Channel, Ethernet, InfiniBand, or PCIe.
The processor 110 interfaces between the application 140 and the storage device 120. The processor 110 controls and manages the storage device 120. For example, the processor 110 may provide a set of commands for the application 140 to read from and write to the storage device 120. The processor 110 can also provide redundancy, performance, and data services that often cannot be achieved by the storage device 120 alone.
In the present disclosure the storage system 100 includes one or more non-volatile random-access memory (NVRAM) devices 130. Examples of NVRAM include battery-backed DRAM, NVDIMM, PCIe NVRAM cards, and solid-state devices. In some embodiments, upon receiving a write request from the application 140, the processor 110 stores the write data in the NVRAM 130 and acknowledges to the application 140 that the write request is successful before the data is actually committed to the storage device 120. This process is known as a deferred write.
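The deferred write just described can be pictured with the minimal sketch below. The class and method names (NvramLog, StorageDevice, Processor, handle_write) are illustrative stand-ins for NVRAM 130, storage device 120, and processor 110; this is a toy model under those assumptions, not the disclosed implementation.

```python
# Minimal sketch of the deferred write path (illustrative names, not from the disclosure).

class NvramLog:
    """Stand-in for NVRAM device 130: contents are assumed to survive power loss."""
    def __init__(self):
        self.blocks = {}                       # lba -> data

    def put(self, lba, data):
        self.blocks[lba] = data

    def delete(self, lba):
        self.blocks.pop(lba, None)


class StorageDevice:
    """Stand-in for storage device 120 with volatile memory 174 and persistent medium 172."""
    def __init__(self):
        self.volatile = {}                     # lost on power failure
        self.persistent = {}                   # survives power failure

    def write(self, lba, data):
        self.volatile[lba] = data              # Step 6: buffered, then acknowledged (Step 7)

    def background_flush(self):
        self.persistent.update(self.volatile)  # Step 8: committed at the device's leisure
        self.volatile.clear()


class Processor:
    """Stand-in for processor 110."""
    def __init__(self, nvram, device):
        self.nvram, self.device = nvram, device

    def handle_write(self, lba, data):
        self.nvram.put(lba, data)              # Step 2: commit to NVRAM
        return "ack"                           # Step 3: acknowledge the application

    def deferred_write(self):
        for lba, data in list(self.nvram.blocks.items()):
            self.device.write(lba, data)       # Step 4: deferred write to the device
        # Step 5 (reclaiming NVRAM space) is deliberately omitted here; done naively
        # it risks data loss, which the flows of Figs. 5 and 6 address.


proc = Processor(NvramLog(), StorageDevice())
proc.handle_write(lba=42, data=b"hello")
proc.deferred_write()
```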
Fig. 2 illustrates one embodiment of a storage device 120, such as a solid-state device (SSD), that comprises a persistent medium 172 for storing data and a volatile memory 174 for buffering data temporarily before the data is committed to the persistent medium 172. In the event of a power failure, the data stored in the volatile memory 174 will be lost. Traditionally, to prevent data loss during power failure, an SSD is equipped with a capacitor or battery, which provides enough power to write all the data in the volatile memory 174 to the persistent medium 172. However, the capacitor or battery can significantly increase the cost of the SSD.
The present disclosure provides methods for preventing data loss during power failure without the additional capacitor or battery in the SSD.
Fig. 3 illustrates the deferred write process at the NVRAM 130:
Step 1: The processor 110 receives a write request from the application 140;
Step 2: The processor 110 commits the write data to the NVRAM 130;
Step 3: The processor 110 acknowledges to the application 140 that the write is successful;
Step 4: At a later time the processor 110 writes the data in the NVRAM 130 to the storage device 120 (deferred write);
Step 5: At a later time the processor 110 deletes the data from the NVRAM 130 so the NVRAM space can be reused.
Fig. 4 illustrates the deferred write process at the storage device 120:
Step 4': The storage device 120 receives a write request from the processor;
Step 6: The storage device 120 stores the write data in its volatile memory 174;
Step 7: The storage device 120 acknowledges to the processor 110 that the write is successful;
Step 8: At a later time the storage device 120 writes the data in the volatile memory 174 to its persistent medium 172.
If a power failure takes place after step 5 but before step 8, the write data will be lost. In order to prevent data loss the present disclosure replaces step 5 with the following steps as illustrated in Fig. 5:
Step 510: The processor 110 selects a data block in the NVRAM 130 to be deleted so its NVRAM space can be reused;
Step 520: The processor 110 checks if the data block is on the storage device's persistent medium 172 by issuing a "check block persistent" request to the storage device 120;
Step 530: If the storage device responds "yes", the processor 110 deletes the data block from the NVRAM 130.
Step 540: If the storage device responds "no", the processor 110 issues a "flush block" request to the storage device 120. Upon receiving the request, the storage device 120 writes the data block from its volatile memory 174 to its persistent medium 172 and acknowledges completion to the processor 110;
Step 550: Upon receiving the acknowledgement the processor 110 deletes the data block from the NVRAM 130.
Following the above steps ensures that in the event of a power failure the data can be recovered from either the NVRAM 130 or the storage device's persistent medium 172.
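A minimal sketch of this reclamation flow follows. The check_block_persistent() and flush_block() methods are hypothetical stand-ins for the "check block persistent" and "flush block" requests of Fig. 5; an actual storage device would expose them as device commands, and the dictionaries stand in for the volatile memory 174 and persistent medium 172.

```python
# Sketch of the Fig. 5 reclamation path (hypothetical command names, for illustration only).

class StorageDevice:
    def __init__(self):
        self.volatile = {}      # volatile memory 174
        self.persistent = {}    # persistent medium 172

    def check_block_persistent(self, lba):
        return lba in self.persistent           # the "yes"/"no" reply of Steps 520/530

    def flush_block(self, lba):
        if lba in self.volatile:                # Step 540: force the block down
            self.persistent[lba] = self.volatile.pop(lba)


def reclaim_nvram_block(nvram_blocks, device, lba):
    """Steps 510-550: delete a block from NVRAM only once it is known to be persistent."""
    if not device.check_block_persistent(lba):  # Step 520
        device.flush_block(lba)                 # Step 540 (device acknowledges when done)
    del nvram_blocks[lba]                       # Step 530 / Step 550


dev = StorageDevice()
dev.volatile[7] = b"data"
nvram = {7: b"data"}
reclaim_nvram_block(nvram, dev, 7)   # block 7 is flushed, then dropped from NVRAM
```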
Fig. 6 illustrates another embodiment of the present disclosure by replacing step 5 with the following steps:
Step 610: The processor 110 selects a data block in the NVRAM 130 to be deleted so its space can be reused;
Step 620: The processor 110 issues a "make block persistent" request to the storage device 120. Upon receiving the request, the storage device 120 checks if the data block is on its persistent medium. If not, the storage device 120 writes the data block from its volatile memory 174 to its persistent medium 172. The storage device then acknowledges completion to the processor 110;
Step 630: The processor 110 receives the completion acknowledgement for the "make block persistent" request;
Step 640: The processor 110 deletes the data block from the NVRAM 130.
Following the above steps ensures that in the event of a power failure the data can be recovered from either the NVRAM 130 or the storage device's persistent medium 172.
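For comparison, a sketch of the Fig. 6 variant from the storage device's side is shown below; make_block_persistent() is again a hypothetical command name used only for illustration.

```python
# Device-side sketch of the Fig. 6 "make block persistent" request (Step 620).

class StorageDevice:
    def __init__(self):
        self.volatile = {}      # volatile memory 174
        self.persistent = {}    # persistent medium 172

    def make_block_persistent(self, lba):
        # If the block is still only in volatile memory, push it to the persistent medium.
        if lba not in self.persistent and lba in self.volatile:
            self.persistent[lba] = self.volatile.pop(lba)
        return "done"           # completion acknowledgement consumed in Steps 630-640


def reclaim_nvram_block(nvram_blocks, device, lba):
    device.make_block_persistent(lba)   # Step 620
    del nvram_blocks[lba]               # Step 640, after the acknowledgement (Step 630)
```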
In some embodiments the NVRAM device 130 is configured to be much larger than the aggregate size of the volatile memory in the storage devices 120. This allows the processor 110 to delay deleting a data block from the NVRAM for as long as possible, so the data is more likely to have been flushed to the persistent medium 172 before a check block persistent or make block persistent request is issued. In some embodiments the storage device 120 always writes data from its volatile memory 174 to its persistent medium 172 in first-in, first-out (FIFO) order. In these embodiments the processor may calculate whether a data block is on the persistent medium without issuing the check block persistent or make block persistent request.
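The disclosure does not spell out how that calculation is performed. One plausible bookkeeping scheme, sketched below under the stated FIFO assumption and an assumed bound on the device's volatile-memory size, is to track how much newer data has been sent to the device since a given block was written; the class and threshold below are assumptions, not the patented method.

```python
# One plausible way to "calculate" persistence under FIFO flushing. If the device
# flushes its volatile memory strictly first-in, first-out and that memory holds at
# most `buffer_bytes`, then any block written more than `buffer_bytes` of traffic
# ago must already have reached the persistent medium.

class FifoPersistenceTracker:
    def __init__(self, buffer_bytes):
        self.buffer_bytes = buffer_bytes   # assumed capacity of the device's volatile memory
        self.bytes_sent = 0                # running count of bytes written to the device
        self.sent_at = {}                  # lba -> counter value when the block was sent

    def record_write(self, lba, nbytes):
        self.sent_at[lba] = self.bytes_sent
        self.bytes_sent += nbytes

    def is_persistent(self, lba):
        # Conservative: persistent once enough newer data has displaced it from the buffer.
        return self.bytes_sent - self.sent_at[lba] >= self.buffer_bytes


tracker = FifoPersistenceTracker(buffer_bytes=8 * 1024 * 1024)
tracker.record_write(lba=1, nbytes=4096)
for i in range(2, 3000):                   # roughly 12 MiB of subsequent writes
    tracker.record_write(lba=i, nbytes=4096)
print(tracker.is_persistent(1))            # True under the FIFO assumption
```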
Fig. 7 illustrates one embodiment of a storage system 100 that includes a RAID (Redundant Array of Independent Disks) system 150 between the processor 110 and the storage devices 120. Examples of RAID systems include software RAID, a hardware RAID card, RAID on a chip, Erasure Coding, and JBOD (Just a Bunch of Disks). The RAID system may be configured in write-through (WT) or write-back (WB) mode. The RAID system 150 virtualizes multiple storage devices 120 into logical units. The RAID system 150 may be implemented to distribute data blocks across the storage devices 120 (i.e., striping) and generate checksums (i.e., parity) for data redundancy and recovery. However, under transactional workloads the RAID system introduces read-modify-write operations that reduce write performance. In some embodiments the storage devices 120 are SSDs. Small transactional writes to an SSD cause write amplification, which reduces SSD endurance and write performance.
The present disclosure provides methods for reducing read-modify-write operations and SSD write amplification. In one embodiment the processor 110 maintains metadata 160 that maps every data block's LBA (Logical Block Address) to its PBA (Physical Block Address). The LBA is the virtual block address assigned and accessed by the application 140, whereas the PBA represents the block's physical location in the RAID system 150. In some embodiments the metadata 160 may map every data block's LBA to its content ID (e.g., a content fingerprint) and every content ID to its PBA. Fig. 8 illustrates one embodiment wherein the storage devices 120 in the RAID system 150 are managed as one or more Beads 186. Each Bead comprises one or more contiguous RAID stripes 184. Each RAID stripe comprises one chunk 182 from each storage device. The processor 110 is configured to fill one or more Beads (the current Beads) before writing to new Beads. Fig. 9 illustrates the write data flow:
Step 710: Upon receiving an application write request, the processor 110 commits the write data to the NVRAM 130 and acknowledges completion to the application 140. The processor 110 accumulates one or more data blocks in the NVRAM 130 into a full RAID stripe 184;
Step 720: The processor 110 checks if the current Bead is filled;
Step 730: If not, the processor 110 writes the full RAID stripe in one or more transfers after the existing RAID stripes in the current Bead;
Step 740: If yes, the processor 110 writes the RAID stripe in one or more transfers at the beginning of a new Bead;
Step 750: The processor 110 updates the metadata 160 to map the LBA of each data block in the RAID stripe to its PBA;
Step 760: At a later time the processor 110 deletes the data blocks from the NVRAM so their NVRAM space can be reused.
The above write data flow ensures that the RAID system 150 receives mostly full stripe write (FSW) requests, which cause fewer or no read-modify-write operations. It also ensures that most data is written to each SSD in contiguous chunks (sequential writes), which reduces the SSD's write amplification.
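A compact sketch of this write flow (Fig. 9) is given below. The stripe and Bead sizes, the flat PBA numbering, and the class name WritePath are illustrative assumptions; the actual full-stripe transfer to the RAID system 150 is reduced to a comment.

```python
# Sketch of the Fig. 9 flow: accumulate blocks into a full stripe, append the stripe
# to the current Bead (or start a new one), then update the LBA -> PBA metadata.

BLOCKS_PER_STRIPE = 8       # assumed data blocks per RAID stripe
STRIPES_PER_BEAD = 1024     # assumed contiguous stripes per Bead

class WritePath:
    def __init__(self):
        self.nvram = {}                 # lba -> data, pending blocks (NVRAM 130)
        self.metadata = {}              # lba -> pba (metadata 160)
        self.current_bead = 0
        self.stripes_in_bead = 0
        self.next_pba = 0               # simplified: a real PBA would encode device/offset

    def handle_write(self, lba, data):
        self.nvram[lba] = data          # Step 710: commit to NVRAM, acknowledge application
        if len(self.nvram) >= BLOCKS_PER_STRIPE:
            self._write_full_stripe()

    def _write_full_stripe(self):
        if self.stripes_in_bead >= STRIPES_PER_BEAD:     # Step 720: is the current Bead filled?
            self.current_bead += 1                       # Step 740: start a new Bead
            self.stripes_in_bead = 0
        stripe = list(self.nvram.items())[:BLOCKS_PER_STRIPE]
        # Steps 730/740: one full-stripe write to the RAID system would happen here.
        for lba, _data in stripe:
            self.metadata[lba] = self.next_pba           # Step 750: map LBA -> PBA
            self.next_pba += 1
            del self.nvram[lba]                          # Step 760: reclaim NVRAM space
        self.stripes_in_bead += 1
```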
In some embodiments the processor 110 makes the metadata 160 persistent by writing metadata updates to the RAID system 150. Metadata updates typically have small transfer sizes and are another source of read-modify-writes and SSD write amplification. Fig. 10 illustrates the metadata update data flow for reducing read-modify-writes and write amplification:
Step 810: The processor 110 commits a metadata update to the NVRAM 130;
Step 820: The processor 110 accumulates one or more metadata updates in the NVRAM into one RAID stripe 184;
Step 830: The processor 110 writes the RAID stripe in one or more transfers to the current Bead or a new Bead;
Step 840: The processor 110 updates the metadata index to the PBA of the on-disk metadata structure;
Step 850: The processor 110 deletes the metadata updates from the NVRAM 130 so their NVRAM space can be reused.
In some embodiments the metadata updates have their own Beads separate from data Beads. In other embodiments the metadata updates are mixed with data in the same Beads.
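A similar sketch for the metadata update flow of Fig. 10 follows; the function name, the updates-per-stripe constant, and the flat PBA counter are assumptions for illustration only.

```python
# Sketch of the Fig. 10 flow: metadata updates are staged in NVRAM, packed into a
# full stripe, written to the current (or a new) Bead, and the metadata index is
# then pointed at the on-disk PBA of each update.

UPDATES_PER_STRIPE = 8      # assumed metadata updates per RAID stripe

def flush_metadata_updates(nvram_updates, metadata_index, next_pba):
    """nvram_updates: list of (key, update) pairs staged in NVRAM (Steps 810/820).
    metadata_index: key -> on-disk PBA of the latest persisted update (Step 840).
    Returns the next free PBA."""
    while len(nvram_updates) >= UPDATES_PER_STRIPE:
        stripe = [nvram_updates.pop(0) for _ in range(UPDATES_PER_STRIPE)]
        # Step 830: the stripe would be written to the current Bead or a new Bead here.
        for key, _update in stripe:
            metadata_index[key] = next_pba   # Step 840: index now points at the on-disk copy
            next_pba += 1
        # Step 850: the popped updates' NVRAM space is now free for reuse.
    return next_pba
```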
The foregoing description and the accompanying drawings represent non-limiting examples of the methods and apparatus taught herein. As such, the present disclosure is not limited by the foregoing description and accompanying drawings. Instead, the present disclosure is limited only by the following claims and their legal equivalents.

Claims

What is claimed is:
1. A storage system configured to prevent data loss in the event of power failure, the storage system comprising:
a processor;
one or more storage devices, each of the one or more storage devices
comprising a volatile memory and a persistent medium; and a non-volatile memory (NVRAM) device configured to store one or more data blocks to be sent to a storage device for persistent storage;
wherein the processor is characterized by checking whether a data block stored on the NVRAM device is also stored on the storage device's persistent medium before deleting the data block from the NVRAM.
2. The storage system of claim 1, wherein the processor is configured to request that the storage device make the data block persistent if the data block is not stored on its persistent medium.
3. The storage system of claim 1, wherein the processor is configured to make a data block persistent before deleting it from the NVRAM by issuing a "make block persistent" request to the storage device, wherein the storage device handles the "make block persistent" request by:
checking whether the data block is stored in the storage device's volatile
memory;
if the data block is stored in its volatile memory, transferring the data block to its persistent medium; and
acknowledging to the processor that the data block is stored on the persistent medium.
4. The storage system of claim 1, wherein the NVRAM is a battery-backed DRAM, NVDIMM, NVRAM card, solid-state device, or other non-volatile memory device.
5. The storage system of claim 1, wherein the NVRAM is mirrored to one or more storage systems for redundancy.
6. The storage system of claim 1, wherein the one or more storage devices are solid-state devices (SSDs), traditional hard disk drives (HDDs), a combination of SSDs and HDDs, or other persistent medium devices.
7. The storage system of claim 1, wherein the one or more storage devices are deployed in a Redundant Array of Inexpensive Disks (RAID) system; wherein the RAID system takes the form of a software RAID, hardware RAID card, RAID on a chip, or other RAID technologies.
8. The storage system of claim 1, wherein the one or more storage devices are deployed in an erasure coding system; wherein the erasure coding system takes the form of a software, hardware card, a chip, or other erasure coding technologies.
9. The storage system of claim 1, wherein the one or more storage devices are deployed in a Just-a-Bunch-of-Disks (JBOD) system; wherein the JBOD system takes the form of a software, hardware card, a chip, or other JBOD technologies.
10. The storage system of claim 1, where the storage devices can be accessed from one or more storage systems.
11. A storage system configured to reduce read-modify-write operations and write amplification, the storage system comprising:
a processor;
a Redundant Array of Inexpensive Disks (RAID) system with one or more storage devices;
a non-volatile memory (NVRAM) device, wherein the NVRAM device stores one or more data blocks to be sent to the RAID system for persistent storage; and
a memory that stores metadata, wherein the metadata maps the one or more data blocks' logical block addresses (LBAs) to their physical block addresses (PBAs);
wherein the processor is configured to handle random write requests from an application by:
storing a data block in the NVRAM device; acknowledging to the application that the write request has been
committed;
accumulating one or more data blocks in the NVRAM into a full RAID stripe;
writing the full RAID stripe to the RAID system to reduce read-modify-write operations;
updating the metadata to map the LBAs of the one or more data blocks to the PBAs of the one or more data blocks;
deleting the one or more data blocks from the NVRAM device.
12. The storage system of claim 11, wherein the processor is configured to make the metadata updates persistent by:
storing each metadata update in the NVRAM device;
accumulating one or more metadata updates with or without data blocks in the NVRAM device into full RAID stripes;
writing full RAID stripes to the RAID system to reduce read-modify-write
operations;
updating metadata index with on-disk metadata PBAs; and
deleting the metadata updates from the NVRAM device so that the NVRAM device can be used for new metadata updates.
13. The storage system of claim 11, wherein the RAID system is operating in a write-through (WT) mode.
14. The storage system of claim 11, wherein one or more data and metadata blocks are written to each storage device in contiguous chunks sequentially to reduce the storage device's write amplification.
15. The storage system of claim 11, wherein the NVRAM is a battery-backed DRAM, NVDIMM, NVRAM card, solid-state device, or other non-volatile memory device.
16. The storage system of claim 11, wherein the NVRAM is mirrored to one or more storage systems for redundancy.
17. The storage system of claim 11, wherein the one or more storage devices are solid-state devices (SSDs), traditional hard disk drives (HDDs), a combination of SSDs and HDDs (hybrid), or other persistent medium devices.
18. The storage system of claim 11, wherein the one or more storage devices are deployed in a RAID (Redundant Array of Inexpensive Disks) system; wherein the RAID system takes the form of a software RAID, hardware RAID card, RAID on a chip, or other RAID technologies.
19. The storage system of claim 11, wherein the one or more storage devices are deployed in an erasure coding system; wherein the erasure coding system takes the form of a software, hardware card, a chip, or other erasure coding technologies.
20. The storage system of claim 11, where the one or more storage devices are deployed in a Just-a-Bunch-of-Disks (JBOD) system; wherein the JBOD system takes the form of a software, hardware card, a chip, or other JBOD technologies.
21. The storage system of claim 11, where the storage devices can be accessed from one or more storage systems.
22. A method implemented in a storage system for preventing data loss, said storage system comprising a processor, one or more storage devices, and a non-volatile memory (NVRAM) device, said one or more storage devices comprising a persistent medium and a volatile memory, said method is characterized by:
sending data stored in the NVRAM device to the one or more storage devices; and
checking whether a data block stored on the NVRAM device is also stored on the storage device's persistent medium before deleting the data from the NVRAM.
23. The method of claim 22, further comprising:
requesting that the storage device make the data block persistent if the data block is not stored on the persistent medium.
24. The method of claim 22, further comprising:
issuing a "make persistent" request to the storage device to make a data block persistent before deleting it from the NVRAM; wherein the storage device handles the "make persistent request by:
checking whether the data block is stored in the storage device's volatile memory;
if the data block is stored in its volatile memory, transferring the data block to its persistent medium; and
acknowledging to the processor that the data block is stored on the persistent medium.
25. A method implemented in a storage system for reducing read-modify-write operations and write amplification, said storage system comprising a processor, a RAID system with one or more storage devices, a non-volatile memory (NVRAM) device, and a memory that stores metadata, said method comprising:
receiving a write request from an application to write a data block to a storage device;
storing the data block on the NVRAM device; and
acknowledging to the application that the write request has been committed; wherein the method is further characterized by:
accumulating one or more data blocks in the NVRAM into a full RAID stripe;
writing the full RAID stripe to the RAID system to reduce read-modify-write operations;
updating the metadata to map the LBAs of the one or more data blocks to the PBAs of the one or more data blocks; and
deleting the one or more data blocks from the NVRAM device.
26. The method of claim 25, further comprising:
storing each metadata update in the NVRAM device;
accumulating one or more metadata updates with or without data blocks in the
NVRAM device into full RAID stripes;
writing full RAID stripes to the RAID system to reduce read-modify-write
operations;
updating metadata index with on-disk metadata PBAs; and
deleting the metadata updates from the NVRAM so that the NVRAM device can be used for new metadata updates.
27. The method of claim 25, wherein the RAID system is operating in a write-through (WT) mode.
28. The method of claim 25, wherein one or more data and metadata blocks are written to each storage device in contiguous chunks sequentially to reduce the storage device's write amplification.
29. A storage system configured to prevent data loss in the event of power failure, the storage system comprising:
a processor;
one or more storage devices, each of the one or more storage devices
comprising a persistent medium and a volatile memory; and a non-volatile memory (NVRAM) device configured to store one or more data blocks to be sent to a storage device for persistent storage;
wherein the NVRAM device is characterized by being configured to be larger than the aggregate size of the volatile memory, and wherein the processor is characterized by being configured to delay deleting a data block from the NVRAM to allow the data to be flushed to the persistent medium before a check block persistent request or a make block persistent request is received.
30. The storage system of claim 29, wherein each of the one or more storage devices is configured to write data from the volatile memory to the persistent medium first in first out (FIFO) and wherein the processor is configured to calculate whether a data block is on the persistent medium of a storage device without issuing the check block persistent or make block persistent request.
31. A method of preventing data loss in the event of power failure, said method being implemented in a storage system that comprises a processor, one or more storage devices, and a non-volatile memory (NVRAM) device, said method comprising:
configuring the NVRAM device to be larger than the aggregate size of the volatile memory; storing one or more data blocks in the NVRAM device; and
sending the one or more data blocks stored in the NVRAM device to a storage device for persistent storage;
said method further characterized by:
delaying deleting a data block from the NVRAM to allow the data to be flushed to the persistent medium before a check block persistent request or a make block persistent request is received.
32. The method of claim 31, wherein the one or more storage devices are configured to write data from the volatile memory to the persistent medium first in first out (FIFO), and wherein the processor is configured to calculate whether a data block is on the persistent medium of a storage device without issuing the check block persistent or make block persistent request.
PCT/US2015/046534 2014-08-25 2015-08-24 Nvram enabled storage systems WO2016032955A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201462041318P 2014-08-25 2014-08-25
US62/041,318 2014-08-25

Publications (2)

Publication Number Publication Date
WO2016032955A2 true WO2016032955A2 (en) 2016-03-03
WO2016032955A3 WO2016032955A3 (en) 2016-04-21

Family

ID=55400802

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/046534 WO2016032955A2 (en) 2014-08-25 2015-08-24 Nvram enabled storage systems

Country Status (1)

Country Link
WO (1) WO2016032955A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106356097A (en) * 2016-08-25 2017-01-25 浙江宇视科技有限公司 Protection method and device for preventing data loss
WO2018022091A1 (en) * 2016-07-29 2018-02-01 Hewlett-Packard Development Company, L.P. Unlocking machine-readable storage devices using a user token
CN114201115A (en) * 2021-12-14 2022-03-18 北京达佳互联信息技术有限公司 Data storage system, method, computer device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8074019B2 (en) * 2007-11-13 2011-12-06 Network Appliance, Inc. Preventing data loss in a storage system
US7761740B2 (en) * 2007-12-13 2010-07-20 Spansion Llc Power safe translation table operation in flash memory
US10346095B2 (en) * 2012-08-31 2019-07-09 Sandisk Technologies, Llc Systems, methods, and interfaces for adaptive cache persistence
US9081712B2 (en) * 2012-12-21 2015-07-14 Dell Products, L.P. System and method for using solid state storage systems as a cache for the storage of temporary data

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018022091A1 (en) * 2016-07-29 2018-02-01 Hewlett-Packard Development Company, L.P. Unlocking machine-readable storage devices using a user token
CN106356097A (en) * 2016-08-25 2017-01-25 浙江宇视科技有限公司 Protection method and device for preventing data loss
CN106356097B (en) * 2016-08-25 2020-02-14 浙江宇视科技有限公司 Protection method and device for preventing data loss
CN114201115A (en) * 2021-12-14 2022-03-18 北京达佳互联信息技术有限公司 Data storage system, method, computer device and storage medium

Also Published As

Publication number Publication date
WO2016032955A3 (en) 2016-04-21

Similar Documents

Publication Publication Date Title
US9781227B2 (en) Lockless distributed redundant storage and NVRAM caching of compressed data in a highly-distributed shared topology with direct memory access capable interconnect
US20190073296A1 (en) Systems and Methods for Persistent Address Space Management
EP2598996B1 (en) Apparatus, system, and method for conditional and atomic storage operations
US8898376B2 (en) Apparatus, system, and method for grouping data stored on an array of solid-state storage elements
US10127166B2 (en) Data storage controller with multiple pipelines
US8495284B2 (en) Wear leveling for low-wear areas of low-latency random read memory
JP6208156B2 (en) Replicating a hybrid storage aggregate
US8782344B2 (en) Systems and methods for managing cache admission
US10019320B2 (en) Systems and methods for distributed atomic storage operations
US9251052B2 (en) Systems and methods for profiling a non-volatile cache having a logical-to-physical translation layer
US10019352B2 (en) Systems and methods for adaptive reserve storage
EP2802991B1 (en) Systems and methods for managing cache admission
US9134917B2 (en) Hybrid media storage system architecture
US20130166820A1 (en) Methods and appratuses for atomic storage operations
US20180081821A1 (en) Metadata Management in a Scale Out Storage System
US10810123B1 (en) Flush strategy for using DRAM as cache media system and method
US20140006685A1 (en) Systems, methods, and interfaces for managing persistent data of atomic storage operations
US20150095696A1 (en) Second-level raid cache splicing
US20210311652A1 (en) Using Segment Pre-Allocation to Support Large Segments
US20210311653A1 (en) Issuing Efficient Writes to Erasure Coded Objects in a Distributed Storage System with Two Tiers of Storage
US11467746B2 (en) Issuing efficient writes to erasure coded objects in a distributed storage system via adaptive logging
WO2016032955A2 (en) Nvram enabled storage systems
US8402247B2 (en) Remapping of data addresses for large capacity low-latency random read memory
US20180307419A1 (en) Storage control apparatus and storage control method
US11314809B2 (en) System and method for generating common metadata pointers

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15834910

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC, EPO FORM 1205A DATED 26.06.2017

122 Ep: pct application non-entry in european phase

Ref document number: 15834910

Country of ref document: EP

Kind code of ref document: A2