Apparatus and method for accessing data at a storage node

Publication number: CN118259829A (legal status: pending)
Application number: CN202311813648.6A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 奇亮奭, 朴赞益, 柳星旭
Assignee: Samsung Electronics Co Ltd
Application filed by: Samsung Electronics Co Ltd
Priority claimed from US application 18/375,449 (published as US 2024/0211406 A1)

Abstract

An apparatus and method for accessing data at a storage node are provided. The apparatus may include a storage node comprising: a first interface for communicating with a first memory medium, a second interface for communicating with a second memory medium, and at least one control circuit configured to: transmit location information of data stored in the first memory medium from the storage node, and transmit the data from the storage node using a memory access scheme. The at least one control circuit may be configured to operate at least a portion of the first memory medium as a cache for at least a portion of the second memory medium. The at least one control circuit may be configured to transmit the location information using the memory access scheme. The at least one control circuit may be configured to update the location information to generate updated location information, and to transmit the updated location information from the storage node.

Description

Apparatus and method for accessing data at a storage node
Technical Field
The present disclosure relates generally to accessing data, and more particularly, to systems, methods, and devices for accessing data from memory or storage devices at storage nodes.
Background
The storage node may include one or more storage devices configured to store data. The storage node may process requests to access one or more storage devices. For example, the storage node may process the write request by storing the write data in at least one of the one or more storage devices. As another example, the storage node may process the read request by retrieving the requested data from at least one of the one or more storage devices and returning the retrieved data and a response to the read request.
The above information disclosed in this background section is only for enhancement of understanding of the background of the present disclosure, and therefore it may contain information that does not constitute prior art.
Disclosure of Invention
An apparatus, the apparatus comprising: a storage node comprising: a first interface for communicating with a first memory medium; a second interface for communicating with a second memory medium; and at least one control circuit configured to: transmit location information of data stored in the first memory medium from the storage node; and transmit the data from the storage node using a memory access scheme.
In one embodiment, the at least one control circuit is configured to operate at least a portion of the first memory medium as a cache for at least a portion of the second memory medium.
In one embodiment, the at least one control circuit is configured to transmit the location information using the memory access scheme.
In one embodiment, the at least one control circuit is configured to: receive a request for the location information; and transmit the location information based on the request.
In one embodiment, the at least one control circuit is configured to: update the location information to generate updated location information; and transmit the updated location information from the storage node.
In one embodiment, the transmission of the updated location information is caused by the storage node.
In one embodiment, the at least one control circuit is configured to: receive a request to transmit the data; and transmit the data from the storage node using the memory access scheme based on the request.
In one embodiment, the request to transmit the data comprises a command.
In one embodiment, the storage node comprises a network adapter; and the network adapter includes at least a portion of the memory access scheme.
An apparatus, the apparatus comprising: a node comprising at least one control circuit configured to: transmit data from the node; receive location information of the data at the node; and transfer the data to the node using a memory access scheme based on the location information.
In one embodiment, the location information identifies a memory medium.
In one embodiment, the location information identifies a location within the memory medium.
In one embodiment, the location information identifies a cache for the data.
In one embodiment, the at least one control circuit is configured to: transmit a request for the location information from the node; and receive the location information at the node based on the request.
In one embodiment, the at least one control circuit is configured to store a data structure comprising the location information.
In one embodiment, the at least one control circuit is configured to: receive updated location information at the node; and modify the data structure based on the updated location information.
In one embodiment, the node comprises a network adapter; and the network adapter includes at least a portion of the memory access scheme.
In one embodiment, the at least one control circuit is configured to transfer the data to the node based on a request for the memory access scheme.
A method, the method comprising: receiving data at a first node; storing at least a portion of the data in a cache at the first node; transmitting location information of the at least a portion of the data from the first node to a second node; and transmitting the at least a portion of the data from the cache to the second node using a memory access scheme based on the location information.
In one embodiment, the transmitting of the location information is performed using the memory access scheme.
Drawings
The figures are not necessarily drawn to scale, and elements of similar structure or function may generally be represented by like reference numerals or portions thereof throughout the figures for illustrative purposes. The drawings are only intended to facilitate the description of the various embodiments described herein. The drawings do not depict every aspect of the teachings disclosed herein and do not limit the scope of the claims. To prevent the drawings from becoming obscured, not all components, connections, etc. may be shown, and not all components may have reference numerals. However, the component configurations may be readily apparent from the drawings. The accompanying drawings illustrate example embodiments of the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates an embodiment of a storage node and associated method according to a disclosed example embodiment.
FIG. 2 illustrates an example embodiment of a method and apparatus for accessing data at a storage node in accordance with the disclosed example embodiments.
Fig. 3 illustrates an embodiment of a scheme and/or memory access scheme for accessing data at a storage node using location information, according to a disclosed example embodiment.
FIG. 4 illustrates an embodiment of a scheme for accessing data at a storage node using a mirrored data structure in accordance with a disclosed example embodiment.
Fig. 5 illustrates an example embodiment of a storage node and method for using location information and/or a memory access scheme in accordance with the disclosed example embodiments.
Fig. 6 illustrates an example embodiment of a method and apparatus for accessing data at a storage node using location information and/or a memory access scheme, according to an example embodiment of the disclosure.
Fig. 7 illustrates an embodiment of a memory device according to a disclosed example embodiment.
Fig. 8 illustrates an example embodiment of a node device according to an example embodiment of the disclosure.
Fig. 9 illustrates an example embodiment of a storage device according to an example embodiment of the disclosure.
Fig. 10 illustrates an embodiment of a method for accessing data at a storage node in accordance with a disclosed example embodiment.
Detailed Description
The storage node may include one or more storage devices configured to store data. The storage node may also include one or more processors (e.g., central Processing Units (CPUs)) that may implement an input and/or output (I/O or IO) stack to process requests to access the storage. The storage node may also include one or more types of caches that may improve access latency by storing copies of data stored in the storage device in relatively fast types of memory. Read requests received at a storage node may proceed through the IO stack for further processing by the CPU, which may retrieve the requested data from a cache (e.g., cache hit) or from a storage device (e.g., cache miss). The CPU may send back a response to the request and/or the retrieved data through the IO stack.
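For purposes of illustration only, the following Python sketch (not part of the original disclosure; names such as io_stack and handle_read are assumed placeholders) models the conventional read path described above: a request passes through an IO stack, the CPU checks a cache, and a miss falls back to the storage device.
```python
# Minimal sketch of the conventional read path at a storage node.
# All names are illustrative placeholders; real IO stacks involve many more layers.

CACHE = {}            # LBA -> cached data block
STORAGE_DEVICE = {}   # LBA -> data block on the primary storage device


def io_stack(handler):
    """Stand-in for the network IO stack layers a request passes through."""
    def wrapped(request):
        # socket -> TCP -> IP -> driver layers would each add processing time here
        response = handler(request)
        # the response travels back down through the same layers
        return response
    return wrapped


@io_stack
def handle_read(request):
    lba = request["lba"]
    data = CACHE.get(lba)           # cache hit: serve from the faster memory
    if data is None:
        data = STORAGE_DEVICE[lba]  # cache miss: read from the storage device
        CACHE[lba] = data           # optionally populate the cache
    return {"lba": lba, "data": data}


if __name__ == "__main__":
    STORAGE_DEVICE[42] = b"example block"
    print(handle_read({"lba": 42}))   # miss on the first read
    print(handle_read({"lba": 42}))   # hit on subsequent reads
```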
However, processing a request with an IO stack may result in relatively high latency, for example, because the request may progress through one or more successive layers of the IO stack. The latency may be particularly disadvantageous, for example, for requests that access relatively small data payloads.
A storage node according to disclosed example embodiments may provide location information to enable a user to determine one or more locations (e.g., one or more cache locations) where data may be stored at the storage node. Using the location information, the user may access the data in a manner that reduces latency, improves bandwidth, etc., depending on implementation details. For example, the user may access data from a cache using a memory access scheme (such as a Remote Direct Memory Access (RDMA) protocol), which may bypass some or all of the IO stack depending on implementation details.
The storage node may provide location information using various techniques according to disclosed example embodiments. For example, in some embodiments, a storage node may use a data structure (such as a hash table) to track one or more cache locations where data may be stored. The storage node may enable a user to access the data structure, for example, using a memory access scheme (such as RDMA). Additionally or alternatively, the storage node may communicate at least a portion of the data structure to the user and/or maintain at least a portion of the data structure at the user. This may enable a user to determine location information without accessing the data structure at the storage node, which may further reduce latency, increase bandwidth, etc., depending on implementation details.
In some embodiments, the storage node may use a first storage device as a cache for an additional storage device. For example, the storage node may use a Solid State Drive (SSD) as a cache (e.g., a flash cache) for a Hard Disk Drive (HDD). In such embodiments, the storage node may enable a user to access data located at the cache storage using a protocol such as Non-Volatile Memory Express (NVMe) over Fabrics (NVMe-oF), which may use, for example, RDMA as an underlying transport scheme.
The present disclosure encompasses many principles related to accessing data at a storage node. The principles disclosed herein may have independent utility and may be implemented individually, and not every embodiment may utilize every principle. Furthermore, the principles may be implemented in various combinations, some of which may amplify some of the benefits of the various principles in a synergistic manner.
FIG. 1 illustrates an embodiment of a storage node and associated method according to a disclosed example embodiment. The storage node 102 shown in FIG. 1 may include a CPU 104, a communication interface 106, and one or more interfaces 108A, 108B, 108C, 108D, ... configured to communicate with one or more memory media 110A, 110B, 110C, 110D, ..., respectively. (The one or more interfaces 108A, 108B, 108C, 108D, ... may be referred to collectively and/or individually as 108, and the one or more memory media 110A, 110B, 110C, 110D, ... may be referred to collectively and/or individually as 110.)
The memory medium 110 is not limited to any particular type of memory medium. For example, one or more of the memory media 110 may be implemented with volatile memory media, such as Static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), etc., or any combination thereof. As another example, one or more of the memory media 110 may be implemented with non-volatile memory media including solid-state media, magnetic media, optical media, and the like, or any combination thereof. Examples of solid state media may include flash memory (such as NAND flash memory), persistent memory (PMEM) (such as cross-grid non-volatile memory), memory with bulk resistance change, phase Change Memory (PCM), and the like, or any combination thereof.
The memory medium 110 is not limited to any particular physical configuration, form factor, or the like. For example, one or more of the memory media 110 may be configured as an integrated circuit (e.g., with solder, socket, etc.) attached to a circuit board. As another example, one or more of the memory media 110 may be configured as modules, adapter cards, or the like (such as single in-line memory modules (SIMMs) or dual in-line memory modules (DIMMs), Peripheral Component Interconnect Express (PCIe) interposer cards, or the like, connected to a circuit board using connectors). As a further example, one or more of the memory media 110 may be configured as a storage device in any form factor (such as any of 3.5 inch, 2.5 inch, 1.8 inch, M.2, Enterprise and Data Center SSD Form Factor (EDSFF), SFF-TA-100X form factors (e.g., SFF-TA-1002), NF1, etc.) using any connector configuration (such as Serial ATA (SATA), Small Computer System Interface (SCSI), Serial Attached SCSI (SAS), M.2, U.2, U.3, etc.).
The interfaces 108 are not limited to any particular type of interface and may be implemented based on the type of memory medium with which they may be used. For example, one or more of the interfaces 108 may be implemented with any generation of Double Data Rate (DDR) interface (e.g., DDR4, DDR5, etc.), Open Memory Interface (OMI), etc. As another example, one or more of the interfaces 108 may be implemented with interconnect interfaces and/or protocols such as PCIe, Non-Volatile Memory Express (NVMe), NVMe Key-Value (NVMe-KV), SATA, SAS, SCSI, Compute Express Link (CXL) and/or one or more CXL protocols (such as CXL.mem, CXL.cache, and/or CXL.io), Gen-Z, Coherent Accelerator Processor Interface (CAPI), Cache Coherent Interconnect for Accelerators (CCIX), and the like. As further examples, one or more of the interfaces 108 may be implemented with network interfaces and/or protocols (such as Ethernet, Transmission Control Protocol/Internet Protocol (TCP/IP), Remote Direct Memory Access (RDMA), RDMA over Converged Ethernet (RoCE), Fibre Channel, InfiniBand (IB), Internet Wide Area RDMA Protocol (iWARP), NVMe over Fabrics (NVMe-oF), etc.), or any combination thereof.
Although the memory media 110 and interfaces 108 are not limited to any particular types, for purposes of illustration, the interfaces 108 and memory media 110 may be implemented with the following example memory media and/or interfaces as shown in FIG. 1. Memory medium 110A may be implemented with DRAM (e.g., as a DIMM module), and interface 108A may be implemented with a DDR interface. Memory medium 110B may be implemented with PMEM (e.g., cross-grid non-volatile memory), and interface 108B may be implemented with a DDR interface. Memory medium 110C may be implemented with NAND flash memory configured as a storage device (e.g., an SSD), and interface 108C may be implemented with the NVMe protocol using a PCIe interface. (Alternatively, memory medium 110C may be implemented with NAND flash memory configured as a storage device (e.g., an SSD), and interface 108C may be implemented with the NVMe-oF protocol using RDMA as the underlying transport.) Memory medium 110D may be implemented with magnetic media configured as a storage device (e.g., an HDD), and interface 108D may be implemented with a SAS interface. Further, although one of each type of memory medium 110 is shown in FIG. 1, some embodiments may include multiple instances of each type of memory medium 110 and/or interface 108, fewer memory media 110 and/or interfaces 108, and/or additional types of memory media 110 and/or interfaces 108.
Storage node 102 is not limited to any particular physical form. For example, storage node 102 may be implemented in whole or in part with and/or used in combination with one or more personal computers, servers, server enclosures, server racks, data rooms, data centers, edge data centers, mobile edge data centers, and/or any combination thereof.
The CPU 104 may be implemented with one or more processing circuits (e.g., to enable the CPU 104 to operate as one or more control circuits) having one or more cores 105 that execute instructions stored in any type of memory, or any combination thereof. The one or more cores 105 may be based on, for example, one or more Complex Instruction Set Computer (CISC) processors (such as x86 processors) and/or Reduced Instruction Set Computer (RISC) processors (such as ARM processors), Graphics Processing Units (GPUs), Neural Processing Units (NPUs), Tensor Processing Units (TPUs), etc. The CPU 104 may also include any type of circuitry, including combinational logic, sequential logic, one or more timers, counters, registers, and/or state machines, one or more Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), etc., to implement one or more functions, features, etc. (e.g., to operate as one or more control circuits).
The communication interface 106 may be implemented with any type of interconnect interface (including those mentioned above), network interface (including those mentioned above), etc., or any combination thereof. The CPU 104 may implement the IO stack 112 (e.g., as part of an operating system (e.g., Linux) kernel run by the CPU). The IO stack 112 may enable the CPU 104 and/or one or more applications, processes, services, etc. running on the CPU 104 to communicate via the communication interface 106. For example, in embodiments in which the communication interface 106 is implemented with an Ethernet interface, the IO stack 112 may implement one or more layers including a socket layer, a TCP layer, an IP layer, a driver layer, and the like.
In some embodiments, the CPU may configure and/or operate a portion of one of the memory media 110 as a cache for a portion of another one of the memory media 110. For example, in some embodiments, one or more HDDs 110D may be configured and/or operated as a primary storage medium (which may also be referred to as a main storage medium or an underlying storage medium) of the storage node, and all or a portion of each of DRAM 110A, PMEM 110B, and/or SSD 110C may be configured and/or operated as a cache for all or a portion of the primary storage medium. For example, flash-based SSD 110C may operate as a flash cache for HDD 110D.
In some embodiments, all or a portion of each of DRAM 110A, PMEM 110B, and/or SSD 110C may be configured and/or operated to provide various different types of caches for the primary storage medium. Furthermore, in some embodiments, one type of memory medium 110 may include an internal cache operable as a cache within a cache. For example, SSD 110C may include a NAND flash primary storage medium and a DRAM cache operable as a cache for the NAND flash primary storage medium. In some embodiments, one or more of the memory media 110 may be configured and/or operated in a hierarchical manner. For example, SSD 110C may be configured and/or operated as a relatively large but slower second-level cache for HDD 110D, and PMEM 110B may be configured and/or operated as a relatively small but faster first-level cache for SSD 110C.
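As a rough, hypothetical illustration of such a hierarchical arrangement (a small, fast first-level cache in front of a larger but slower second-level cache, which in turn fronts the primary storage), the following sketch searches cache tiers in order of increasing latency; the tier names, capacities, and LRU policy are assumptions, not details of the present disclosure.
```python
# Illustrative multi-level cache chain: each tier is searched before the next,
# so data is returned from the fastest tier that holds a copy.

from collections import OrderedDict


class CacheTier:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.entries = OrderedDict()   # LBA -> data, in LRU order

    def get(self, lba):
        if lba in self.entries:
            self.entries.move_to_end(lba)   # refresh LRU position
            return self.entries[lba]
        return None

    def put(self, lba, data):
        self.entries[lba] = data
        self.entries.move_to_end(lba)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used block


def tiered_read(lba, tiers, primary_storage):
    """Return (tier_name, data); fall back to primary storage on a full miss."""
    for tier in tiers:                      # e.g., [PMEM, SSD], fastest first
        data = tier.get(lba)
        if data is not None:
            return tier.name, data
    data = primary_storage[lba]             # e.g., the HDD primary storage medium
    for tier in tiers:                      # populate the caches on the way back
        tier.put(lba, data)
    return "hdd", data


if __name__ == "__main__":
    pmem = CacheTier("pmem", capacity=4)     # small but fast first-level cache
    ssd = CacheTier("ssd", capacity=64)      # larger but slower second-level cache
    hdd = {7: b"block 7"}
    print(tiered_read(7, [pmem, ssd], hdd))  # ('hdd', ...) on the first read
    print(tiered_read(7, [pmem, ssd], hdd))  # ('pmem', ...) afterwards
```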
Within the storage node 102, the CPU 104 may also internally implement a data structure (such as a hash table 114) to enable the CPU 104 to track the location of data throughout the various memory media 110 of the storage node 102. For example, in embodiments in which the HDD 110D may be configured and/or operated as a primary storage medium, the storage node may receive a request to read data from the HDD 110D. The request may specify the data in the form of a Logical Block Address (LBA), a range of LBAs, a data object, a key of a key-value pair, and the like. The CPU 104 may look up an entry (e.g., one or more LBAs, objects, key-value pairs, etc.) in the hash table 114 for the requested data. If the hash table includes an entry for the requested data, this may indicate that a copy of the requested data is stored in a cache (such as in DRAM 110A, PMEM 110B, and/or SSD 110C). The CPU may use the hash table entry to retrieve the requested data from the location with the lowest latency (e.g., from DRAM 110A if located in DRAM 110A, from PMEM 110B if not located in DRAM 110A, or from SSD 110C if not located in PMEM 110B).
If the hash table 114 does not include an entry for the requested data, this may indicate that a copy of the requested data is not stored on any memory medium 110 configured as a cache (e.g., DRAM 110A, PMEM 110B, or SSD 110C), and thus, the CPU 104 may retrieve the requested data from the primary storage medium in HDD 110D.
The form of the entries in the hash table 114 may depend on the type of memory medium 110 that they reference. For example, an entry in the hash table 114 for data cached in DRAM 110A or PMEM 110B may be in the form of a pointer to a memory address, and thus, the CPU 104 may access the requested data using load and/or store operations in the memory space to which DRAM 110A or PMEM 110B may be mapped. As another example, an entry in the hash table 114 for data cached in SSD 110C may be in the form of an LBA within SSD 110C, and thus, the CPU 104 may access the requested data, for example, by sending an NVMe command for a read operation to SSD 110C. Thus, in some embodiments, the hash table 114 may be implemented with an LBA as input and a memory pointer or an LBA for a particular memory medium 110 as output. Alternatively or additionally, the hash table 114 may be implemented with any of an LBA, an object identifier, a key, etc. as input and a memory pointer, LBA, object identifier, key, etc. for a particular memory medium 110 as output.
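A simplified, hypothetical model of such a location-tracking hash table is sketched below: each entry maps a request LBA either to a memory address (for a copy cached in DRAM or PMEM) or to a device and internal LBA (for a copy cached in an SSD). The field names and record layout are assumptions for illustration only, not the actual layout of the present disclosure.
```python
# Illustrative location-tracking hash table: request LBA -> location record.
# A record carries either a memory address (DRAM/PMEM) or a device + internal LBA (SSD).

from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LocationEntry:
    medium: str                            # e.g., "dram", "pmem", or "ssd"
    memory_address: Optional[int] = None   # used when the copy is memory-mapped
    internal_lba: Optional[int] = None     # used when the copy sits in an SSD cache


class LocationTable:
    def __init__(self):
        self._table: Dict[int, LocationEntry] = {}

    def insert_memory_copy(self, lba: int, medium: str, address: int):
        self._table[lba] = LocationEntry(medium=medium, memory_address=address)

    def insert_ssd_copy(self, lba: int, internal_lba: int):
        self._table[lba] = LocationEntry(medium="ssd", internal_lba=internal_lba)

    def lookup(self, lba: int) -> Optional[LocationEntry]:
        """Return the cached location, or None when only the primary storage holds the data."""
        return self._table.get(lba)


if __name__ == "__main__":
    table = LocationTable()
    table.insert_memory_copy(lba=100, medium="pmem", address=0x7F00_0000)
    table.insert_ssd_copy(lba=200, internal_lba=4096)
    print(table.lookup(100))   # memory-mapped copy -> load/store or RDMA access
    print(table.lookup(200))   # SSD copy -> NVMe / NVMe-oF read at the internal LBA
    print(table.lookup(300))   # None -> fall back to the HDD primary storage
```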
Although hash table 114 may be conceptually illustrated as part of CPU 104, hash table 114 may be located anywhere that includes any of the internal memory (e.g., cache) and/or memory medium 110 (such as DRAM 110A) within CPU 104.
An example embodiment of a process for servicing a request to access (e.g., read) data stored at the storage node 102 may proceed as follows. The storage node 102 may receive a read request 116 from a user (such as another node, a client device, a server, a personal computer, a tablet computer, a smartphone, etc.) via the communication interface 106. The request 116 may be processed by the IO stack 112 as indicated by arrow 117. The CPU core 105 may further process the request 116 by performing a lookup 118 using the hash table 114 based on the LBA provided with the request 116. In the example shown in FIG. 1, the hash table lookup may return metadata 119 indicating that the requested data is located at an LBA within SSD 110C. The CPU core 105 may access the requested data using the LBA within SSD 110C as indicated by arrow 120. If interface 108C is implemented with PCIe, the CPU core 105 may access SSD 110C, for example, using NVMe commands. Alternatively or additionally, if interface 108C is implemented with a network (such as Ethernet, InfiniBand, etc.), the CPU core 105 may use NVMe-oF commands to access SSD 110C.
The CPU core 105 may read the requested data from the SSD 110C as indicated by arrow 121. The CPU core 105 may send a response 123 that may include, for example, the requested data. Response 123 may be processed by IO stack 112 as indicated by arrow 122 and sent to the user through communication interface 106.
Thus, the data requested by request 116 may travel through a data path that may include arrows 117, 120, 121, and/or 122. Further, the data path may proceed twice through the IO stack and may also include the CPU 104 (e.g., one or more CPU cores 105). Depending on implementation details, such a relatively long data path may result in relatively long delays and/or relatively low bandwidth, which may be particularly disadvantageous, for example, when accessing relatively small data payloads.
FIG. 2 illustrates an example embodiment of a method and apparatus for accessing data at a storage node in accordance with the disclosed example embodiments. The embodiment shown in fig. 2 may be implemented, for example, using an embodiment similar to the storage node shown in fig. 1, wherein similar elements may be referred to by reference numerals ending in and/or containing the same numbers, letters, etc.
Referring to FIG. 2, the storage node may be implemented with a storage server 202, and the storage server 202 may include an HDD storage device 210D, PMEM B and a buffer cache 210C-1. Buffer cache 210C-1 may be implemented, for example, using volatile memory (such as DRAM) that may be configured as a cache within an SSD (such as SSD 110C shown in fig. 1). Alternatively or additionally, buffer cache 210C-1 may be implemented with flash memory (e.g., as a flash cache) within an SSD (such as SSD 110C shown in FIG. 1).
The storage server 202 may also include a hash table 214 configured to track the location of data stored in PMEM 210B and/or buffer cache 210C-1, and either or both of PMEM 210B and/or buffer cache 210C-1 may be configured to cache data stored in the HDD storage device 210D. The HDD storage device 210D may be configured, for example, as a primary storage medium of the storage server 202.
The embodiment shown in FIG. 2 may also include another node, which in this example may be implemented with a database server 201. However, in other embodiments, the other node may be implemented as any type of user (such as another storage node, a client device, a server, a personal computer, a tablet computer, a smartphone, etc.). The database server 201 or other node may communicate with the storage server 202 using any type of communication interface, including the above-mentioned interconnect and/or network interfaces, protocols, etc.
In operation (1), the database server 201 may send a request to read data to the storage server 202 as indicated by arrow 224. The request may specify the requested data, for example, using an LBA. For example, the database server 201 may send the request over a network channel accessed using a network socket. In operation (2), the storage server 202 may receive the request over the network channel, for example, using a network IO stack.
At operation (3), the LBA may be used by the CPU at the storage server 202 to perform a lookup operation on the hash table 214. The hash table 214 may be arranged in rows R0, R1, ..., RN-1 of hash buckets, where the hash buckets in row R0 may be indicated as B0-0, B0-1, ..., B0-(M-1) (where M=4 in this example). In the example shown in FIG. 2, hash buckets B0-1, B0-2, and B0-3 may include pointers to memory locations in PMEM 210B, which may store copies of data stored in the HDD storage device 210D. Hash bucket B0-0 may include an LBA for a data block within the SSD that includes buffer cache 210C-1. In some embodiments, the LBA provided by the hash table 214 for the data block within the SSD that includes buffer cache 210C-1 may be a translation or mapping from the LBA provided with the request to an internal LBA within the SSD (e.g., an LBA within the range of 0 to the capacity of the SSD). In the example shown in FIG. 2, the LBA sent by the database server 201 may correspond to bucket B0-0 as indicated by the diagonal shading.
If the hash table lookup operation determines that the requested data is stored in buffer cache 210C-1 (e.g., a cache hit), the CPU at the storage server 202 may perform operation (4-1), which may read the requested data block 226-1, indicated by the diagonal shading, at the LBA determined by the lookup operation. However, if the hash table lookup operation determines that the requested data is not stored in PMEM 210B or buffer cache 210C-1 (e.g., a cache miss), the CPU at the storage server 202 may perform operation (4-2), which may read the requested data block 226-2, indicated by the diagonal shading, from the HDD storage device 210D at the LBA provided with the request or at an LBA converted to an internal LBA within the HDD storage device 210D (e.g., an LBA in the range of 0 to the capacity of HDD 210D).
At operation (5), the CPU at storage server 202 may send the requested data 226 (e.g., data block 226-1 from buffer cache 210C-1 or data block 226-2 from HDD storage device 210D) to database server 201 over the network channel, e.g., again using the network IO stack, as indicated by arrow 230. Database server 201 may wait for a response with the requested data to arrive through the network channel, for example, upon completion of network socket read operation (6).
The delay from sending the request at operation (1) to receiving the requested data at the completion of operation (6) may be indicated as time T1. The delay T1 may depend on various factors, such as the delay in sending the request, the processing time as the request proceeds up through the network IO stack and the response proceeds back down through the network IO stack, the processing time for the CPU at the storage server 202 to perform the hash table lookup operation, the delay or delays in reading the data from one or more caches (e.g., PMEM 210B or buffer cache 210C-1), the delay or delays in reading the data from the HDD storage device 210D, etc. Depending on implementation details, the delay T1 may be relatively long, which may be particularly disadvantageous, for example, when accessing relatively small data payloads.
Fig. 3 illustrates an embodiment of a scheme and/or memory access scheme for accessing data at a storage node using location information, according to a disclosed example embodiment. The embodiment shown in fig. 3 may include one or more elements similar to the embodiment shown in fig. 1 and 2, wherein similar elements may be referred to by reference numerals ending in and/or containing the same numbers, letters, etc.
The embodiment shown in fig. 3 may include a storage node 302 and a user node 301. Storage node 302 and user node 301 may communicate using communication interfaces 306 and 307, respectively, and communication interfaces 306 and 307 may implement any type of communication interface including the above-mentioned interconnect and/or network interfaces, protocols, and the like.
The storage node 302 may include a first interface 308X configured to enable the storage node 302 to communicate with a first type of memory medium 310X and a second interface 308Y configured to enable the storage node 302 to communicate with a second type of memory medium 310Y.
In some embodiments, storage node 302 may configure and/or operate at least a portion of one of memory media 310X or 310Y as a cache for at least a portion of the other memory media. For example, in some embodiments, the second memory medium 310Y may be implemented with an HDD configured as a primary storage device and the first memory medium 310X may be implemented with a buffer cache (such as an SSD flash cache and/or a DRAM cache within an SSD). At least a portion of the first memory medium 310X may be configured as a cache to store copies of data stored in the primary storage device 310Y.
Storage node 302 may include a data structure (e.g., hash table, list, tree structure, etc.) 314 that may store location information 315 to track one or more locations of one or more copies of data stored in a cache portion of first memory medium 310X. For example, the location information 315 may indicate a location within the first memory medium 310X that may store a cached copy of the data stored in the primary storage device 310Y. The data structure 314 may be located anywhere in, for example, internal memory (e.g., cache memory) within the CPU and/or any memory medium 310, such as the first memory medium 310X.
Storage node 302 may also include transmit logic 332, and transmit logic 332 may transmit location information 315 for data stored at storage node 302 from storage node 302 to user node 301 as indicated by arrow 334. The transmit logic 332 may transmit the location information 315 using any communication scheme, such as network transfer using a network IO stack, a memory access scheme (e.g., RDMA), etc., as indicated by arrow 344.
Location information 315 may enable user node 301 to determine one or more locations (e.g., one or more cache locations) where data may be stored at storage node 302. Using location information 315, user node 301 may access data in a manner that may reduce latency, improve bandwidth, etc., depending on implementation details. For example, user node 301 may access data from a cache (e.g., memory medium 310X) using a memory access scheme (such as RDMA) that may bypass some or all of the IO stacks according to implementation details.
Storage node 302 may also include memory access logic 336, which memory access logic 336 may enable user node 301 to access data (e.g., data in a cache (such as a cache portion of memory medium 310X)) at storage node 302 as indicated by arrow 338. In some embodiments, memory access logic 336 may enable user node 301 to access data at storage node 302 in a relatively straightforward manner. For example, the memory access logic 336 may include hardware (e.g., a memory access controller) to which a processor (such as a CPU or CPU core) at the storage node 302 may offload data transfer operations. Depending on implementation details, this may enable the processor to perform one or more other operations in parallel (e.g., overlapping) with the data transfer performed by the memory access logic 336.
As another example, memory access logic 336 may include hardware that may provide a data path (e.g., a pass-through data path) as shown by arrows 338, 340, and/or 342, which may bypass some or all of a processor (such as a CPU or CPU core) and/or an IO stack (e.g., a network IO stack) as shown by arrows 338, 340, and/or 342. In some embodiments, memory access logic 336 may also be used by transmit logic 332 to transmit location information 315 to user node 301 as indicated by arrow 344.
As a further example, the memory access logic 336 may be implemented at least in part in software (e.g., at a CPU or CPU core) that may enable a processor to bypass at least a portion of an IO stack (e.g., a network IO stack) or one or more other software components (such as services, processes, cores, drivers, etc.), which may increase latency and/or reduce bandwidth of data transfer.
Examples of the memory access logic 336 may include, for example, a Direct Memory Access (DMA) controller with a bridge to a network and/or interconnect interface. Another example of the memory access logic 336 may include a remote memory access controller (e.g., an RDMA controller) that may use an underlying transport scheme such as Ethernet, RDMA over Converged Ethernet (RoCE), InfiniBand, iWARP, Fibre Channel, etc. In some embodiments, the memory access logic 336 may implement a protocol (such as NVMe-oF) that may use an underlying transport (such as RoCE, InfiniBand, etc.).
User node 301 may include location determination logic 346, and location determination logic 346 may enable user node 301 to determine one or more locations from which to access data within storage node 302 based on location information 315. For example, user node 301 may request location information 315 for data previously stored by user node 301 at storage node 302. The storage node 302 may respond by sending location information 315, and the location determination logic 346 may determine that the location information 315 indicates that a copy of the requested data may be located in a cache portion of the first memory medium 310X.
User node 301 may also include memory access logic 348, which memory access logic 348 may enable user node 301 to access data at storage node 302 in a relatively straightforward manner. In some embodiments, memory access logic 348 may implement one or more protocols, interfaces, etc. that may work in coordination with memory access logic 336 at storage node 302. For example, memory access logic 348 at user node 301 and memory access logic 336 at storage node 302 may implement an RDMA scheme in which user node 301 is operable as an initiator and storage node 302 is operable as a target to transfer data from storage node 302 to user node 301.
Although the location determination logic 346 and the memory access logic 348 at the user node 301 may have separate functions, in some embodiments, the memory access logic 348 may be used to retrieve data from a location at the storage node 302 that may be determined by the location determination logic 346. For example, as described above, the location determination logic 346 may receive location information 315, the location information 315 indicating that a copy of the data stored at the storage node 302 may be located in a cache portion of the first memory medium 310X. The location may be indicated, for example, by a memory address, LBA, device identifier, etc. The location determination logic 346 may send the location information 315 or a version that has been processed, interpreted, etc., to the memory access logic 348 as shown by arrow 350. The memory access logic 348 may use the location information 315 to access (e.g., read) data from the cache portion of the first memory medium 310X. For example, memory access logic 348 at user node 301 (operating as initiator) may initiate data transfers with memory access logic 336 at storage node 302 (operating as target) using RDMA, NVMe-oF, or the like.
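The division of labor between location determination logic and memory access logic at the user node might be sketched as follows. The transport calls rdma_read and nvmeof_read are hypothetical stubs standing in for whatever RDMA or NVMe-oF machinery a given implementation provides; only the dispatch based on the location information is the point of the example.
```python
# Illustrative user-node dispatch: interpret location information, then hand the
# transfer to the appropriate memory access path (RDMA for memory-resident copies,
# NVMe-oF for SSD-cached copies). The transport functions are placeholders only.


def rdma_read(remote_address: int, length: int) -> bytes:
    """Hypothetical stand-in for an RDMA read of `length` bytes at `remote_address`."""
    raise NotImplementedError("replace with the platform's RDMA machinery")


def nvmeof_read(internal_lba: int, num_blocks: int) -> bytes:
    """Hypothetical stand-in for an NVMe-oF read of cached blocks on the remote SSD."""
    raise NotImplementedError("replace with the platform's NVMe-oF machinery")


def fetch_cached_data(location: dict, length: int) -> bytes:
    """Location determination + memory access, combined into one client-side helper."""
    medium = location["medium"]
    if medium in ("dram", "pmem"):
        # memory-mapped copy: read it directly with a (remote) memory access
        return rdma_read(location["memory_address"], length)
    if medium == "ssd":
        # SSD-cached copy: issue a block read at the SSD's internal LBA
        blocks = (length + 4095) // 4096          # assuming 4 KiB blocks
        return nvmeof_read(location["internal_lba"], blocks)
    raise ValueError(f"no cached copy advertised for medium {medium!r}")
```
In a real system, the stubs would be replaced by the memory access logic described above (e.g., an RDMA or NVMe-oF initiator), with the storage node operating as the target.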
According to disclosed example embodiments, memory access logic 348 at user node 301 and memory access logic 336 at storage node 302 may be configured to initiate the transfer of data in various ways. For example, in some embodiments, memory access logic 348 at user node 301 may send a memory access request to memory access logic 336 at storage node 302 in the form of a command, a command packet, a message, an instruction, and/or any other type of indication that user node 301 may be requesting to read data from storage node 302.
In some embodiments where memory access logic 348 and 336 may implement RDMA and/or NVMe-orf schemes, memory access logic 348 at user node 301 may be configured as an initiator and memory access logic 336 at storage node 302 may be configured as a target. An initiator (which may also be referred to as a client) may issue a read request that may include a destination memory address in its local memory. The target (which may also be referred to as a server) may respond by retrieving the requested data from one or more locations at the storage node 302 and writing the requested data (e.g., directly) into the initiator's memory at the destination memory address.
In some embodiments implemented with RDMA and/or NVMe-oF, the configuration oF the memory access logic may be reversed such that the memory access logic 348 at the user node 301 may be configured as a target and the memory access logic 336 at the storage node 302 may be configured as an initiator. In such embodiments, the user node 301 may send a command, message, and/or any other indication to the storage node 302 to request the memory access logic 336 to initiate RDMA and/or NVMe-oF transfers.
Depending on implementation details, an embodiment of a scheme for accessing data as shown in fig. 3 may reduce latency, increase bandwidth, etc. For example, a data path (e.g., a pass-through data path) using memory access logic 336 as shown by arrows 338 and 340 may bypass a CPU, CPU core, IO stack, one or more processes, services, cores, drivers, etc. at storage node 302. Depending on implementation details, memory access logic 348 at user node 301 may similarly implement a data path (e.g., pass-through data path) that may bypass a CPU, CPU core, IO stack, one or more processes, services, cores, drivers, etc. at user node 301.
As with the memory access logic 336 at the storage node 302, the memory access logic 348 at the user node 301 may be implemented in hardware, software, or a combination thereof, which may enable a processor (such as a CPU or CPU core) at the user node 301 to offload data transfer operations to the memory access logic 336 (e.g., to enable the processor at the user node 301 to perform one or more other operations in parallel (e.g., overlapping) with the data transfer performed by the memory access logic 348), providing a data path (e.g., a pass-through data path) that may bypass some or all of the processor (such as the CPU or CPU core), services, processes, cores, IO stacks (e.g., network IO stacks), etc.
As with storage node 302, user node 301 is not limited to any particular physical form. For example, user node 301 may be implemented in whole or in part with and/or used in combination with one or more personal computers, tablet computers, smart phones, servers, server enclosures, server racks, data rooms, data centers, edge data centers, mobile edge data centers, and/or any combination thereof.
FIG. 4 illustrates an embodiment of a scheme for accessing data at a storage node using a mirrored data structure in accordance with a disclosed example embodiment. The embodiment shown in fig. 4 may include one or more elements similar to the embodiments shown in fig. 1,2, and/or 3, wherein similar elements may be referred to by reference numerals ending in the same number, letter, etc. and/or containing the same number, letter, etc.
In some aspects, the embodiment shown in FIG. 4 may be similar to the embodiment shown in FIG. 3. However, in the embodiment shown in FIG. 4, storage node 402 may include update logic 452 that may maintain a copy of at least a portion of the data structure 414 at the user node 401. For example, the update logic 452 may initially send a copy of all or some of the data structure 414, including the location information 415, to the user node 401 as indicated by arrow 435, where it may be stored as location information 415a in a data structure (e.g., mirrored data structure) 414a, e.g., as part of a process for starting, resetting, initializing, etc.
As the storage node 402 adds, removes, rewrites, flushes, invalidates, etc. cache entries in one or more caches, the storage node 402 may update entries in the data structure 414 to reflect the changes in cache contents. For example, the update logic 452 may update a corresponding entry in the mirrored data structure 414a at the user node 401 in response to the data structure 414 at the storage node 402 being updated. Thus, in some embodiments, the update logic 452 may initiate (e.g., cause) an update (e.g., modification) of one or more entries in the mirrored data structure 414a.
Update logic 452 may update corresponding entries in mirrored data structure 414a at any time (e.g., while data structure 414 is updated, immediately or relatively soon after data structure 414 is updated, at some later time (e.g., as part of a background process), and/or at any other time).
Maintaining the mirrored data structure 414a at the user node 401 may reduce latency, increase bandwidth, etc., depending on implementation details. For example, to request data from the storage node 402, location determination logic 446a at the user node 401 may perform a lookup operation on the mirrored data structure 414a to determine whether location information 415 for the requested data is present in the mirrored data structure 414a. If the location information 415 for the requested data is present in the mirrored data structure 414a, the user node 401 may use the location information 415 (e.g., using the memory access logic 448 as indicated by arrows 449 and 450) to read the data from the storage node 402 without first requesting the location information 415 from the storage node 402. Thus, the total delay in reading data from the storage node 402 may be reduced by, for example, the amount of time involved in requesting the location information 415 from the storage node 402.
In some embodiments, the mirror data structure 414a may be initially empty, and entries may be added to the mirror data structure 414a and/or updated at the mirror data structure 414a based on demand (e.g., when the user node 401 sends an access request to the storage node 402). In such embodiments, instead of updating (e.g., sending an entry update to) the mirror data structure 414a based on an update to the data structure 414 at the storage node 402, the update logic 452 may use a flag or other mechanism to inform the user node 401 to invalidate one or more entries in the mirror data structure 414a, for example, if an entry has been updated in the data structure 414 since the user node last accessed the data corresponding to the entry.
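A hypothetical sketch of a mirrored location table supporting both styles of maintenance (entry updates pushed by the storage node, and invalidation flags that force the user node to refetch location information) is shown below; the structure and method names are assumptions for illustration only.
```python
# Illustrative mirrored location table kept at the user node. The storage node can
# push full entry updates, or merely flag entries as stale so the user refetches them.

from typing import Dict, Optional


class MirroredLocationTable:
    def __init__(self):
        self._entries: Dict[int, dict] = {}   # LBA -> location record
        self._stale: set = set()              # LBAs invalidated by the storage node

    # --- called when the storage node pushes a full entry update ---
    def apply_update(self, lba: int, location: dict):
        self._entries[lba] = location
        self._stale.discard(lba)

    # --- called when the storage node only signals invalidation ---
    def invalidate(self, lba: int):
        self._stale.add(lba)

    # --- called by the user node before issuing a direct read ---
    def lookup(self, lba: int) -> Optional[dict]:
        if lba in self._stale or lba not in self._entries:
            return None   # entry missing or stale: fetch location info remotely
        return self._entries[lba]


if __name__ == "__main__":
    mirror = MirroredLocationTable()
    mirror.apply_update(100, {"medium": "pmem", "memory_address": 0x7F00_0000})
    print(mirror.lookup(100))   # usable locally, no round trip for location info
    mirror.invalidate(100)
    print(mirror.lookup(100))   # None: refetch from the storage node's table
```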
Fig. 5 illustrates an example embodiment of a storage node and method for using location information and/or a memory access scheme in accordance with the disclosed example embodiments. The storage node 502 shown in fig. 5 may include one or more elements similar to the embodiments shown in fig. 1,2, 3, and/or 4, where like elements may be referred to by ending with and/or including the same numbers, letters, etc. Storage node 502 may be used, for example, to implement one of the storage nodes shown in fig. 3 and/or 4, or with one of the storage nodes shown in fig. 3 and/or 4.
In some aspects, the storage node 502 shown in fig. 5 may be similar to the embodiment shown in fig. 1. However, the embodiment shown in FIG. 5 may include transmit and/or update logic 554, and transmit and/or update logic 554 may perform one or more operations similar to transmit logic 332 shown in FIG. 3 and/or update logic 452 shown in FIG. 4. Additionally or alternatively, the storage node 502 shown in fig. 5 may include memory access logic 536, and the memory access logic 536 may perform one or more operations similar to the memory access logic 336 and 436 shown in fig. 3 and 4, respectively.
The memory access logic 536 may be implemented, for example, with NVMe-oF, RDMA, etc., using an underlying network (such as Ethernet, RoCE, InfiniBand, iWARP, etc.). The memory access logic 536 is not limited to any physical configuration. However, in some example embodiments, the memory access logic 536 may be integrated with the communication interface 506. For example, the memory access logic 536 (e.g., NVMe-oF, RDMA, etc.) and the communication interface 506 (e.g., RoCE, InfiniBand, iWARP, etc.) may be integrated within a network adapter, which may also be referred to as a Network Interface Card (NIC) and/or a network interface controller (also referred to as a NIC). For purposes of illustration, multiple connections between the memory access logic 536 and the one or more interfaces 508 may be shown as a single bus 561, but any number and/or type of connections may be used.
In some embodiments, as shown by arrows 553, 555, 556, 557, 558, 559, and/or 560, memory access logic 536 may implement one or more relatively direct data paths between communication interface 506 and one or more of interface 508, memory medium 510, data structure 514 (which may be implemented, for example, with a hash table as shown in fig. 5), transmit and/or update logic 554, and the like. Depending on implementation details, one or more of the data paths 553, 555, 556, 557, 558, 559, and/or 560 may operate as a pass-through path, which may, for example, bypass some or all of the data paths through one or more of the I/O stack 512, the CPU 504, the one or more CPU cores 505, and the like.
The data paths 556, 557, 558, 559, and/or 560 are not limited to any particular operation. However, in one example embodiment, the data access operation may proceed as follows. A user (such as user node 301 shown in fig. 3) may read location information 515 of a cached copy of the data stored in HDD storage 510D from storage node 502. The user may read the location information 515 as indicated by arrows 553, 555, 556, and/or 557, e.g., using RDMA access, to access the transmit and/or update logic 554 and/or hash table 514 (which may be stored, e.g., at CPU 504, DRAM 510A, etc.). The user may select a portion (e.g., a hash bucket) of hash table 514 to read based on, for example, the LBAs of data stored in HDD storage device 510D.
The user may receive location information 515 in the form of, for example, a hash bucket, which may include information to identify the device (e.g., one of the memory media 510 if a cached copy of the requested data is stored at the memory media 510), a pointer to a memory address (e.g., if a cached copy of the requested data is stored at the DRAM 510A and/or the PMEM 510B), an LBA (e.g., if a cached copy of the requested data is stored at the SSD 510C, which may be internal, converted, and/or mapped to the LBA of the device), and so forth. The user may interpret the location information 515, for example, using location determination logic, such as location determination logic 346 shown in fig. 3.
Alternatively or additionally, the user may obtain the location information 515 of the cached copy of the data stored in the HDD storage 510D by performing a lookup on a data structure maintained at the user, such as, for example, the data structure 415a shown in fig. 4.
The user may use the location information 515 to access a cached copy of the requested data using the memory access logic 536. For example, if the location information 515 indicates that a cached copy of the requested data is stored in DRAM 510A and/or PMEM 510B, the user may initiate one or more RDMA reads to the cached portions of DRAM 510A and/or PMEM 510B using pass-through paths 557 and/or 558. As another example, if the location information 515 indicates that a cached copy of the requested data is stored in SSD 510C, the user may initiate one or more NVMe-oF reads to the cache portion of SSD 510C using pass-through path 559. Additionally or alternatively, the storage node 502 shown in FIG. 5 may include a bridge (such as an NVMe-oF-to-NVMe bridge) that may enable a storage device implemented with NVMe (e.g., using PCIe), such as SSD 510C, to communicate over pass-through path 559 using NVMe-oF through the memory access logic 536. One or more additional pass-through paths (such as pass-through path 560) may be used to access one or more additional memory media (such as HDD 510D).
Depending on implementation details, the apparatus and/or method shown in fig. 5 may reduce latency, increase bandwidth, etc. For example, one or more of the data paths 553, 555, 556, 557, 558, 559, and/or 560 shown in fig. 5 may provide more direct access to location information, cache data, etc., as compared to the request and/or data paths 117, 120, 121, and/or 122 shown in fig. 1, e.g., by bypassing one or more of the I/O stacks 512, CPUs 504, one or more CPU cores 505, etc. For purposes of illustration, data paths 553, 555, 556, 557, 558, 559, and/or 560 may be shown extending from one or more of memory media 510 to communication interface 506, but the data paths may be used in other directions, for example, to write data to one or more of memory media 510.
Fig. 6 illustrates an example embodiment of a method and apparatus for accessing data at a storage node using location information and/or a memory access scheme, according to an example embodiment of the disclosure. The embodiment shown in fig. 6 may be implemented, for example, using a storage node similar to the embodiment shown in fig. 5, wherein similar elements may be referred to by reference numerals ending in and/or containing the same numbers, letters, etc. The embodiment shown in fig. 6 may also include one or more elements similar to those shown in fig. 1,2, 3, and/or 4, where like elements may be referred to by ending with and/or including the same numbers, letters, etc.
Referring to FIG. 6, a storage node may be implemented with a storage server 602, and the storage server 602 may include a PMEM 610B and a buffer cache 610C-1. The buffer cache 610C-1 may be implemented, for example, using volatile memory (e.g., DRAM) that may be configured as a cache within an SSD (e.g., SSD 510C shown in fig. 5). Alternatively or additionally, buffer cache 610C-1 may be implemented with flash memory (e.g., as a flash cache) within an SSD (such as SSD 510C shown in FIG. 5).
The storage server 602 may also include a hash table 614 configured to track the location of data stored in PMEM 610B and/or buffer cache 610C-1, and either or both of PMEM 610B and/or buffer cache 610C-1 may be configured to cache data stored in, for example, an HDD storage device (not shown) at the storage server 602. The HDD storage device may be configured, for example, as a primary storage medium of the storage server 602.
The embodiment shown in fig. 6 may also comprise further nodes, which in this example may be implemented with a database server 601. However, in other embodiments, the additional nodes may be implemented as any type of user (such as additional storage nodes, client devices, servers, personal computers, tablet computers, smartphones, etc.). Database server 601 or other node may communicate with storage server 602 using any type of communication interface including the above-mentioned interconnections and/or network interfaces, protocols, etc. However, for purposes oF illustration, database server 601 and storage server 602 may be assumed to communicate using at least one or more networks that may support memory access schemes (such as RDMA and/or NVMe-oF).
In operation (1), the database server 601 may use a block identifier (block ID) of the data to be read from the storage server 602 (e.g., based on an LBA) to determine a bucket address in the hash table 614. If at least a portion of the hash table 614 is stored locally at the database server 601 as mirrored hash table 614a, the database server 601 may perform operation (2-1), in which the database server 601 may perform a lookup operation on the mirrored hash table 614a using the block ID to determine whether the locally stored portion of the mirrored hash table 614a includes a hash bucket (e.g., an entry) for the data to be read from the storage server 602, and if so, obtain the hash bucket.
However, if the mirrored hash table 614a is not stored locally at the database server 601, or if the locally stored portion of the mirrored hash table 614a does not include a hash bucket corresponding to the block ID, the database server 601 may perform operation (2-2), in which the database server 601 may read the hash bucket from the hash table 614 using the block ID as indicated by arrow 625. The database server 601 may read the hash bucket, for example, using an RDMA access of the hash table 614, which may be stored in DRAM, PMEM, etc. at the storage server 602.
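As an illustration of how a bucket address might be derived from a block ID so that only a small portion of the remote hash table is fetched, the following sketch hashes the block ID to a row and computes that row's byte offset. The row count, buckets per row (M=4, as in the example of FIG. 2), bucket size, and remote base address are all assumed values, not details of the present disclosure.
```python
# Illustrative bucket-address computation: hash the block ID to a row, then compute
# the byte offset of that row's buckets so that only those bytes are fetched (e.g.,
# with a single small RDMA read). All constants below are assumed values.

NUM_ROWS = 1 << 20                 # assumed number of hash table rows
BUCKETS_PER_ROW = 4                # M = 4, as in the illustrated example
BUCKET_SIZE_BYTES = 32             # assumed fixed size of one bucket on the wire
TABLE_BASE_ADDRESS = 0x1000_0000   # assumed remote address of the hash table


def bucket_row(block_id):
    """Map a block ID to a hash table row index."""
    return hash(block_id) % NUM_ROWS


def row_remote_offset(block_id):
    """Return (remote_address, length) covering every bucket of the selected row."""
    row = bucket_row(block_id)
    row_bytes = BUCKETS_PER_ROW * BUCKET_SIZE_BYTES
    return TABLE_BASE_ADDRESS + row * row_bytes, row_bytes


if __name__ == "__main__":
    address, length = row_remote_offset(block_id=12345)
    print(hex(address), length)   # fetch just these bytes instead of the whole table
```
Fetching only the addressed row keeps the remote read small compared to transferring the entire table.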
If database server 601 is unable to obtain the hash bucket corresponding to the block ID (e.g., from mirrored hash table 614a at database server 601 and/or from hash table 614 at storage server 602), database server 601 may determine that a cached copy of the data corresponding to the block ID is not stored in the cache at storage server 602. Thus, database server 601 may obtain data from HDD storage 610D at storage server 602 using the request described above with respect to fig. 1 (e.g., sent over a network IO stack).
However, if database server 601 is able to obtain the hash bucket (e.g., from mirrored hash table 614a at database server 601 and/or from hash table 614 at storage server 602), database server 601 may perform operation (3), in which the location information in the hash bucket may be processed (e.g., by parsing, interpreting, looking up, etc.) to determine the location of the cached copy of the data at storage server 602. For example, if the cached copy is stored in DRAM, PMEM 610B, etc. at storage server 602 (e.g., in bucket B0-1), the location may include a memory address (e.g., a pointer to the memory location). As another example, if the cached copy of the data is stored at buffer cache 610C-1 (e.g., in bucket B0-0), the location may include an LBA (or range of LBAs).
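The processing of operation (3) may be illustrated with the following sketch. The bucket field names ("medium", "mem_addr", "lba") are assumptions made for illustration, since the actual layout of the location information is implementation specific.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheLocation:
    medium: str                     # e.g., "pmem" (bucket B0-1) or "buffer_cache" (bucket B0-0)
    mem_addr: Optional[int] = None  # memory address / pointer for a DRAM or PMEM copy
    lba: Optional[int] = None       # LBA (or start of an LBA range) for a buffer cache copy

def parse_location(bucket: dict) -> CacheLocation:
    # Operation (3): interpret the location information carried in the hash bucket.
    medium = bucket["medium"]
    if medium == "pmem":
        return CacheLocation(medium=medium, mem_addr=bucket["mem_addr"])
    if medium == "buffer_cache":
        return CacheLocation(medium=medium, lba=bucket["lba"])
    raise ValueError(f"unknown cache medium: {medium}")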
Database server 601 may use the location of the cached copy of the data to read the data from storage server 602. For example, if a cached copy of the data corresponding to the block ID is stored at PMEM 610B, then at operation (4-1), database server 601 may read data 626-1 from PMEM 610B using, for example, an RDMA read as indicated by arrow 631. However, if a cached copy of the data corresponding to the block ID is stored at buffer cache 610C-1, then at operation (4-2), database server 601 may read data 626-2 from buffer cache 610C-1 using, for example, an NVMe-oF read as indicated by arrow 633.
At operation (5), database server 601 may perform a synchronous poll operation (Sync Poll), in which the memory access scheme may be polled (e.g., continuously, periodically, etc.) to determine whether the data read operation is complete (e.g., whether data 626 has been written to memory at database server 601). In some embodiments, a synchronous poll operation may be used because a relatively direct data read (such as an RDMA or NVMe-oF read) may not involve a request-response, and thus database server 601 may not receive a response (e.g., from the IO stack) indicating that the data transfer has completed.
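Operations (4-1), (4-2), and (5) may be sketched as follows. The rdma_read, nvmeof_read, and is_complete callables are placeholders for the memory access scheme and its completion check, not actual library APIs, and loc may be, for example, a CacheLocation as in the earlier sketch.

import time

def read_cached_data(loc, rdma_read, nvmeof_read, is_complete, timeout_s=1.0):
    # Operation (4-1): RDMA read from DRAM/PMEM using the memory address.
    if loc.medium == "pmem":
        handle = rdma_read(loc.mem_addr)
    # Operation (4-2): NVMe-oF read from the buffer cache using the LBA.
    else:
        handle = nvmeof_read(loc.lba)
    # Operation (5): synchronous poll, since a direct read may not produce a
    # request-response completion from the IO stack.
    deadline = time.monotonic() + timeout_s
    while not is_complete(handle):
        if time.monotonic() > deadline:
            raise TimeoutError("data read did not complete")
    return handle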
The delay from determining the bucket address at operation (1) to receiving the data at the completion of operation (5) may be indicated as time T2. Depending on implementation details, the embodiment shown in fig. 6 may reduce latency, increase bandwidth, etc., for example, by bypassing one or more of the I/O stacks, CPUs, CPU cores, etc. at database server 601 and/or storage server 602. The embodiment shown in fig. 6 is not limited to any particular implementation details. However, for comparison purposes, in some embodiments, the delay T1 shown in fig. 2 may generally be about 200 μs, while the delay T2 shown in fig. 6 may be several microseconds for RDMA reads from PMEM 610B and/or about 100 μs for NVMe-oF reads from buffer cache 610C-1.
Fig. 7 illustrates an example embodiment of a storage device according to an example embodiment of the disclosure. The storage device 710 shown in fig. 7 may be used to implement one or more of the memory media disclosed herein. For example, storage device 710 may be used to implement any of SSDs 110C, 210C, 510C, and/or 610C shown in figs. 1, 2, 5, and/or 6.
Referring to fig. 7, a storage device 710 may include a first memory medium 762 and a second memory medium 764. Although memory media 762 and 764 are not limited to any particular type of media, in some example embodiments, first memory medium 762 may be implemented with volatile and/or byte-addressable types of memory media (such as DRAM and/or PMEM), while second memory medium 764 may be implemented with non-volatile types of memory media (such as NAND flash memory) that may be addressed in pages, blocks, etc. For example, in an embodiment where storage device 710 is implemented with an SSD, first memory medium 762 may be implemented with DRAM and second memory medium 764 may be implemented with NAND flash memory as shown in FIG. 7.
For example, the storage device 710 may include a buffer cache 766, the buffer cache 766 may be used to store one or more cached copies of data or portions of data stored in the second memory medium 764 to provide access to the data with a lower latency than may be involved with reading the data from the second memory medium 764. Buffer cache 766 may be implemented with a write-back mechanism, a write-through mechanism, and/or any other type of cache mechanism, as indicated by arrow 768.
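As a rough illustration of the two cache mechanisms mentioned above, a toy model of buffer cache 766 might look like the following. The backing dictionary stands in for second memory medium 764; this is a simplified sketch, not the device's actual cache controller.

class BufferCacheModel:
    def __init__(self, backing: dict, write_through: bool = True):
        self.cache = {}                # models buffer cache 766
        self.dirty = set()
        self.backing = backing         # models second memory medium 764 (e.g., NAND)
        self.write_through = write_through

    def write(self, lba: int, block: bytes) -> None:
        self.cache[lba] = block
        if self.write_through:
            self.backing[lba] = block  # write-through: update the media immediately
        else:
            self.dirty.add(lba)        # write-back: defer until a later flush/eviction

    def read(self, lba: int) -> bytes:
        if lba in self.cache:          # hit: lower-latency path
            return self.cache[lba]
        return self.backing[lba]       # miss: read from the second memory medium

    def flush(self) -> None:
        for lba in self.dirty:
            self.backing[lba] = self.cache[lba]
        self.dirty.clear()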
Storage device 710 may include one or more communication interfaces 770 that may be implemented, for example, with any type of interconnect and/or network interface, protocol, etc. described herein, or a combination thereof. For example, in some embodiments, the communication interface 770 may be implemented with one or more network transmission schemes (such as Ethernet, RoCE, InfiniBand, etc.) that may support one or more protocols (such as RDMA, NVMe-oF, etc.). In some embodiments, the communication interface 770 may be implemented with an interconnect (such as PCIe) that may support an NVMe protocol. In such embodiments, an NVMe-to-NVMe-oF bridge may be included (e.g., in one or more communication interfaces 770 and/or at a storage node where storage device 710 may be located) to enable storage device 710 to transfer data using a memory access scheme (such as memory access logic 336, 436, and/or 536 described above).
In an SSD embodiment where the first memory medium 762 is implemented with DRAM and the second memory medium 764 is implemented with NAND flash, if the memory space (e.g., address space) of NAND 764 is greater than the memory space of DRAM cache 766, then a NAND random read may provide a relatively slow (e.g., slowest) access path, whose latency may be the sum of the general access latency of SSD 710 (e.g., protocol controller, Flash Translation Layer (FTL), etc.) plus the NAND latency. In such an embodiment, the average latency may be given by (DRAM latency × hit rate) + (NAND latency × (1 − hit rate)). If the requested data is located in the flash cache, the user node may read the data block from the flash cache using, for example, NVMe-oF.
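The average-latency expression above may be evaluated numerically as follows; the latency values in the example call are assumed figures for illustration, not measured values for any particular device.

def average_latency_us(dram_latency_us, nand_latency_us, hit_rate):
    # average latency = DRAM latency x hit rate + NAND latency x (1 - hit rate)
    return dram_latency_us * hit_rate + nand_latency_us * (1.0 - hit_rate)

# e.g., assuming 0.1 us DRAM latency, 80 us NAND latency, and a 90% hit rate:
print(average_latency_us(0.1, 80.0, 0.9))  # -> 8.09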
Fig. 8 illustrates an example embodiment of a node device that may be used to implement any of the node functions disclosed herein, according to an example embodiment of the disclosure. The node device 800 shown in fig. 8 may include a processor 802 (which may include a memory controller 804), a system memory 806, node control logic 808, and/or a communication interface 813. Any or all of the components shown in fig. 8 may communicate via one or more system buses 812. In some embodiments, one or more of the components shown in fig. 8 may be implemented using other components. For example, in some embodiments, the node control logic 808 may be implemented by the processor 802 executing instructions stored in the system memory 806 or other memory.
The node control logic 808 may be used to implement any of the node functions disclosed herein (e.g., one or more of the location determination logic 346 and/or 446, memory access logic 336, 436, and/or 536, transmit and/or update logic 332, 452, and/or 554, etc., described above with respect to figs. 3, 4, 5, and/or 6).
Fig. 9 illustrates an example embodiment of a storage device that may be used to implement any of the storage device functions disclosed herein, according to an example embodiment of the disclosure. The storage device 900 may include a device controller 902, a media translation layer 904, a storage medium 906, cache control logic 916, and a communication interface 910. The components shown in fig. 9 may communicate via one or more device buses 912. In some embodiments, where flash memory may be used for some or all of the storage medium 906, the media translation layer 904 may be implemented partially or fully as a flash translation layer (FTL).
The cache control logic 916 may be used to implement any of the storage device cache functions disclosed herein, such as one or more of buffer cache 610C-1 (e.g., a flash cache) and/or buffer cache 766 described above with respect to figs. 6 and/or 7.
Fig. 10 illustrates an embodiment of a method for accessing data from a storage node according to an example embodiment of the disclosure. The method may begin at operation 1002. At operation 1004, the method may receive data (e.g., data to be stored in a hard disk drive (such as HDD 510D shown in fig. 5)) at a first node. At operation 1006, the method may store at least a portion of the data in a cache at the first node. For example, at least a portion of the data may be stored in a cache located at DRAM 510A, PMEM 510B, and/or the SSD (or a DRAM cache therein) shown in fig. 5. At operation 1008, the method may transmit location information for the at least a portion of the data from the first node to a second node. For example, the transmit and/or update logic 554 shown in fig. 5 may transmit at least a portion of the location information 515 to another node using the communication interface 506. At operation 1010, the method may use a memory access scheme to transfer the at least a portion of the data from the cache to the second node based on the location information. For example, the memory access logic 536 shown in fig. 5 may transfer the data using RDMA, NVMe-oF, or the like. The method may end at operation 1012.
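From the first (storage) node's perspective, the method of fig. 10 may be sketched as follows. The send_location and memory_access_transfer callables are placeholders for the communication interface and the memory access scheme (e.g., RDMA or NVMe-oF), and the location-table layout is illustrative only.

def store_and_advertise(cache: dict, location_table: dict, data_id: int,
                        data: bytes, send_location) -> None:
    # Operations 1004-1006: receive the data and store at least a portion in a cache.
    cache[data_id] = data
    location_table[data_id] = {"medium": "cache", "key": data_id}
    # Operation 1008: transmit the location information to the second node.
    send_location(data_id, location_table[data_id])

def serve_read(cache: dict, location_info: dict, memory_access_transfer):
    # Operation 1010: transfer the cached portion to the second node using the
    # memory access scheme, based on the location information.
    return memory_access_transfer(cache[location_info["key"]])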
The embodiment shown in fig. 10, as well as the other embodiments described herein, includes example operations and/or components. In some embodiments, some operations and/or components may be omitted, and/or other operations and/or components may be included. Furthermore, in some embodiments, the temporal and/or spatial order of the operations and/or components may be varied. Although some components and/or operations may be shown as separate components, in some embodiments, some components and/or operations shown separately may be integrated into a single component and/or operation, and/or some components and/or operations shown as a single component and/or operation may be implemented with multiple components and/or operations.
Any of the storage devices disclosed herein, including devices 110, 210, 310, 410, 510, 610, and/or 710, may be implemented in any form factor (such as 3.5 inch, 2.5 inch, 1.8 inch, M.2, Enterprise and Data Center Standard Form Factor (EDSFF), NF1, etc.) using any connector configuration (such as Serial ATA (SATA), Small Computer System Interface (SCSI), Serial Attached SCSI (SAS), U.2, etc.). Any of the storage devices disclosed herein may be implemented in whole or in part with, and/or used in combination with, a server chassis, a server rack, a data room, a data center, an edge data center, a mobile edge data center, and/or any combination thereof.
Any of the functions described herein, including any function that may be implemented with a node, a storage device, etc., or a combination thereof (including, for example, the location determination logic 346 and/or 446, memory access logic 336, 436, and/or 536, transmit and/or update logic 332, 452, and/or 554, etc., described above with respect to figs. 3, 4, 5, and/or 6, and any function described with respect to the embodiments shown in figs. 8 and/or 9), may be implemented with hardware, software, or firmware executing instructions stored in any type of memory, or any combination thereof, including, for example, combinational logic, sequential logic, one or more timers, counters, registers, and/or state machines, volatile memory (such as SRAM), non-volatile memory (such as flash memory), persistent memory (such as cross-gridded non-volatile memory, PCM, and/or other memory with resistance change), CPLDs, FPGAs, ASICs, CPUs (including, for example, CISC and/or RISC processors), data processing units (DPUs), etc., or any combination thereof, as described herein. In some embodiments, one or more components may be implemented as a system on a chip (SOC).
Although embodiments disclosed herein are not limited to any particular application, one or more embodiments of a scheme for accessing data at a storage node may be beneficial, for example, for databases that may be accessed from hard drives configured with one or more (e.g., multiple tiered) caches during a data retrieval process. Such embodiments may include a database server and a storage server. Some embodiments may include one or more (e.g., many) servers in a rack (e.g., 10 servers in each rack). One or more database servers may process user queries and/or may analyze and/or process requests. In order for a user to access data, the user may first access a storage server. The data may be stored on, for example, flash memory, hard disk, etc., and the storage server may provide the data as needed. Different components in the storage server may provide data storage with different latencies.
Some embodiments may be used to implement data prefetching, for example, using a memory access scheme (such as RDMA, NVMe-oF, etc.) to provide low-latency data prefetching for database operations.
In some embodiments, the system may use RDMA (e.g., with RoCE transport) to access the data to reduce PMEM latency. Some embodiments may include a database server and a storage server. Such a system may maintain various types of memory (e.g., PMEM, flash cache, etc.). For example, a predetermined size of memory (e.g., an 8K block) may be used to store data in DRAM, flash memory, and/or PMEM in the system. If a predetermined size of memory (e.g., an 8K block) is available to store the data in DRAM, the data may be read directly into local memory in the database server using, for example, RDMA. Some embodiments may implement two RDMA operations to access data. For example, a first operation may read a hash table on a storage server to determine which bucket to use. The system may load the bucket to a database server, which may examine the bucket to determine whether the requested data block is cached and determine the type of memory (e.g., PMEM, flash cache, etc.) that stores the data block. Thus, the first operation may obtain metadata information (e.g., an address of a data block), and the second operation may use RDMA to read the actual data from DRAM, persistent memory, or the like.
Some embodiments may implement one or more techniques to reduce or minimize latency when accessing memory and/or storage devices, e.g., at a storage node. For example, some embodiments may include one RDMA operation and one NVMe-oF operation to access data. The first operation may include reading a hash bucket from a hash table on the storage server using an RDMA read operation to determine which data block to read. The system may load the hash bucket to a database server, which may examine the bucket contents to see whether the data block is cached and determine the type of memory (e.g., persistent memory, flash memory, DRAM, etc.) in which the data block is stored. Thus, the first operation may obtain metadata information (e.g., an address of a data block). If the data is stored in NVMe storage, the second operation may include an NVMe-oF read to retrieve the actual data block from the flash cache.
Some embodiments disclosed above have been described in the context of various implementation details, but the principles of the present disclosure are not limited to these or any other specific details. For example, some functions have been described as being implemented by particular components, but in other embodiments, the functions may be distributed among different systems and components in different locations and having various user interfaces. Particular embodiments have been described as having particular processes, operations, etc., but these terms also encompass embodiments in which a particular process, operation, etc. may be implemented with multiple processes, operations, etc., or embodiments in which multiple processes, operations, etc. may be integrated into a single process, operation, etc. A reference to a component or element may refer to only a portion of the component or element. For example, a reference to a block may refer to the entire block or one or more sub-blocks. Unless otherwise clear from context, terms such as "first" and "second" in the present disclosure and claims may be used solely to distinguish the elements they modify and may not indicate any spatial or temporal order. In some embodiments, a reference to an element may refer to at least a portion of the element; for example, "based on" may refer to "based at least in part on," etc. A reference to a first element may not imply the existence of a second element. The principles disclosed herein have independent utility and may be embodied individually, and not every embodiment may utilize every principle. However, the principles may also be embodied in various combinations, some of which may amplify the benefits of the individual principles in a synergistic manner. The various details and embodiments described above may be combined to produce additional embodiments according to the inventive principles of this patent disclosure.
Since the inventive principles of this patent disclosure may be modified in arrangement and detail without departing from the inventive concepts, such changes and modifications are considered to be within the scope of the appended claims.

Claims (20)

1. An apparatus for accessing data, comprising:
A storage node, comprising:
a first interface for communicating with a first memory medium;
A second interface for communicating with a second memory medium; and
At least one control circuit configured to:
transmitting location information of data stored in the first memory medium from the storage node; and
The data is transferred from the storage node using a memory access scheme.
2. The device of claim 1, wherein the at least one control circuit is further configured to: at least a portion of the first memory medium is operated as a cache for at least a portion of the second memory medium.
3. The device of claim 1, wherein the at least one control circuit is further configured to: the location information is transmitted using a memory access scheme.
4. The device of claim 1, wherein the at least one control circuit is further configured to:
Receiving a request for the location information; and
Sending the location information based on the request.
5. The device of claim 1, wherein the at least one control circuit is further configured to:
updating the location information to generate updated location information; and
And performing transmission of the updated location information from the storage node.
6. The apparatus of claim 5, wherein the transmission of the updated location information is caused by a storage node.
7. The device of any of claims 1-6, wherein the at least one control circuit is further configured to:
receiving a request to transmit the data; and
Based on the request, the data is transferred from the storage node using a memory access scheme.
8. The device of claim 7, wherein the request to transmit the data comprises a command.
9. The apparatus of any one of claims 1 to 6, wherein:
The storage node includes a network adapter; and
The network adapter includes at least a portion of a memory access scheme.
10. An apparatus for accessing data, comprising:
A node comprising at least one control circuit configured to:
transmitting data from the node;
receiving location information of the data at the node; and
Based on the location information, a memory access scheme is used to communicate the data to the node.
11. The device of claim 10, wherein the location information identifies a memory medium.
12. The device of claim 11, wherein the location information identifies a location within a memory medium.
13. The device of claim 10, wherein the location information identifies a cache for the data.
14. The device of claim 10, wherein the at least one control circuit is further configured to:
transmitting a request for the location information from the node; and
The location information is received at the node based on the request.
15. The device of claim 10, wherein the at least one control circuit is further configured to: store a data structure comprising the location information.
16. The device of claim 15, wherein the at least one control circuit is further configured to:
receiving updated location information at the node; and
The data structure is modified based on the updated location information.
17. The apparatus of any one of claims 10 to 16, wherein:
the node comprises a network adapter; and
The network adapter includes at least a portion of a memory access scheme.
18. The device of any of claims 10 to 16, wherein the at least one control circuit is further configured to: the data is communicated to the node based on a request for a memory access scheme.
19. A method for accessing data, the method comprising:
Receiving data at a first node;
Storing at least a portion of the data in a cache at a first node;
Transmitting location information of the at least a portion of the data from the first node to the second node; and
Based on the location information, the at least a portion of the data is transferred from the cache to a second node using a memory access scheme.
20. The method of claim 19, wherein the step of transmitting the location information is performed using a memory access scheme.
CN202311813648.6A 2022-12-27 2023-12-26 Apparatus and method for accessing data at a storage node Pending CN118259829A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US63/435,545 2022-12-27
US18/375,449 2023-09-29
US18/375,449 US20240211406A1 (en) 2022-12-27 2023-09-29 Systems, methods, and apparatus for accessing data from memory or storage at a storage node

Publications (1)

Publication Number Publication Date
CN118259829A (zh) 2024-06-28

Family

ID=91603192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311813648.6A Pending CN118259829A (en) 2022-12-27 2023-12-26 Apparatus and method for accessing data at a storage node

Country Status (1)

Country Link
CN (1) CN118259829A (en)


Legal Events

Date Code Title Description
PB01 Publication