WO2021218038A1 - Storage system, memory management method, and management node - Google Patents

Storage system, memory management method, and management node

Info

Publication number
WO2021218038A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
data
storage
address
node
Prior art date
Application number
PCT/CN2020/119857
Other languages
English (en)
French (fr)
Inventor
崔文林
黄克骥
张鹏
罗四维
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to JP2021569533A (JP7482905B2)
Priority to EP20933906.8A (EP3958107A4)
Priority to US17/510,388 (US11861204B2)
Publication of WO2021218038A1
Priority to US18/527,353 (US20240094936A1)

Classifications

    • G06F3/0604 Improving or facilitating administration, e.g. storage management
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F3/061 Improving I/O performance
    • G06F3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G06F3/0647 Migration mechanisms
    • G06F3/0649 Lifecycle management
    • G06F3/0673 Single storage device
    • G06F3/0685 Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays
    • G06F2212/60 Details of cache memory
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of storage technology, in particular to a storage system, a memory management method, and a management node.
  • the first aspect of the present application provides a storage system, which includes a management node, one or more first memories, and one or more second memories.
  • the management node is used to create a memory pool to provide data storage services.
  • the performance of the first memory is higher than that of the second memory, wherein at least one of the one or more first memories is located at the first storage node, and at least one of the one or more second memories is located at the second storage node.
  • the management node is further configured to control the migration of the data between the first memory and the second memory in the memory pool.
  • the memory pool provided by the first aspect covers at least the following situations: 1) the first storage node contains both a first memory and a second memory, the second storage node also contains both a first memory and a second memory, and all of these memories are part of the memory pool; 2) the first storage node contains only a first memory, the second storage node contains only a second memory, and both memories are part of the memory pool; 3) the first storage node contains a first memory and a second memory, the second storage node contains only one of the first memory or the second memory, and all of these memories are part of the memory pool.
  • the storage system may also include other storage nodes, and various types of memories included in the other storage nodes may also provide storage space for the memory pool.
  • the first memory and the second memory differ in performance because they are of different types. Generally speaking, the performance of the first memory is higher than that of the second memory, where performance refers to the operation speed and/or access latency of the memory.
  • the first memory is a dynamic random access memory
  • the second memory is a storage class memory.
  • the first memory and the second memory in the first aspect are distinguished by type. For example, a dynamic random access memory is called the first memory regardless of whether it is located in the first storage node or the second storage node. Likewise, a storage class memory is called the second memory regardless of whether it is located in the first storage node or the second storage node.
  • the memory pool may also include a third memory, a fourth memory, and so on.
  • the memory pool is created from memories of multiple performance levels located on different storage nodes, thereby realizing a cross-node memory pool that integrates memories of different performance. Various types of memory (whether internal memory or hard disk) can thus serve as storage resources that provide storage services to the upper layer, making better use of their performance advantages. Since the memory pool contains memories of different performance, the migration of data between these memories can be controlled according to the data's access frequency: frequently accessed data can be migrated to a higher-performance memory to improve read efficiency, and rarely accessed data can be migrated to a lower-performance memory to save the space of the higher-performance memory. In addition, the memory pool in this application provides storage space for computing nodes or LUNs, which changes the processor-centric architecture of memory resources.
  • the management node is also used to obtain state information of the memory, and the state information includes the type and capacity of the first memory and the type and capacity of the second memory.
  • the capacity of the memory pool depends on the capacity of the memory provided by each storage node, and the type of memory contained in the memory pool depends on the type of memory provided by each storage node. Therefore, the management node needs to collect memory state information from each storage node before creating the memory pool.
  • the memory pool can be expanded when the available space is insufficient or when a new storage node is discovered by the storage system. During the expansion, the management node also needs to collect memory state information from the new storage node.
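  • As an illustration of how a pool might be assembled from the collected state information, the following Go sketch aggregates per-node reports into per-type capacity tiers; the type names, fields, and tier labels are assumptions made for exposition, not part of the claimed method.

```go
// Minimal sketch: building a tiered view of the memory pool from the state
// information reported by each storage node. All identifiers are illustrative.
package main

import "fmt"

// MemoryState is one memory's report sent by a storage node (type + capacity).
type MemoryState struct {
	NodeID   string
	Type     string // e.g. "DRAM", "SCM", "HDD"
	Capacity uint64 // bytes
}

// MemoryPool aggregates the reported capacity per memory type (one tier per type).
type MemoryPool struct {
	TierCapacity map[string]uint64
}

// CreatePool is what a management node might do after collecting state info.
func CreatePool(states []MemoryState) *MemoryPool {
	pool := &MemoryPool{TierCapacity: make(map[string]uint64)}
	for _, s := range states {
		pool.TierCapacity[s.Type] += s.Capacity
	}
	return pool
}

func main() {
	reports := []MemoryState{
		{NodeID: "node-a", Type: "DRAM", Capacity: 64 << 30},
		{NodeID: "node-a", Type: "SCM", Capacity: 256 << 30},
		{NodeID: "node-b", Type: "SCM", Capacity: 256 << 30},
	}
	fmt.Println(CreatePool(reports).TierCapacity)
}
```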
  • the storage space of the memory pool includes several pages.
  • the location of a page in the memory pool is called a global address.
  • the global address is mapped to the physical address of the page.
  • the physical address is used to indicate the location, in a memory of a storage node, of the physical space allocated for the page.
  • the size of the page is 4KB, or 8KB, or 16KB and so on.
  • the page size can be a fixed size, or it can be configured into different sizes according to requirements. Memory pools with pages of different sizes are more flexible when used.
  • both the first storage node and the second storage node store an index table, and the index table is used to record the mapping relationship between the global address of the page and the physical address of the page.
  • physical addresses can be allocated to the global addresses of certain pages in advance, and the mapping relationship between the global addresses and the physical addresses is recorded in the index table. When the storage node then receives a data write request for these pages, it can query the physical address directly from the index table and write the data to be written into the physical space indicated by that physical address.
  • with this pre-allocation method, data can be written directly into the pre-allocated physical space when the data write request is executed, which improves write efficiency.
  • alternatively, physical addresses may not be allocated to the pages in advance; instead, when the storage node receives a data write request for these pages, it allocates physical space from the memory, writes the data to be written into that space, and records the mapping relationship between the page's global address and the physical address in the index table.
  • this on-demand allocation method allocates physical space more flexibly, providing only as much as is needed and thereby saving space.
  • when any storage node updates the index table, it can send the updated content to the other storage nodes and the management node.
  • each storage node thus holds a complete index table, which avoids information silos: because any storage node may receive a data read request from a computing node, it can query the index table for the physical address of the data to be read and return the correct data. One way such a table might be represented is sketched below.
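```go
// Illustrative index table kept on every storage node: global address -> physical
// location (node, memory device, offset). Synchronization to other nodes is only
// hinted at; a real system would broadcast updates over the network.
package indextable

import "sync"

// PhysicalAddress identifies where a page actually lives.
type PhysicalAddress struct {
	NodeID string // which storage node
	Device string // which memory in that node, e.g. "DRAM0", "SCM1", "HDD3"
	Offset uint64 // byte offset inside that memory
}

// IndexTable maps a page's global address to its physical address.
type IndexTable struct {
	mu      sync.RWMutex
	entries map[uint64]PhysicalAddress
}

func New() *IndexTable {
	return &IndexTable{entries: make(map[uint64]PhysicalAddress)}
}

// Lookup reports whether physical space has been allocated for the global address.
func (t *IndexTable) Lookup(global uint64) (PhysicalAddress, bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	pa, ok := t.entries[global]
	return pa, ok
}

// Update records a new mapping; the caller would then send the updated entry
// to the other storage nodes and the management node.
func (t *IndexTable) Update(global uint64, pa PhysicalAddress) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.entries[global] = pa
}
```

  • With pre-allocation, the Update call above happens before any write request arrives; with on-demand allocation, it happens while the first write to the page is being served.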
  • this application provides at least three implementations regarding the execution of the data write request.
  • the first storage node includes an IO controller; the IO controller stores the index table and communicates with the computing node.
  • the IO controller is configured to: receive the first data sent by the computing node and the first logical address of the first data; determine, according to the first logical address, the first global address of the first data in the memory pool; determine, according to the index table, whether physical space has been allocated for the first global address; and, when physical space has been allocated for the first global address, write the first data into the physical space indicated by the first physical address corresponding to the first global address.
  • LUN semantics are used for communication between the first storage node and the computing node, and the first logical address refers to the LUN ID, LBA, and length.
  • the IO controller determines the first global address of the first data in the memory pool according to the first logical address, and determines the first physical address corresponding to the first global address according to the index table.
  • the first data is read from the physical space indicated by the first physical address.
  • the first storage node and the computing node adopt memory semantics for communication.
  • the first logical address refers to the ID, start address, and length of the virtual space.
  • the execution process of a write data request is the same as the implementation described above, and the same is true for the execution process of a read data request.
  • each storage node in the storage system maps its local part of the memory pool to the computing node, so that the computing node can "see" the memory pool and obtain the global addresses of the memory pool. The data write request sent by the computing node to the first storage node can then directly carry the global address, and the IO controller in the first storage node determines the physical address corresponding to the global address according to the index table and writes the data into the physical space indicated by that physical address.
  • when the computing node sends a read data request to the first storage node, the request likewise carries the global address of the data to be read, and the IO controller can determine the corresponding physical address directly from the index table. This implementation improves both write and read efficiency because the step of converting a logical address into a global address is eliminated.
  • when controlling data migration between the first memory and the second memory in the memory pool, the management node is specifically configured to instruct each storage node to monitor the access frequency of its local data, and to instruct the storage node to migrate data to a higher-performance memory when the access frequency is higher than a certain threshold, or to a lower-performance memory when the access frequency is lower than a certain threshold.
  • the migration here can be a migration within a storage node or a cross-node migration.
  • access frequency monitoring can be performed at page granularity
  • data migration can likewise be performed at page granularity. Alternatively, both access-frequency monitoring and data migration can be performed at data-item granularity, where a data item is a unit smaller than a page. Monitoring at data-item granularity identifies hot data (and cold data) more accurately, so that truly hot data is migrated to higher-performance memory and read efficiency is improved, as the sketch below illustrates.
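  • A minimal sketch of such a migration policy follows; the tier names and thresholds are invented for illustration, and a real policy would be tuned per deployment.

```go
// Access-frequency-driven tiering decision at page (or data-item) granularity.
// Tier names and thresholds are illustrative only.
package migration

// Tier orders the memories in the pool from highest to lowest performance.
type Tier int

const (
	TierDRAM Tier = iota // highest performance
	TierSCM
	TierHDD // lowest performance
)

// Placement records where a page or data item currently lives and how often
// it was accessed in the current monitoring window.
type Placement struct {
	Current     Tier
	AccessCount int
}

// NextTier promotes hot data one level up and demotes cold data one level down.
func NextTier(p Placement, hotThreshold, coldThreshold int) Tier {
	switch {
	case p.AccessCount >= hotThreshold && p.Current > TierDRAM:
		return p.Current - 1 // migrate to a higher-performance memory
	case p.AccessCount <= coldThreshold && p.Current < TierHDD:
		return p.Current + 1 // migrate to a lower-performance memory
	default:
		return p.Current
	}
}
```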
  • the memory pool provided by the first aspect also supports data prefetching. Specifically, during or after the execution of any of the above read data requests, other data associated with the data to be read is prefetched into a higher-performance memory, where associated data means data whose logical address is contiguous with that of the data to be read. Prefetching improves the hit rate of the high-performance memory and therefore the efficiency of reading data; a sketch follows.
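  • The prefetching behaviour can be pictured with the short sketch below; the Fetcher interface and the fixed 4KB page size are assumptions standing in for the IO controller's actual read and promotion paths.

```go
// Prefetch data whose logical addresses follow the data just read into a
// higher-performance memory. The interface and page size are illustrative.
package prefetch

const pageSize = 4 * 1024

// Fetcher abstracts reading a logical address and promoting data into a
// higher-performance memory of the pool.
type Fetcher interface {
	Read(logicalAddr uint64) ([]byte, error)
	Promote(logicalAddr uint64, data []byte)
}

// ReadWithPrefetch serves the requested page, then asynchronously pulls the
// next pages into faster memory to raise its hit rate.
func ReadWithPrefetch(f Fetcher, logicalAddr uint64, prefetchPages int) ([]byte, error) {
	data, err := f.Read(logicalAddr)
	if err != nil {
		return nil, err
	}
	go func() {
		for i := 1; i <= prefetchPages; i++ {
			next := logicalAddr + uint64(i)*pageSize
			if d, e := f.Read(next); e == nil {
				f.Promote(next, d)
			}
		}
	}()
	return data, nil
}
```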
  • the storage system of the first aspect is suitable for scenarios where computing and storage are separated.
  • the computing node is independent of any storage node in the system.
  • the communication between the computing node and the storage node is realized through an external network.
  • the advantage of this method is that it is more convenient to expand. For example, when computing resources are insufficient, the number of computing nodes can be increased, and the number of storage nodes remains unchanged. When the storage resources are insufficient, the number of storage nodes can be increased, and the number of computing nodes remains unchanged.
  • the storage system of the first aspect is suitable for scenarios where computing and storage are integrated.
  • a computing node and a storage node are integrated in the same physical device.
  • this physical device that integrates computing and storage may be referred to as a storage server, or simply as a storage node.
  • the communication between the computing node and the storage node is realized through the internal network, so when the read data request or the write data request is executed, the access delay is relatively low.
  • the second aspect provides a memory management method, which is applied to a storage system, and a management node or a storage node in the storage system executes the method steps therein to realize the functions of the first aspect.
  • a third aspect provides a management node, where the management node is located in the storage system provided in the first aspect, and the storage system includes one or more first memories and one or more second memories.
  • the management node includes a creation module and a control module.
  • the creation module is used to create a memory pool to provide data storage services, the memory pool including the one or more first memories and the one or more second memories, where the performance of the first memory is higher than that of the second memory, at least one of the one or more first memories is located in the first storage node, and at least one of the one or more second memories is located in the second storage node.
  • the control module is configured to control the data migration between the first memory and the second memory in the memory pool.
  • the management node provided by the third aspect creates a memory pool from memories of multiple performance levels located on different storage nodes, thereby realizing a cross-node memory pool that integrates memories of different performance, so that various types of memory (whether internal memory or hard disk) can be used as storage resources that provide storage services to the upper layer and better exert their performance advantages.
  • since the memory pool contains memories of different performance, the migration of data between these memories can be controlled according to the data's access frequency: frequently accessed data can be migrated to a higher-performance memory to improve read efficiency, and rarely accessed data can be migrated to a lower-performance memory to save the space of the higher-performance memory.
  • the memory pool in this application provides storage space for computing nodes or LUNs, which changes the processor-centric architecture of memory resources.
  • the creation module is further configured to obtain state information of the memory, the state information including the type and capacity of the first memory, and the type and capacity of the second memory.
  • the creation module is specifically configured to create the memory pool according to the status information.
  • the storage space of the memory pool includes several pages, and the global address of a page in the memory pool is mapped to the physical address of the page; the global address of the page indicates the location of the page in the memory pool, and the physical address of the page indicates the location, in a memory of a storage node, of the physical space allocated for the page.
  • the data is stored in the first memory
  • the control module is specifically configured to instruct the first storage node to obtain the access frequency of the data, and to instruct the first storage node to migrate the data to the second memory in the memory pool when the access frequency is lower than a set frequency threshold.
  • a fourth aspect provides a management node, where the management node is located in the storage system provided in the first aspect, and the storage system includes one or more first memories and one or more second memories.
  • the management node includes an interface and a processor.
  • the processor is configured to create a memory pool to provide a data storage service, the memory pool including the one or more first memories and the one or more second memories, where the performance of the first memory is higher than that of the second memory, at least one of the one or more first memories is located at the first storage node, and at least one of the one or more second memories is located at the second storage node.
  • the processor is further configured to control the migration of the data between the first memory and the second memory in the memory pool.
  • the interface is used to communicate with the first storage node and the second storage node.
  • the management node provided by the fourth aspect creates a memory pool from memories of multiple performance levels located on different storage nodes, thereby realizing a cross-node memory pool that integrates memories of different performance, so that various types of memory (whether internal memory or hard disk) can be used as storage resources that provide storage services to the upper layer and better exert their performance advantages.
  • since the memory pool contains memories of different performance, the migration of data between these memories can be controlled according to the data's access frequency: frequently accessed data can be migrated to a higher-performance memory to improve read efficiency, and rarely accessed data can be migrated to a lower-performance memory to save the space of the higher-performance memory.
  • the memory pool in this application provides storage space for computing nodes or LUNs, which changes the processor-centric architecture of memory resources.
  • the processor is further configured to obtain state information of the memory through the interface, where the state information includes the type and capacity of the first memory and the type and capacity of the second memory.
  • the processor is specifically configured to create the memory pool according to the state information.
  • the storage space of the memory pool includes several pages, and the global address of a page in the memory pool is mapped to the physical address of the page; the global address of the page indicates the location of the page in the memory pool, and the physical address of the page indicates the location, in a memory of a storage node, of the physical space allocated for the page.
  • the data is stored in the first memory, and the processor is specifically configured to instruct the first storage node to obtain the access frequency of the data, and to instruct the first storage node to migrate the data to the second memory in the memory pool when the access frequency is lower than a set frequency threshold.
  • a fifth aspect provides a computer-readable storage medium that stores program instructions, and the program instructions are used to execute the following method: creating a memory pool to provide a data storage service, the memory pool including one or more first memories and one or more second memories, where the performance of the first memory is higher than that of the second memory, at least one of the one or more first memories is located in a first storage node, and at least one of the one or more second memories is located in a second storage node; and controlling migration of data between the first memory and the second memory in the memory pool.
  • the method further includes acquiring state information of the memory, the state information including the type and capacity of the first memory, and the type and capacity of the second memory.
  • the creating a memory pool specifically includes: creating the memory pool based on the state information.
  • the storage space of the memory pool includes several pages, and the global address of a page in the memory pool is mapped to the physical address of the page; wherein, the global address of the page is used to indicate the page In the location in the memory pool, the physical address of the page is used to indicate the location of the physical space allocated for the page in the memory of the storage node.
  • controlling the migration of the data between the first memory and the second memory in the memory pool specifically includes: instructing the first storage node to obtain the access frequency of the data, and instructing the first storage node to migrate the data to the second memory when the access frequency is lower than a set frequency threshold.
  • a sixth aspect provides a computer program product comprising computer program code; when the computer program code is executed, the method performed by the management node or the computing node in the above aspects is carried out.
  • FIG. 1 is a schematic diagram of the system architecture of the storage system provided by this embodiment
  • FIG. 2 is a schematic diagram of the structure of a storage node provided by this embodiment
  • FIG. 3 is a schematic diagram of the structure of the IO controller provided by this embodiment.
  • FIG. 4 is a schematic diagram of the architecture of the memory pool provided by this embodiment.
  • FIG. 5 is a schematic diagram of each level of memory included in the memory pool provided by this embodiment.
  • FIG. 6 is a schematic diagram of the architecture of another memory pool provided by this embodiment.
  • FIG. 7 is a schematic diagram of the architecture of another memory pool provided by this embodiment.
  • FIG. 8 is a schematic diagram of the architecture of the storage pool provided in this embodiment.
  • FIG. 9 is a schematic flowchart of the data writing method provided by this embodiment.
  • FIG. 10 is a schematic flowchart of another data writing method provided by this embodiment.
  • FIG. 11 is a schematic flowchart of a method for reading data provided by this embodiment.
  • FIG. 12 is a schematic diagram of a structure of a management node provided by this embodiment.
  • FIG. 13 is a schematic diagram of another structure of the management node provided by this embodiment.
  • the storage system provided in this embodiment includes a computing node cluster and a storage node cluster.
  • the computing node cluster includes one or more computing nodes 100 (two computing nodes 100 are shown in FIG. 1, but not limited to two computing nodes 100).
  • the computing node 100 is a computing device on the user side, such as a server, a desktop computer, and the like.
  • a processor and memory are provided in the computing node 100 (not shown in FIG. 1).
  • the computing node 100 runs an application program 101 (application 101 for short) and a client program 102 (client 102 for short).
  • Application 101 is a general term for various application programs presented to users.
  • the client 102 is configured to receive a data access request triggered by the application 101, interact with the storage node 20, and send the data access request to the storage node 20.
  • the client 102 is also used to receive data from the storage node and forward the data to the application 101. It can be understood that when the client 102 is a software program, the functions of the client 102 are implemented by a processor included in the computing node 100 running a program in the memory.
  • the client 102 may also be implemented by hardware components located inside the computing node 100. Any client 102 in the computing node cluster can access any storage node 20 in the storage node cluster.
  • the storage node cluster includes one or more storage nodes 20 (three storage nodes 20 are shown in FIG. 1, but the cluster is not limited to three storage nodes 20), and the storage nodes 20 can be interconnected with one another. A storage node may be, for example, a server, a desktop computer, or the controller or hard disk enclosure of a storage array. In terms of function, the storage node 20 is mainly used to perform calculation or processing on data. In addition, the storage node cluster includes a management node (not shown in FIG. 1), which is used to create and manage the memory pool. The storage nodes 20 elect one node from among themselves to assume the function of the management node, and the management node can communicate with any storage node 20.
  • the storage node 20 at least includes a processor, a memory, and an IO controller 201.
  • the processor 202 is a central processing unit (CPU), which is used to process data from outside the storage node 20 or data generated inside the storage node 20.
  • memory here refers, in the broad sense, to a device used to store data, which may be an internal memory or a hard disk. Internal memory is the memory that exchanges data directly with the processor; it can be read and written at any time, is very fast, and serves as temporary data storage for the operating system and other running programs.
  • the memory includes at least two types of memories. For example, the memory can be either a random access memory or a read only memory (ROM).
  • the random access memory may be a dynamic random access memory (Dynamic Random Access Memory, DRAM) or a storage class memory (Storage Class Memory, SCM).
  • DRAM is a type of semiconductor memory. Like most random access memory (RAM), it is a volatile memory device.
  • SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory. Storage class memory provides faster read and write speeds than a hard disk, but it is slower in operation speed than DRAM and cheaper than DRAM in cost.
  • DRAM and SCM are only exemplary descriptions in this embodiment, and the memory may also include other random access memories, such as static random access memory (SRAM) and so on.
  • the memory may also be a dual in-line memory module (DIMM for short), that is, a module composed of dynamic random access memory (DRAM).
  • the memory in this embodiment may also be a hard disk. Unlike the memory 203, the hard disk reads and writes data at a slower speed than the memory, and is usually used to store data persistently.
  • taking the storage node 20a as an example, one or more hard disks may be installed inside it; alternatively, a hard disk enclosure (as shown in FIG. 2) may be mounted outside the storage node 20a, with multiple hard disks installed in the enclosure. Regardless of the deployment method, these hard disks can be regarded as hard disks of the storage node 20a.
  • the hard disk may be a solid state drive, a mechanical hard disk, or another type of hard disk.
  • other storage nodes in the storage node cluster such as storage node 20b and storage node 20c, may also include various types of hard disks.
  • One storage node 20 may contain one or more memories of the same type.
  • the hard disk included in the memory pool in this embodiment may also have a memory interface, and the processor can directly access it.
  • FIG. 2 is a schematic diagram of the internal structure of the storage node 20.
  • the storage node 20 may be a server or a storage array.
  • in addition to the processor and the memory, the storage node 20 also includes an IO controller. Because memory access latency is low, operating system scheduling and software overhead may become the bottleneck of data processing. To reduce software overhead, this embodiment introduces the IO controller as a hardware component that offloads IO access to hardware, reducing the impact of CPU scheduling and the software stack.
  • each storage node 20 has its own IO controller 22, which is used to communicate with the computing node 100 and with other storage nodes. Specifically, the storage node 20 may receive a request from the computing node 100 or send a request to the computing node 100 through the IO controller 22, and it may also send a request to, or receive a request from, another storage node (for example, storage node 30) through the IO controller 22. In addition, the memories in the storage node 20 may communicate with each other through the IO controller 22, and may also communicate with the computing node 100 through the IO controller 22.
  • if the hard disks included in the storage node 20 are located inside the storage node 20, they can communicate with each other through the IO controller 22 and can also communicate with the computing node 100 through the IO controller 22. If the hard disks are located in a hard disk enclosure external to the storage node 20, the enclosure is provided with an IO controller 24; the IO controller 24 communicates with the IO controller 22, and the hard disks can send data or instructions to the IO controller 22 through the IO controller 24 and receive data or instructions from the IO controller 22 through the IO controller 24.
  • the storage node 20 may also include a bus (not shown in FIG. 2) for communication between various components within the storage node 20.
  • FIG. 3 is a schematic diagram of the IO controller structure.
  • the IO controller 22 includes a communication unit 220 and a computing unit 221.
  • the communication unit provides efficient network transmission capability for external and internal communication; here, a network interface controller (NIC) is taken as an example.
  • the calculation unit is a programmable electronic component used to perform calculation processing on data.
  • This embodiment takes a data processing unit (DPU) as an example for description.
  • DPU has the versatility and programmability of the CPU, but it is more specialized, and can run efficiently on network data packets, storage requests or analysis requests.
  • the DPU is distinguished from the CPU by a greater degree of parallelism (it needs to handle a large number of requests).
  • the DPU here can also be replaced with a processing chip such as a graphics processing unit (GPU), an embedded neural network processor (neural-network processing unit, NPU), etc.
  • the DPU is used to provide data unloading services for the memory pool, such as address index or address query functions, partition functions, and operations such as filtering and scanning data.
  • after an IO request is sent to the storage node 20 through the NIC, it is processed directly on the computing unit 221, bypassing the storage node's internal CPU and operating system, thinning the software stack, and reducing the impact of CPU scheduling.
  • the DPU can directly query the corresponding information of the read IO request in the index table.
  • the IO controller 22 also includes a DRAM 222.
  • the DRAM 222 is physically the same as the DRAM described above, but here it is used by the IO controller 22 to temporarily store data or instructions, and it does not form part of the memory pool.
  • the IO controller 22 may also map the DRAM 222 to the computing node 100, so that the space of the DRAM 222 is visible to the computing node 100, thereby converting the IO access into an access based on memory semantics.
  • the structure and function of the IO controller 24 are similar to those of the IO controller 22, and will not be described one by one here.
  • FIG. 4 is a schematic diagram of the architecture of the memory pool.
  • the memory pool contains multiple different types of memory, and each type of memory can be regarded as a level.
  • the performance of each level of memory is different from the performance of other levels of memory.
  • the performance of the memory in this application is mainly considered in terms of computing speed and/or access delay.
  • FIG. 5 is a schematic diagram of each level of memory included in the memory pool provided by this embodiment. As shown in FIG. 5, the memory pool is composed of memories in each storage node 20, where the DRAM in each storage node is located at the first level of the memory pool, because DRAM has the highest performance among various types of memory.
  • the performance of SCM is lower than that of DRAM, so the SCM in each storage node is located in the second level of the memory pool. Further, the performance of the hard disk is lower than the SCM, so the hard disk in each storage node is located in the third level of the memory pool.
  • although only three types of memory are shown in FIG. 3, according to the previous description a variety of different types of memory can be deployed inside the storage node 20 in product practice; that is, various types of memory or hard disk can become part of the memory pool, and memories of the same type located on different storage nodes belong to the same level of the memory pool. This application does not limit the types of memory contained in the memory pool or the number of levels. The levels of the memory pool are only an internal division and are not perceptible to upper-layer applications.
  • the memory pool shown in FIG. 4 or FIG. 5 includes all types of memory in the storage nodes. However, in other embodiments, as shown in FIG. 6, the memory pool may include only some types of memory, for example only higher-performance memories such as DRAM and SCM, excluding hard disks and other relatively low-performance memories.
  • refer to FIG. 7 for the network architecture of another memory pool provided in this embodiment.
  • storage nodes and computing nodes are integrated in the same physical device.
  • the integrated devices are collectively referred to as storage nodes.
  • the application is deployed inside the storage node 20, so the application can directly trigger a write data request or a data read request through a client in the storage node 20, which is processed by the storage node 20, or sent to other storage nodes 20 for processing.
  • the read and write data request sent by the client to the local storage node 20 specifically refers to the client sending a data access request to the processor.
  • the components and functions included in the storage node 20 are similar to those of the storage node 20 in FIG. 6, and will not be repeated here.
  • the memory pool under this network architecture can contain all types of memory in the storage nodes, or only some types, for example only higher-performance memories such as DRAM and SCM, excluding hard disks and other relatively low-performance memories (as shown in FIG. 7).
  • the memory pool in this embodiment is established in at least two storage nodes, and the storage space it contains comes from at least two different types of storage.
  • the management node may also construct lower-performance memories (such as hard disks) in the storage cluster into a storage pool.
  • FIG. 8 uses the network architecture shown in FIG. 6 as an example to describe the storage pool. Like the memory pool, the storage pool shown in FIG. 8 spans at least two storage nodes, and its storage space is composed of one or more types of hard disks in those storage nodes.
  • when the storage cluster contains both a memory pool and a storage pool, the storage pool is used to persistently store data, especially data with a low access frequency, while the memory pool is used to temporarily store data, especially data with a high access frequency.
  • the storage pool can also be established in the network architecture shown in FIG. 7, and its implementation principle is similar to the above description. However, the storage pool is not the focus of the discussion in this application, and we will continue to discuss the memory pool next.
  • Each storage node 20 periodically reports the state information of the storage to the management node through the heartbeat channel.
  • the state information of the memory includes, but is not limited to: the types, health status, total capacity and available capacity of various memories included in the storage node, and so on.
  • the management node creates a memory pool based on the collected information. Creating the memory pool means bringing the storage space provided by each storage node 20 under unified management as the memory pool.
  • the storage node 20 may selectively provide storage to the memory pool according to its own conditions, such as the health status of the storage. In other words, it is possible that some memory in some storage nodes is not part of the memory pool.
  • each space of the memory pool has a unique global address.
  • a global address means that the space it indicates is unique in the memory pool, and each storage node 20 knows the meaning of the address.
  • after physical space is allocated to a section of the memory pool, the global address of that space has a corresponding physical address, and the physical address indicates on which storage node and in which memory the space represented by the global address is actually located, as well as the offset within that memory, that is, the location of the physical space. One possible encoding is sketched below.
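  • One hypothetical way to encode such a physical address in a single 64-bit value (node ID, memory device, in-device offset) is shown below; the field widths are arbitrary choices made for illustration, not part of the described system.

```go
// A possible 64-bit packing of a physical address: 16 bits of node ID,
// 8 bits of memory-device ID, 40 bits of offset. Field widths are arbitrary.
package addr

// Physical identifies the storage node, the memory within it, and the offset.
type Physical struct {
	Node   uint16
	Device uint8
	Offset uint64 // must fit in 40 bits for this packing
}

// Pack folds the three fields into one value suitable for an index-table entry.
func Pack(p Physical) uint64 {
	return uint64(p.Node)<<48 | uint64(p.Device)<<40 | (p.Offset & 0xFF_FFFF_FFFF)
}

// Unpack recovers the node, device and offset from a packed value.
func Unpack(v uint64) Physical {
	return Physical{
		Node:   uint16(v >> 48),
		Device: uint8(v >> 40),
		Offset: v & 0xFF_FFFF_FFFF,
	}
}
```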
  • Each space here refers to "page", which will be described in detail later.
  • the EC (erasure coding) check mechanism refers to dividing data into at least two data fragments and calculating check fragments of those data fragments according to a certain check algorithm; when one data fragment is lost, it can be restored from the other data fragment and the check fragment. For such data, the global address is a collection of multiple finer-grained global addresses, each of which corresponds to the physical address of one data fragment or check fragment. A minimal illustration follows.
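  • The smallest concrete case, two data fragments plus one XOR check fragment, can be sketched as follows; production systems use wider erasure codes, so this is only meant to make the recovery property tangible.

```go
// Two-fragment erasure-coding illustration: the check fragment is the XOR of
// the two data fragments, so either lost data fragment can be rebuilt from the
// surviving fragment and the check fragment. Even-length input is assumed.
package ec

// Encode splits data into two halves and computes their XOR as the check fragment.
func Encode(data []byte) (frag1, frag2, check []byte) {
	half := len(data) / 2
	frag1, frag2 = data[:half], data[half:half*2]
	check = make([]byte, half)
	for i := range check {
		check[i] = frag1[i] ^ frag2[i]
	}
	return frag1, frag2, check
}

// Recover rebuilds the missing data fragment from the surviving data fragment
// and the check fragment.
func Recover(surviving, check []byte) []byte {
	lost := make([]byte, len(surviving))
	for i := range lost {
		lost[i] = surviving[i] ^ check[i]
	}
	return lost
}
```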
  • the multiple copy mechanism refers to storing at least two copies of the same data, and the at least two copies of data are stored in two different physical addresses.
  • its global address is also a collection of multiple finer-grained global addresses, and each fine-grained global address corresponds to a physical address of a data copy.
  • the management node may allocate physical space for each global address after the memory pool is created, and may also allocate physical space for the global address corresponding to the data write request when receiving a data write request.
  • the correspondence between each global address and its physical address is recorded in an index table, and the management node synchronizes the index table to each storage node 20.
  • Each storage node 20 stores the index table so as to query the physical address corresponding to the global address according to the index table when reading and writing data subsequently.
  • the memory pool does not directly expose its storage space to the computing node 100, but virtualizes the storage space into a logical unit (LU) for use by the computing node 100.
  • Each logical unit has a unique logical unit number (LUN). Since the computing node 100 can directly perceive the logical unit number, those skilled in the art usually directly refer to the logical unit by LUN.
  • Each LUN has a LUN ID, which is used to identify the LUN.
  • the memory pool provides storage space for the LUN with a page granularity. In other words, when the storage node 20 applies for space from the memory pool, the memory pool allocates space for it by a page or an integer multiple of the page.
  • the size of a page can be 4KB, 8KB, etc., and this application does not limit the size of the page.
  • the specific location of the data in a LUN can be determined by the start address and the length of the data. For the start address, those skilled in the art usually call it a logical block address (logical block address, LBA). It is understandable that the three factors of LUN ID, LBA, and length identify a certain address segment, and an address segment can be indexed to a global address.
  • the computing node 100 usually uses a distributed hash table (DHT) method for routing.
  • the hash ring is evenly divided into several parts, each part is called a partition, and a partition corresponds to an address segment as described above.
  • the data access requests sent by the computing node 100 to the storage node 20 are all located in an address segment, such as reading data from the address segment or writing data to the address segment.
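  • A routing sketch under these conventions: the (LUN ID, LBA, length) triple is hashed to a key, the key selects one of a fixed number of partitions, and the partition is then looked up in the partition-to-global-address map maintained by the management node. The hash function and partition count below are assumptions, not values from this application.

```go
// DHT-style routing sketch: logical address -> key -> partition ID. The FNV
// hash and the partition count are illustrative choices.
package route

import (
	"encoding/binary"
	"hash/fnv"
)

const partitionCount = 4096 // total partitions on the hash ring (fixed)

// LogicalAddress is the LUN-semantics address carried by a data access request.
type LogicalAddress struct {
	LUNID  uint32
	LBA    uint64
	Length uint32
}

// PartitionOf hashes the address segment onto the ring and picks its partition.
func PartitionOf(a LogicalAddress) uint32 {
	var buf [16]byte
	binary.BigEndian.PutUint32(buf[0:4], a.LUNID)
	binary.BigEndian.PutUint64(buf[4:12], a.LBA)
	binary.BigEndian.PutUint32(buf[12:16], a.Length)
	h := fnv.New32a()
	h.Write(buf[:])
	return h.Sum32() % partitionCount
}
```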
  • the computing node 100 and the storage node 20 use LUN semantics to communicate.
  • the computing node 100 and the storage node 20 use memory semantics for communication.
  • the IO controller 22 maps the space of its DRAM to the computing node 100, so that the computing node 100 can perceive the space of the DRAM (referred to as virtual space in this embodiment), and access the virtual space .
  • the read/write data request sent by the computing node 100 to the storage node 20 no longer carries the LUN ID, LBA, and length, but other logical addresses, such as the virtual space ID, the start address and length of the virtual space.
  • the IO controller 22 can map the space in the memory pool it manages to the computing node 100, so that the computing node 100 can perceive this part of the space and obtain the global address corresponding to this part of the space.
  • the IO controller 22 in the storage node 20a is used to manage the storage space provided by the storage node 20a in the memory pool
  • the IO controller 22 in the storage node 20b is used to manage the storage space provided by the storage node 20b in the memory pool.
  • the IO controller 22 in the storage node 20c is used to manage the storage space provided by the storage node 20c in the memory pool, and so on. Therefore, the entire memory pool is visible to the computing node 100, so when the computing node 100 sends the data to be written to the storage node, it can directly specify the global address of the data.
  • the application refers to the internal service of the storage node.
  • the storage node 20a generates a memory request instruction, and the memory request instruction includes the size of the requested space and the type of memory.
  • the requested space is 16KB and the memory type is SCM.
  • the size of the requested space is determined by the size of the stored data
  • the type of storage requested is determined by the hot and cold information of the data.
  • the storage node 20a obtains a free global address from the stored index table, for example, the address range is [000001-000004], where the space with the address 000001 is one page.
  • a free global address is one that has not yet been occupied by any data. The storage node 20a then queries whether its local SCM has 16KB of free space; if so, it allocates that local space to the global address, and if not, it continues to query whether the SCMs of other storage nodes 20 have 16KB of free space, which can be done by sending query instructions to the other storage nodes 20. Because the other storage nodes 20 are at varying distances from the storage node 20a, when the storage node 20a cannot allocate 16KB of free space locally, it may preferentially query the storage nodes 20 that are closer to it in order to reduce latency.
  • after obtaining the physical address, the storage node 20a records the correspondence between the global address and the physical address in the index table and synchronizes it to the other storage nodes. Once the physical address is determined, the storage node 20a can use the space corresponding to that physical address to store data. A sketch of this allocation flow follows.
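  • The allocation flow just described might look like the sketch below, where the node interface, the error handling, and the ordering of candidate nodes are all assumptions made for brevity.

```go
// Space allocation sketch: try the local memory of the requested type first,
// then other nodes ordered from nearest to farthest. Interfaces are illustrative.
package alloc

import "errors"

// Request mirrors the memory request instruction: how much space and which type.
type Request struct {
	Size    uint64 // e.g. 16 KB, derived from the size of the data
	MemType string // e.g. "SCM", derived from how hot the data is
}

// Node is the minimal view of a storage node needed for allocation.
type Node interface {
	ID() string
	AllocateLocal(memType string, size uint64) (offset uint64, ok bool)
}

// Allocate returns the node and offset that supplied the space; the caller then
// records the global-to-physical mapping in the index table and synchronizes it.
func Allocate(local Node, nearestFirst []Node, req Request) (nodeID string, offset uint64, err error) {
	if off, ok := local.AllocateLocal(req.MemType, req.Size); ok {
		return local.ID(), off, nil
	}
	for _, n := range nearestFirst {
		if off, ok := n.AllocateLocal(req.MemType, req.Size); ok {
			return n.ID(), off, nil
		}
	}
	return "", 0, errors.New("no free space of the requested type in the memory pool")
}
```

  • After Allocate returns, the caller records the correspondence between the global address and the physical address in the index table and synchronizes it to the other storage nodes, as described above.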
  • the application refers to the application 101 in the computing node 100.
  • the memory request instruction is generated by the computing node 100 and sent to the storage node 20a. Then, the user can specify the size of the requested space and the type of storage through the computing node 100.
  • the function of the above index table is mainly to record the correspondence between the global address and the partition ID, and the correspondence between the global address and the physical address.
  • the index table can also record attribute information of the data, for example the hot/cold information or residency policy of the data whose global address is 000001. The data can subsequently be migrated between the various memories, or have its attributes set, according to this attribute information. It should be understood that the attribute information is only an optional item of the index table and does not have to be recorded.
  • for expansion of the storage system, the management node collects node update information, incorporates the new storage node into the memory pool, addresses the storage space contained in that node to generate new global addresses, and then refreshes the correspondence between partitions and global addresses (the total number of partitions remains unchanged whether the pool is expanded or shrunk). Expansion also applies to the case where an existing storage node has added memory or hard disks.
  • the management node regularly collects the state information of the memories contained in each storage node. If new memory is added, it is included in the memory pool, the new storage space is addressed to generate new global addresses, and the correspondence between partitions and global addresses is then refreshed.
  • the memory pool provided in this embodiment also supports shrinking, as long as the correspondence between the global address and the partition is updated.
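  • Because the total number of partitions never changes, expansion and shrinking reduce to recomputing the partition-to-global-address correspondence; the toy refresh below illustrates the idea with a simple round-robin assignment, which is an assumption rather than the algorithm used here.

```go
// Toy refresh of the partition-to-global-address correspondence after the set
// of addressable global ranges changes. The round-robin policy is illustrative.
package rebalance

// Refresh maps each partition to one of the currently available global address
// ranges; partitionCount stays fixed across expansion and shrinking.
func Refresh(partitionCount int, globalRanges []uint64) map[int]uint64 {
	m := make(map[int]uint64, partitionCount)
	if len(globalRanges) == 0 {
		return m
	}
	for p := 0; p < partitionCount; p++ {
		m[p] = globalRanges[p%len(globalRanges)]
	}
	return m
}
```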
  • Each memory in the memory pool provided in this embodiment provides a memory interface to the processor, so that the processor sees a continuous space and can directly perform read and write operations on the memory in the memory pool.
  • a memory pool is thus created from memories of multiple performance levels located on different storage nodes, thereby realizing a cross-node memory pool that integrates memories of different performance. Various types of memory (whether internal memory or hard disk) can serve as storage resources that provide storage services to the upper layer, making better use of their performance advantages. Since the memory pool contains memories of different performance, the migration of data between these memories can be controlled according to the data's access frequency: frequently accessed data can be migrated to a higher-performance memory to improve read efficiency, and rarely accessed data can be migrated to a lower-performance memory to save the space of the higher-performance memory. In addition, the memory pool in this application provides storage space for computing nodes or LUNs, which changes the processor-centric architecture of memory resources.
  • FIG. 9 is a schematic flow chart of executing the method provided in this embodiment. As shown in Figure 9, the method includes the following steps.
  • the computing node 100 sends a data write request to the storage node, where the data write request carries data to be written and a logical address of the data to be written.
  • the logical address includes LUN ID, LBA, and length.
  • the logical address includes the virtual space ID, the start address and length of the virtual space.
  • the calculation unit 221 obtains the data write request from the DRAM 222, takes the logical address as input, and outputs a key according to a certain algorithm; the key uniquely maps to a partition ID.
  • the calculation unit 221 queries the global address corresponding to the partition ID in the index table.
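  • For illustration, S102 and S103 could be sketched as below; the hash function, the fixed partition count, and the modulo mapping are assumptions made for the sketch rather than the specific algorithm of this embodiment.
```python
# Sketch only: derive a partition ID from a logical address, then look up its global address.
import hashlib

PARTITION_COUNT = 4096   # assumed fixed total number of partitions

def partition_of(logical_address: bytes) -> int:
    """S102: hash the logical address (e.g. LUN ID + LBA + length) into a key that
    uniquely locates one partition ID."""
    key = hashlib.sha256(logical_address).digest()
    return int.from_bytes(key[:8], "big") % PARTITION_COUNT

def global_address_of(index_table, logical_address: bytes) -> int:
    """S103: look up the global address recorded for that partition in the index table."""
    return index_table.partition_to_global[partition_of(logical_address)]
```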
  • S104: The calculation unit 221 determines whether a physical address has been allocated for the global address. If not, S105 is executed: allocate physical space for the global address and create the correspondence between the global address and the physical address.
  • For the specific allocation method, refer to the space allocation process described earlier. If a physical address has already been allocated to the global address, S106 is executed.
  • If a multi-copy mechanism is used to ensure data reliability, the data to be written needs to be stored as multiple copies in the storage cluster, and each copy is stored at a different physical address.
  • The writing process of each copy is similar, so writing one copy is taken as an example here.
  • S106: The calculation unit 221 writes the data to be written into the physical space indicated by the physical address.
  • The physical address indicates the storage node where the physical space is located, the memory within that storage node, and the offset within that memory, so the IO controller 22 can store the data directly according to the address. For example, if the physical space indicated by the physical address is located in the SCM of the storage node, the IO controller 22 performs the write. If the physical space is located in a hard disk of the storage node, the calculation unit 221 notifies the communication unit 220, the communication unit 220 sends the write request to the IO controller 24, and the IO controller 24 performs the write.
  • If the physical space indicated by the physical address is located on another storage node, the calculation unit 221 notifies the communication unit 220, and the communication unit 220 sends the data to be written to that storage node and instructs it to write the data into the physical space indicated by the physical address.
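  • A rough sketch of the dispatch in S106 follows; the (node, memory, offset) layout of the physical address and the io22/io24/comm helper objects are assumptions used only for illustration.
```python
# Sketch only: route one copy of the data according to the physical address (S106).
def write_copy(physical_addr, data, local_node_id, io22, io24, comm):
    node_id, memory_type, offset = physical_addr   # node, memory inside the node, offset
    if node_id != local_node_id:
        # The space belongs to another storage node: forward the data and let that node write it.
        comm.send_write(node_id, physical_addr, data)
    elif memory_type == "HDD":
        # Hard disk in the enclosure: the communication unit hands the request to IO controller 24.
        io24.write(offset, data)
    else:
        # DRAM or SCM in the local node: IO controller 22 writes directly at the given offset.
        io22.write(memory_type, offset, data)
```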
  • If the EC check mechanism is used, the calculation unit 221 obtains the data to be written from the DRAM 222, divides the data into multiple data fragments, and computes the check fragments of those data fragments.
  • Each data fragment or check fragment has its own logical address, and that logical address is a subset of the logical address carried in the data write request.
  • For each data fragment or check fragment, the calculation unit 221 uses its logical address as input and outputs a key according to a certain algorithm; the key uniquely locates a partition ID.
  • The calculation unit 221 queries the index table for the global address corresponding to the partition ID, further obtains the physical address corresponding to the global address, and then stores each data fragment or check fragment in the space indicated by that physical address.
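  • As a toy illustration of the fragmentation step only, the sketch below splits data into k fragments and derives a single XOR check fragment; real EC schemes (for example Reed-Solomon) and the fragment sizing used by this embodiment may differ.
```python
# Toy sketch only: split data into k fragments and compute a single XOR check fragment.
def ec_encode(data: bytes, k: int):
    frag_len = -(-len(data) // k)                     # ceiling division
    padded = data.ljust(frag_len * k, b"\x00")
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = bytes(frag_len)
    for frag in fragments:
        parity = bytes(a ^ b for a, b in zip(parity, frag))
    # Each fragment (and the check fragment) would then get its own logical address.
    return fragments, parity
```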
  • This embodiment provides another data write method.
  • In this method, the IO controller 22 in each storage node 20 provides the global addresses of the part of the memory pool it manages to the computing node 100, so that the computing node 100 can perceive the space of the memory pool and access the storage node 20 through global addresses.
  • In this case, the data write request sent by the computing node 100 to the storage node 20 carries a global address instead of a logical address.
  • FIG. 10 is a schematic flowchart of executing this method. As shown in FIG. 10, the method includes the following steps.
  • S301: The computing node 100 sends a data write request to the storage node 20, where the data write request carries the data to be written and the global address of the data to be written.
  • A bitmap of the memory pool's global addresses is stored inside the computing node 100; the bitmap records the global addresses corresponding to the pages in the memory pool and the usage of each page. For example, if the record corresponding to a page's global address is "1", the page already stores data; if the record is "0", the page has not stored data yet and is a free page.
  • Therefore, the computing node can learn from the bitmap which global addresses indicate storage space that already holds data and which indicate free storage space, and can select the global address of a free page when sending a data write request and carry it in the request.
  • Specifically, after the storage node 20 has executed a data write request, it sends a response message to the computing node 100, and the computing node 100 can use the response message to mark (set to "1") the global address of the corresponding page in the bitmap. After the communication unit 220 of the storage node 20 receives the data write request, it stores the request in the DRAM 222.
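  • A minimal sketch of such a bitmap follows; the 4 KB page size, the byte-per-page layout, and the assumption that a global address equals the page index times the page size are illustrative choices, not requirements of this embodiment.
```python
# Sketch only: a bitmap the computing node can use to pick free pages by global address.
class GlobalAddressBitmap:
    def __init__(self, page_count: int, page_size: int = 4096):
        self.page_size = page_size
        self.bits = bytearray(page_count)      # 0 = free page, 1 = page already stores data

    def find_free_page(self) -> int:
        """Return the global address of a free page (assuming address = page index * page size)."""
        return self.bits.index(0) * self.page_size   # raises ValueError when no page is free

    def mark_used(self, global_addr: int) -> None:
        """Called when the storage node's write response arrives: set the page's record to 1."""
        self.bits[global_addr // self.page_size] = 1
```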
  • In addition, as shown in FIG. 1, the storage node cluster contains multiple storage nodes 20, so when the computing node 100 sends a data write request it needs to select a specific storage node 20 according to the global address.
  • As described earlier, a global address corresponds to a physical address, and the physical address indicates which memory of which storage node holds the space indicated by the global address. Therefore, a particular storage node 20 can only manage the global addresses corresponding to the memories it owns and perform data write or read operations on those global addresses. If the storage node 20 receives data that is to be written to another storage node, it can forward the data to that node, but the processing delay is then relatively long.
  • To reduce access latency, when the management node addresses the memory pool, it can embed one or more bytes in the global address to indicate which storage node holds the space indicated by that global address, or it can perform the addressing according to a certain algorithm so that each global address corresponds to a unique storage node.
  • In this way, the computing node 100 can identify the storage node corresponding to a global address and send the write request directly to that storage node for processing.
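  • One possible encoding, shown only as a sketch, reserves the top bits of the global address for a node ID; the 16/48-bit split is an assumption, not a format defined by this embodiment.
```python
# Sketch only: reserve the top bits of a global address for the owning storage node.
NODE_BITS = 16
OFFSET_BITS = 48          # bits left for addressing space within that node's contribution

def make_global_address(node_id: int, node_local_offset: int) -> int:
    assert node_id < (1 << NODE_BITS) and node_local_offset < (1 << OFFSET_BITS)
    return (node_id << OFFSET_BITS) | node_local_offset

def node_of(global_addr: int) -> int:
    # Lets the computing node send the request straight to the right storage node.
    return global_addr >> OFFSET_BITS
```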
  • S302: The calculation unit 221 obtains the data write request from the DRAM 222 and determines whether a physical address has been allocated for the global address. If not, S303 is executed: allocate physical space for the global address and create the correspondence between the global address and the physical address.
  • For the specific allocation method, refer to the space allocation process described earlier. If a physical address has already been allocated to the global address, S304 is executed.
  • S304: The calculation unit 221 writes the data to be written into the physical space indicated by the physical address. For this step, refer to the description of S106 in FIG. 9, which is not repeated here.
  • In addition, similar to the process described in FIG. 9, a multi-copy mechanism or an EC check mechanism may also be used here to ensure data reliability.
  • When data is first written to the storage node cluster, it is usually stored in the DRAM tier of the memory pool. As the access frequency of the data gradually decreases, or as the available capacity of the DRAM tier gradually shrinks, the storage node cluster triggers data migration internally; this process is not perceived by the computing node cluster.
  • The data migration policy is stored in the management node, and the management node controls data migration operations according to that policy.
  • The data migration policy includes, but is not limited to, the trigger condition for performing a data migration operation, for example, executing periodically or when a certain condition is met.
  • The condition here can be that the access frequency of the data is higher or lower than a set threshold, or that the available capacity of the memory where the data is located is higher or lower than a set threshold, and so on.
  • Here, "control" means that the management node instructs each storage node 20 to monitor the access frequency of the data it stores, and instructs the storage node 20 to migrate that data between its memories.
  • In addition to periodically triggered data migration, when the computing node 100 sends a data write request to the storage node, the request can carry the hot/cold information of the data (used to indicate its access frequency).
  • When the storage node executes the data write request, one execution method is to first write the data into the DRAM and then immediately perform a data migration operation according to the hot/cold information, so as to migrate the data from the DRAM to a memory that matches that information.
  • Alternatively, the storage node may obtain the hot/cold information of the data from the data's metadata structure or logical address, and then perform the data migration operation according to that information.
  • Another execution method is that the storage node directly determines, according to the hot/cold information, the memory that matches it, and writes the data into that memory directly through the IO controller.
  • In addition, the computing node 100 may also specify a residency policy for the data in the data write request.
  • A residency policy means that a certain type of data needs to be stored in a certain type of memory for a long time. Once such data is stored in the designated memory, no data migration operation is performed on it, regardless of whether its access frequency rises or falls.
  • Taking target data located in the DRAM tier as an example, and assuming the target data is located on storage node 20a, storage node 20a periodically counts the access frequency of the target data. When the access frequency is lower than the access threshold of the DRAM tier, the target data is migrated to the SCM tier or another tier.
  • One option is that each tier of memory in the memory pool has an access threshold interval. When the access frequency of data is higher than the maximum of the interval, or lower than the minimum of the interval, the data needs to be migrated to the tier that matches its access frequency.
  • Another option is not to set an access threshold interval for each tier of memory and only to compare the access frequency with a set access threshold. When the access frequency of the target data is lower than that threshold, the data needs to be migrated to a lower-performance tier. The two options are sketched below.
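  • The sketch below illustrates both threshold variants; the tier names, numeric bounds, and the migrate_fn callback are assumptions made for illustration only.
```python
# Sketch only: the two threshold variants described above.
INF = float("inf")
# Variant 1: each tier has an access-threshold interval (accesses per statistics period).
TIER_INTERVALS = [("DRAM", 1000, INF), ("SCM", 100, 1000), ("HDD", 0, 100)]

def tier_for_frequency(freq: float) -> str:
    for tier, low, high in TIER_INTERVALS:
        if low <= freq < high:
            return tier
    return "HDD"

def maybe_migrate(page_addr: int, current_tier: str, freq: float, migrate_fn) -> None:
    target = tier_for_frequency(freq)
    if target != current_tier:
        migrate_fn(page_addr, target)   # the physical address changes, the global address does not

# Variant 2: only one access threshold; falling below it demotes the data to a lower tier.
def should_demote(freq: float, access_threshold: float) -> bool:
    return freq < access_threshold
```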
  • Continuing with the target data as an example, if its current access frequency falls within the access frequency range of the hard disk tier, it is first determined whether the local hard disk of storage node 20a has free space. If so, the target data is migrated to the local hard disk of storage node 20a; otherwise, the target data is sent to another storage node, for example storage node 20b, and storage node 20b is instructed to write the target data into its hard disk. The global address of the target data does not change before or after the migration, because the upper-layer application is unaware of it; what changes is the physical address of the target data. After the migration is completed, each storage node 20 updates the correspondence between the global address and the physical address of the target data in its own index table.
  • In addition to migrating data between tiers according to its access frequency (also called its hot/cold degree), another migration policy is to migrate data based on the available capacity of each tier.
  • Higher-tier memory has better performance and higher cost, and its storage space is more precious than that of lower-tier memory.
  • For example, when the available capacity of the DRAM tier is lower than a set capacity threshold, the DRAM tier needs to migrate part of its stored data to the SCM tier or another tier in order to free up more space for newly written data. Which data to move to the lower tier can be chosen by referring to existing memory eviction algorithms, which are not described here.
  • Similarly, the SCM tier and other tiers also have their own capacity thresholds. When the available capacity of a tier is lower than its capacity threshold, part of the data stored in that tier is migrated to another tier.
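  • Capacity-driven migration might be organized as in the sketch below; the tier object's methods and the LRU-style victim choice are assumptions, since this embodiment does not prescribe a particular eviction algorithm.
```python
# Sketch only: demote data when a tier's available capacity falls below its threshold.
def rebalance_tier(tier, lower_tier, capacity_threshold: int, update_index_fn) -> None:
    while tier.available_capacity() < capacity_threshold:
        victim = tier.pick_victim()      # e.g. a least-recently-used page (assumed policy)
        lower_tier.store(victim)         # write the data into the lower tier first
        tier.evict(victim)               # then release the space in the higher tier
        update_index_fn(victim)          # refresh the global-to-physical mapping on every node
```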
  • As mentioned earlier, the memory pool provides external storage space at page granularity. Therefore, the access frequency of data can also be counted in units of pages, and correspondingly, data migration between the tiers is also performed in units of pages.
  • However, in practice, applications often need to allocate smaller-grained objects, such as data items, on top of a page. If the page size is 4 KB, a data item may be 1 KB, 2 KB, or 3 KB (anything smaller than the page size). In that case, access frequency statistics at page granularity are inevitably inaccurate: some data items in a page may be accessed frequently while other data items in the same page are hardly accessed at all. Page-granularity statistics would then keep such a page resident in the DRAM or SCM medium and waste a large amount of space.
  • Therefore, this embodiment also provides access frequency statistics at data item granularity: data migration is performed at data item granularity, and hot and cold pages are then aggregated, so that more efficient swap-in and swap-out performance can be achieved.
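  • Item-granularity counters aggregated back to pages could be kept roughly as below; the counter layout and the (page address, item offset) key are assumptions made for the sketch.
```python
# Sketch only: per-item access counters that can be aggregated back to page heat.
from collections import defaultdict

item_hits = defaultdict(int)             # (page global address, item offset) -> access count

def record_access(page_addr: int, item_offset: int) -> None:
    item_hits[(page_addr, item_offset)] += 1

def page_heat(page_addr: int) -> int:
    """Aggregate the item counters of one page, so hot items can be regrouped into hot pages."""
    return sum(n for (addr, _), n in item_hits.items() if addr == page_addr)
```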
  • The following describes the process of executing a data read request. FIG. 11 is a schematic flowchart of the method for executing a data read request provided by this embodiment. As shown in FIG. 11, the method includes the following steps.
  • S201: The computing node 100 sends a data read request to the storage node, where the data read request carries the logical address of the data to be read, and the IO controller 22 of the storage node receives the data read request.
  • In the LUN semantic scenario, the logical address includes the LUN ID, the LBA, and the length.
  • In the memory semantic scenario, the logical address includes the virtual space ID, and the start address and length within the virtual space. After receiving the data read request, the communication unit 220 of the storage node stores it in the DRAM 222.
  • S202: The calculation unit 221 obtains the data read request from the DRAM 222, takes the logical address as input, and outputs a key according to a certain algorithm; the key uniquely locates a partition ID.
  • S203: The calculation unit 221 queries the index table for the global address corresponding to the partition ID.
  • S204: The calculation unit 221 queries the index table for the physical address corresponding to the global address.
  • S205: The calculation unit 221 reads the data to be read from the physical space indicated by the physical address, and the communication unit 220 returns the data to the computing node 100.
  • The physical address indicates the storage node where the physical space is located, the memory within that storage node, and the offset within that memory, so the calculation unit 221 can read directly according to the address. If the physical space indicated by the physical address is located on another storage node, the read request is sent to that storage node, which is instructed to read the data from the physical space indicated by the physical address.
  • If the multi-copy mechanism is used to store the data, the storage node can read any one data copy according to the above process and send that copy to the computing node 100. If the EC check mechanism is used, the storage node needs to read every data fragment and check fragment according to the above process, merge them to obtain the data to be read, and verify the data; only after the verification succeeds is the data returned to the computing node 100.
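  • Pairing with the XOR sketch given earlier, a toy verification of the merged read might look like the following; a real implementation would record the original data length rather than stripping zero padding, so the rstrip here is purely illustrative.
```python
# Toy sketch only: merge fragments read back from the pool and verify them against the XOR parity.
def ec_read(fragments, parity: bytes) -> bytes:
    check = bytes(len(parity))
    for frag in fragments:
        check = bytes(a ^ b for a, b in zip(check, frag))
    if check != parity:
        raise IOError("EC verification failed")
    # A real implementation would record the original length instead of stripping padding.
    return b"".join(fragments).rstrip(b"\x00")
```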
  • It can be understood that the data read method shown in FIG. 11 corresponds to the data write method shown in FIG. 9, which is why the read request in this method carries the logical address of the data to be read.
  • This embodiment also provides another data read method, which corresponds to the data write method shown in FIG. 10. In that method, the read request carries the global address of the data to be read, and the calculation unit 221 can directly look up the physical address from the global address to obtain the data to be read.
  • In addition, the memory pool shown in FIG. 1 to FIG. 8 also supports a data prefetch mechanism.
  • Those skilled in the art can understand that reading data from higher-performance memory is faster than reading data from lower-performance memory. Therefore, if the data to be read hits in the higher-performance memory, it no longer needs to be read from the lower-performance memory, and the read efficiency is higher.
  • To improve the cache hit rate, a common practice is to read a piece of data in advance from the lower-performance memory and write it into the higher-performance memory.
  • Then, when the computing node 100 sends a read request for this piece of data, the IO controller can read it directly from the higher-performance memory because the data has already been read into it in advance. For data whose logical addresses are consecutive, it is more likely that the data will be read together; therefore, in practice, data is usually prefetched based on logical addresses. Data prefetch methods include synchronous prefetch and asynchronous prefetch.
  • Synchronous prefetch means that when a read request is executed and the data to be read misses in the higher-tier memory, the data whose logical addresses are consecutive with the logical address of the data to be read is read from the lower-tier memory and written into the higher-tier memory.
  • Asynchronous prefetch means that when a read request is executed and the data to be read hits in the higher-tier memory, the data whose logical addresses are consecutive with the logical address of the data to be read is read from the lower-tier memory and written into the higher-tier memory.
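  • The two prefetch modes could be expressed as in the sketch below; the page-sized notion of the "next" logical address and the get/put tier interface are assumptions made for illustration.
```python
# Sketch only: synchronous prefetch on a miss, asynchronous prefetch on a hit.
PAGE_SIZE = 4096

def logical_successor(logical_addr: int) -> int:
    return logical_addr + PAGE_SIZE          # assumed notion of the "next" consecutive address

def prefetch_next(logical_addr: int, high_tier, low_tier) -> None:
    nxt = logical_successor(logical_addr)
    if high_tier.get(nxt) is None:
        high_tier.put(nxt, low_tier.get(nxt))    # pull the neighbouring data into the fast tier

def read_with_prefetch(logical_addr: int, high_tier, low_tier):
    data = high_tier.get(logical_addr)
    if data is not None:
        prefetch_next(logical_addr, high_tier, low_tier)   # hit: asynchronous prefetch
        return data
    data = low_tier.get(logical_addr)                      # miss: read from the lower tier
    high_tier.put(logical_addr, data)
    prefetch_next(logical_addr, high_tier, low_tier)       # miss: synchronous prefetch
    return data
```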
  • With reference to FIG. 11, the method for executing a data read request may further include the following step.
  • S206: The calculation unit 221 migrates other data whose logical addresses are consecutive with the logical address of the data to be read into the higher-tier memory.
  • In S205, the calculation unit 221 reads the data to be read from the physical space indicated by the physical address; that data may be stored in a higher-tier memory (such as DRAM) or in a lower-tier memory (such as SCM). If the data to be read is stored in the DRAM, the calculation unit 221 hits in the DRAM; if it is stored in the SCM, the calculation unit 221 misses in the DRAM. In either case, the calculation unit 221 can prefetch other data whose logical addresses are consecutive with the logical address of the data to be read into the DRAM.
  • Specifically, the calculation unit 221 first obtains the logical address that is consecutive with the logical address of the data to be read. For ease of description, the logical address of the data to be read is referred to as logical address 1, and the logical address consecutive with logical address 1 is referred to as logical address 2.
  • The calculation unit 221 takes logical address 2 as input and outputs a key according to a certain algorithm; the key uniquely locates a partition ID.
  • Then, the calculation unit 221 queries the index table for the global address corresponding to the partition ID, and for the physical address corresponding to that global address.
  • Finally, the calculation unit 221 reads the other data from the physical space indicated by that physical address.
  • The other data may be located in the local storage node of the calculation unit 221 or in another storage node. If the physical space indicated by the physical address is located on another storage node, that node reads the data from the physical space indicated by the physical address.
  • Similarly, if the read request sent by the computing node 100 to the storage node 20 carries a global address, then when prefetching, the data stored at addresses consecutive with that global address is read in advance into the higher-tier memory according to the global address.
  • FIG. 12 is a schematic structural diagram of a management node provided by this embodiment.
  • the management node includes a processor 401 and a memory 402.
  • a program 403 is stored in the memory 402.
  • The processor 401, the memory 402, and the interface 404 are connected through a system bus 405 and communicate with each other.
  • The processor 401 is a single-core or multi-core central processing unit, an application-specific integrated circuit, or one or more integrated circuits configured to implement the embodiments of the present invention.
  • the memory 402 may be a random access memory (Random Access Memory, RAM) or a non-volatile memory (non-volatile memory), such as at least one hard disk memory.
  • The memory 402 is used to store computer-executable instructions.
  • The program 403 may be included in the computer-executable instructions.
  • When the management node runs, the processor 401 runs the program 403 to perform the following method.
  • For example: creating a memory pool to provide a service for storing data, where the memory pool includes the first memory and the at least two second memories; and controlling migration of the data from the first memory to the second memory, or from the second memory to the first memory.
  • Optionally, the method further includes acquiring state information of the memories, where the state information includes the type and capacity of the first memory and the type and capacity of the second memory. Accordingly, when creating the memory pool, the management node is specifically configured to create the memory pool according to the state information.
  • Optionally, when the management node controls the migration of the data from the first memory to the second memory, this specifically includes: instructing the first storage node to obtain the access frequency of the data, and instructing the first storage node to migrate the data to the second memory when the access frequency is lower than a set frequency threshold.
  • FIG. 13 is a schematic diagram of another structure of the management node provided by this embodiment.
  • the management node includes a creation module 501 and a control module 502.
  • the creation module 501 is configured to create a memory pool to provide a data storage service, and the memory pool includes the first memory and the at least two second memories.
  • the control module 502 is configured to control the migration of the data from the first memory to the second memory, or from the second memory to the first memory.
  • the creation module 501 is further configured to obtain state information of the memory, where the state information includes the type and capacity of the first memory and the type and capacity of the second memory.
  • the creation module 501 is specifically configured to create the memory pool according to the state information when creating the memory pool.
  • When controlling the migration of the data from the first memory to the second memory, the control module 502 is specifically configured to: instruct the first storage node to obtain the access frequency of the data, and instruct the first storage node to migrate the data to the second memory when the access frequency is lower than a set frequency threshold.
  • the functions of the creation module 501 and the control module 502 can be implemented by the processor 401 shown in FIG. 12 executing the program 403, or can be implemented by the processor 401 independently.
  • The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present invention are generated wholly or partly.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave).
  • The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more usable media.
  • The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a Digital Versatile Disc (DVD)), or a semiconductor medium (for example, a Solid State Disk (SSD)), or the like.
  • A person of ordinary skill in the art can understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium.
  • The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
  • In this application, "at least one" refers to one or more, and "multiple" refers to two or more.
  • "And/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may represent the following cases: only A exists, both A and B exist, and only B exists, where A and B may be singular or plural.
  • In the text descriptions of this application, the character "/" generally indicates an "or" relationship between the associated objects; in the formulas of this application, the character "/" indicates a "division" relationship between the associated objects.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A storage system, a memory management method, and a management node. The storage system includes a management node, one or more first memories, and one or more second memories. The management node is configured to create a memory pool to provide a service for storing data. The performance of the first memory is higher than that of the second memory, at least one of the one or more first memories is located in a first storage node, and at least one of the one or more second memories is located in a second storage node. The management node is further configured to control migration of the data between the first memory and the second memory in the memory pool. The memory pool is created based on memories of different performance, and these memories of different performance are located on different storage nodes, thereby realizing a cross-node memory pool that integrates memories of different performance.

Description

一种存储***、内存管理方法和管理节点 技术领域
本申请涉及存储技术领域,特别是一种存储***、内存管理方法和管理节点。
背景技术
伴随着存储级存储器(Storage Class Memory,SCM)产品的成熟,内存种类变得丰富,然而现在内存的使用通常仍局限于一个存储节点内部,并没有充分利用内存的性能优势。
发明内容
本申请第一方面提供了一种存储***,该存储***包括管理节点、一个或多个第一存储器和一个或多个第二存储器。所述管理节点用于创建内存池以提供存储数据的服务。所述第一存储器的性能高于所述第二存储器,其中,所述一个或多个第一存储器中至少有一个第一存储器位于第一存储节点,所述一个或多个第二存储器中至少有一个第二存储器位于第二存储节点。所述管理节点还用于控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移。
第一方面提供的内存池至少包含了以下几种情况:1)第一存储节点包含第一存储器和第二存储器,第二存储节点也包含第一存储器和第二存储器,所有的第一存储器和所有的第二存储器都是所述内存池的一部分;2)第一存储节点仅包含第一存储器,第二存储节点仅包含第二存储器,第一存储器和第二存储器都是所述内存池的一部分;3)第一存储节点包含第一存储器和第二存储器,第二存储节点仅包含第一存储器或第二存储器中的其中一种,这些第一存储器和第二存储器均是所述内存池的一部分。除此之外,所述存储***还可以包含其他存储节点,所述其他存储节点所包含的各种类型的存储器也可以为所述内存池提供存储空间。
第一存储器和第二存储器因为类型不同而导致性能有所差异,总体而言,第一存储器的性能高于第二存储器的性能,这里的性能是指存储器的运算速度和/或访问时延。举例来说,第一存储器为动态随机存取存储器,第二存储器为储存级存储器。另外,第一方面的第一存储器和第二存储器是根据类型来划分的,例如,对于动态随机存取存储器来说,无论它位于第一存储节点还是第二存储节点,都被称为第一存储器。而对于储存级存储器,无论它位于第一存储节点还是第二存储节点,都被称为第二存储器。同理,所述内存池还可以包含第三存储器、第四存储器等等。
在第一方面的存储***中,基于多种性能的存储器创建内存池,并且这些多种性能的存储器位于不同的存储节点上,从而实现了跨节点的,融合了不同性能的存储器的内存池,使得各种类型的存储器(无论是内存还是硬盘)都能够作为存储资源为上层提供存储服务,从而更好地发挥其性能优势。由于所述内存池中包含了不同性能的存储器,所以可以控制数据基于其访问频率在不同性能的存储器之间迁移,既能够在数据的访问频率较高时迁移至高性能的存储器以提高读数据的效率,也可以在数据的访问频率较低时迁移至低性能的存储器以节省高性能的存储器的存储空间。另外本申请中的内存池为计算节点或LUN提供存储空间,它改变内存资源以处理器为中心的架构。
在一种实现中,所述管理节点还用于获取存储器的状态信息,所述状态信息包括所 述第一存储器的类型和容量,以及所述第二存储器的类型和容量。内存池的容量取决于各个存储节点提供给它的存储器的容量,内存池所包含的存储器的类型取决于各个存储节点提供给它的存储器的类型。所以管理节点在创建所述内存池之前需要从各个存储节点采集存储器的状态信息。
除此之外,所述内存池可以在可用空间不足或者在存储***发现新的存储节点时进行扩容,扩容时管理节点也需要从新的存储节点采集存储器的状态信息。
在一种实现中,所述内存池的存储空间包括若干个页,一个页在内存池中的位置被称为全局地址,所述全局地址映射到所述页的物理地址,所述页的物理地址用于指示为所述页分配的物理空间在存储节点内的存储器内的位置。举例来说,所述页的尺寸为4KB,或者8KB,或者16KB等等。页的尺寸可以是固定大小,也可以根据需求配置成不同大小。具有不同大小的页的内存池,在使用时会更加灵活。
在一种实现中,所述第一存储节点和所述第二存储节点都保存有索引表,所述索引表用于记录所述页的全局地址与所述页的物理地址的映射关系。
在一种实现中,可以预先为某些页的全局地址分配好物理地址,将全局地址和物理地址的映射关系记录在所述索引表中,那么当存储节点接收到针对这些页的写数据请求时,就可以直接根据所述索引表查询到物理地址,从而将待写入数据写入所述物理地址指示的物理空间中。采用预先分配的方式,在执行写数据请求的时候可以直接将数据写入预先分配的物理空间中,由此提高写数据的效率。
在一种实现中,可以不预先为页分配物理地址,而是在存储节点接收到针对这些页的写数据请求时,再从存储器中分配物理空间,将待写入数据写入所述物理空间,并且将所述页的全局地址与物理地址的映射关系记录在所述索引表中。这种按需分配的方式,可以更加灵活地分配物理空间,需要多少就分多少,达到节省空间的目的。
任意一个存储节点对索引表有更新时,都可以将更新的内容发送给其他节点以及管理节点。使得每个存储节点都拥有完整的索引表,从而避免了信息不透明的问题。因为每个存储节点都可能接收到计算节点的读数据请求,那么就可以通过所述索引表查询到待读取数据的物理地址,获得正确的数据。
在第一方面的存储***中,关于写数据请求的执行,本申请提供了至少三种实现。
在一种实现中,第一存储节点包括IO控制器,所述IO控制器存储有所述索引表,所述IO控制器与计算节点通信,所述IO控制器存储有所述索引表,所述IO控制器与计算节点通信。所述IO控制器用于接收所述计算节点发送的第一数据和所述第一数据的第一逻辑地址,根据所述第一逻辑地址确定所述第一数据在所述内存池中的第一全局地址,根据所述索引表确定是否已经为所述第一全局地址分配有物理空间,当确定已经为所述第一全局地址分配有物理空间时,将所述第一数据写入所述第一物理地址指示的物理空间中。在这种实现中,第一存储节点和计算节点之间采用LUN语义进行通信,所述第一逻辑地址是指LUN ID,LBA和长度。
相应的,在上述实现中,如果所述IO控制器用于接收所述计算节点发送的读数据请求以读取所述第一数据,所述读数据请求携带所述第一逻辑地址,那么所述IO控制器根据所述第一逻辑地址确定所述第一数据在所述内存池中的第一全局地址,并根据所述索引表确定所述第一全局地址对应的第一物理地址,从所述第一物理地址指示的物理空间读取所述第一数据。
在一种实现中,第一存储节点和计算节点之间采用内存语义进行通信,此时,第一逻辑地址是指虚拟空间的ID,起始地址和长度。除此之外,写数据请求的执行过程和上述实现流程一致。读数据请求的执行过程也同样如此。
在一种实现中,存储***中的各个存储节点将自己本地的那部分内存池映射给计算节点,由此计算节点可以“看到”所述内存池,并获得所述内存池的全局地址。那么计算节点发送给第一存储节点的写数据请求就可以直接携带全局地址,第一存储节点中的IO控制器根据所述索引表确定所述全局地址对应的物理地址,将数据写入所述物理地址指示的物理空间中。相应的,在这种实现中,如果计算节点给第一存储节点发送读数据请求,所述读数据请求也携带的是待读取数据的全局地址,IO控制器可以直接根据索引表确定全局地址对应的物理地址。按照这种实现方式,写数据和读数据的效率都可以提高,因为减少了逻辑地址转换为全局地址的步骤。
在一种实现中,管理节点在控制数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移时具体用于,指示各个存储节点监控其本地的数据的访问频率,指示该存储节点在访问频率高于一定阈值时将该数据迁移至性能更高的存储器中;或者在访问频率低于一定阈值时,指示该存储节点将该数据迁移至性能更低的存储器中。注意,这里的迁移可以是在存储节点内迁移,也可以是跨节点的迁移。
在上种实现中,可以以页为粒度进行访问频率监控,并且以页为粒度进行数据迁移。也可以以数据项为粒度进行访问频率监控,并且以数据项为粒度进行数据迁移。数据项是比页粒度更小的单元,以数据项为粒度进行监控可以更加准确地对热点数据(或者非热点)进行识别,并且将真正的热点数据迁移至性能更高的存储器中提高读数据的效率。
在一种实现中,第一方面提供的内存还支持数据预取。具体的,在上述任意一种读数据请求的执行过程中,或者在读数据请求完成之后,将与待读取数据相关联的其他数据预取至较高性能的存储器中,相关联的数据是指逻辑地址与待读取数据的逻辑地址连续的数据。数据预取可以提高高性能的存储器的命中率,提高读数据的效率。
在一种实现中,第一方面的存储***适用于计算与存储分离的场景。换言之,计算节点独立于***中的任意一个存储节点。计算节点和存储节点的通信是通过外部网络来实现的,这种方式的好处是扩容比较方便。例如,当计算资源不足时,可以增加计算节点的数量,存储节点的数量不变。当存储资源不足时,可以增加存储节点的数量,计算节点的数量不变。
在一种实现中,第一方面的存储***适用于计算与存储融合的场景。换言之,一个计算节点和一个存储节点是集成在同一个物理设备中的,我们可以把这个集成了计算和存储的物理设备叫做存储服务器,或者也称作存储节点。在这种情况下,计算节点和存储节点的通信是通过内部网络来实现的,那么在执行读数据请求或写数据请求时,访问时延比较低。
第二方面提供了一种内存管理方法,该方法应用于存储***中,所述存储***中的管理节点或者存储节点执行其中的方法步骤,以实现第一方面的功能。
第三方面提供了一种管理节点,所述管理节点位于第一方面提供的存储***中,所述存储***包括一个或多个第一存储器和一个或多个第二存储器。所述管理节点包括创建模块控制模块。所述创建模块用于创建内存池以提供存储数据的服务,所述内存池包括所述一个或多个第一存储器以及所述一个或多个第二存储器,所述第一存储器的性能 高于所述第二存储器,其中,所述一个或多个第一存储器中至少有一个第一存储器位于第一存储节点,所述一个或多个第二存储器中至少有一个第二存储器位于第二存储节点。所述控制模块用于控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移。
第三方面提供的管理节点基于多种性能的存储器创建内存池,并且这些多种性能的存储器位于不同的存储节点上,从而实现了跨节点的,融合了不同性能的存储器的内存池,使得各种类型的存储器(无论是内存还是硬盘)都能够作为存储资源为上层提供存储服务,从而更好地发挥其性能优势。由于所述内存池中包含了不同性能的存储器,所以可以控制数据基于其访问频率在不同性能的存储器之间迁移,既能够在数据的访问频率较高时迁移至高性能的存储器以提高读数据的效率,也可以在数据的访问频率较低时迁移至低性能的存储器以节省高性能的存储器的存储空间。另外本申请中的内存池为计算节点或LUN提供存储空间,它改变内存资源以处理器为中心的架构。
在一种实现中,所述创建模块还用于获取存储器的状态信息,所述状态信息包括所述第一存储器的类型和容量,以及所述第二存储器的类型和容量。所述创建模块具体用于根据所述状态信息创建所述内存池。
在一种实现中,所述内存池的存储空间包括若干个页,一个页在内存池中的全局地址映射到所述页的物理地址;其中,所述页的全局地址用于指示所述页在所述内存池中的位置,所述页的物理地址用于指示为所述页分配的物理空间在存储节点内的存储器内的位置。
在一种实现中,所述数据存储在所述第一存储器,所述控制模块具体用于指示所述第一存储节点获取所述数据的访问频率;以及指示所述第一存储节点在所述访问频率低于设定的频率阈值时将所述数据迁移至所述内存池中的第二存储器。
第四方面提供了一种管理节点,所述管理节点位于第一方面提供的存储***中,所述存储***包括一个或多个第一存储器和一个或多个第二存储器。所述管理节点包括接口和处理器。所述处理器用于创建内存池以提供存储数据的服务,所述内存池包括所述一个或多个第一存储器以及所述一个或多个第二存储器,所述第一存储器的性能高于所述第二存储器,其中,所述一个或多个第一存储器中至少有一个第一存储器位于第一存储节点,所述一个或多个第二存储器中至少有一个第二存储器位于第二存储节点。所述处理器还用于控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移。所述接口用于与所述第一存储节点以及所述第二存储节点通信。
第四方面提供的管理节点基于多种性能的存储器创建内存池,并且这些多种性能的存储器位于不同的存储节点上,从而实现了跨节点的,融合了不同性能的存储器的内存池,使得各种类型的存储器(无论是内存还是硬盘)都能够作为存储资源为上层提供存储服务,从而更好地发挥其性能优势。由于所述内存池中包含了不同性能的存储器,所以可以控制数据基于其访问频率在不同性能的存储器之间迁移,既能够在数据的访问频率较高时迁移至高性能的存储器以提高读数据的效率,也可以在数据的访问频率较低时迁移至低性能的存储器以节省高性能的存储器的存储空间。另外本申请中的内存池为计算节点或LUN提供存储空间,它改变内存资源以处理器为中心的架构。
在一种实现中,所述处理器还用于通过所述接口获取存储器的状态信息,所述状态信息包括所述第一存储器的类型和容量,以及所述第二存储器的类型和容量。所述处理 器具体用于根据所述状态信息创建所述内存池。
在一种实现中,所述内存池的存储空间包括若干个页,一个页在内存池中的全局地址映射到所述页的物理地址;其中,所述页的全局地址用于指示所述页在所述内存池中的位置,所述页的物理地址用于指示为所述页分配的物理空间在存储节点内的存储器内的位置。
在一种实现中,所述数据存储在所述第一存储器,所述处理器具体用于指示所述第一存储节点获取所述数据的访问频率;以及指示所述第一存储节点在所述访问频率低于设定的频率阈值时将所述数据迁移至所述内存池中的第二存储器。
第五方面提供了一种计算机可读存储介质,所述存储介质中存储有程序指令,所述程序指令用于执行以下方法:创建内存池以提供存储数据的服务,所述内存池包括一个或多个第一存储器以及一个或多个第二存储器,所述第一存储器的性能高于所述第二存储器,其中,所述一个或多个第一存储器中至少有一个第一存储器位于第一存储节点,所述一个或多个第二存储器中至少有一个第二存储器位于第二存储节点;以及控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移。
在一种实现中,还包括获取存储器的状态信息,所述状态信息包括所述第一存储器的类型和容量,以及所述第二存储器的类型和容量。所述创建内存池具体包括:基于所述状态信息创建所述内存池。
在一种实现中,所述内存池的存储空间包括若干个页,一个页在内存池中的全局地址映射到所述页的物理地址;其中,所述页的全局地址用于指示所述页在所述内存池中的位置,所述页的物理地址用于指示为所述页分配的物理空间在存储节点内的存储器内的位置。
在一种实现中,所述控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移具体包括:指示所述第一存储节点获取所述数据的访问频率;指示所述第一存储节点在所述访问频率低于设定的频率阈值时将所述数据迁移至所述第二存储器。
第六方面提供了一种计算程序产品,所述计算机程序产品包括:计算机程序代码,当所述计算机程序代码被运行时,使得上述各方面中由管理节点或者计算节点执行的方法被执行。
附图说明
图1为本实施例提供的存储***的***架构示意图;
图2是本实施例提供的存储节点的结构示意图;
图3是本实施例提供的IO控制器的结构示意图;
图4是本实施例提供的内存池的架构示意图;
图5是本实施例提供的内存池所包含各层级的存储器的示意图;
图6是本实施例提供的另一种内存池的架构示意图;
图7是本实施例提供的另一种内存池的架构示意图;
图8是本实施例提供的存储池的架构示意图;
图9是本实施例提供的写数据方法的流程示意图;
图10是本实施例提供的另一种写数据方法的流程示意图;
图11是本实施例提供的读数据方法的流程示意图;
图12是本实施例提供的管理节点的一种结构示意图;
图13是本实施例提供的管理节点的另一种结构示意图。
具体实施方式
下面将结合附图,对本发明实施例中的技术方案进行描述。
本发明实施例描述的网络架构以及业务场景是为了更加清楚的说明本发明实施例的技术方案,并不构成对于本发明实施例提供的技术方案的限定,本领域普通技术人员可知,随着网络架构的演变和新业务场景的出现,本发明实施例提供的技术方案对于类似的技术问题,同样适用。
本实施例提供的存储***包括计算节点集群和存储节点集群。其中,计算节点集群包括一个或多个计算节点100(图1中示出了两个计算节点100,但不限于两个计算节点100)。计算节点100是用户侧的一种计算设备,如服务器、台式计算机等。在硬件层面,计算节点100中设置有处理器和内存(图1中未示出)。在软件层面,计算节点100上运行有应用程序(application)101(简称应用)和客户端程序102(简称客户端)。应用101是对用户呈现的各种应用程序的统称。客户端102用于接收由应用101触发的数据访问请求,并且与存储节点20交互,向存储节点20发送所述数据访问请求。客户端102还用于接收来自存储节点的数据,并向应用101转发所述数据。可以理解的是,当客户端102是软件程序时,客户端102的功能由计算节点100所包含的处理器运行内存中的程序来实现。客户端102也可以由位于计算节点100内部的硬件组件来实现。计算节点集群中的任意一个客户端102可以访问存储节点集群中的任意一个存储节点20。
存储节点集群包括一个或多个存储节点20(图1中示出了三个存储节点20,但不限于三个存储节点20),各个存储节点20之间可以互联。存储节点如服务器、台式计算机或者存储阵列的控制器、硬盘框等。在功能上,存储节点20主要用于对数据进行计算或处理等。另外,所述存储节点集群还包括管理节点(图1未示出)。管理节点用于创建并管理内存池。各个存储节点20从存储节点中选举出一个节点让它承担管理节点的职能。管理节点可以与任意一个存储节点20通信。
在硬件上,如图1所示,存储节点20至少包括处理器、存储器和IO控制器201。其中,处理器202是中央处理器(central processing unit,CPU),用于处理来自存储节点20外部的数据,或者存储节点20内部生成的数据。存储器,是指用于存储数据的装置,它可以是内存,也可以是硬盘。内存是指与处理器直接交换数据的内部存储器,它可以随时读写数据,而且速度很快,作为操作***或其他正在运行中的程序的临时数据存储器。内存包括至少两种存储器,例如内存既可以是随机存取存储器,也可以是只读存储器(Read Only Memory,ROM)。举例来说,随机存取存储器可以是动态随机存取存储器(Dynamic Random Access Memory,DRAM),也可以是存储级存储器(Storage Class Memory,SCM)。DRAM是一种半导体存储器,与大部分随机存取存储器(Random Access Memory,RAM)一样,属于一种易失性存储器(volatile memory)设备。SCM是一种同时结合传统储存装置与存储器特性的复合型储存技术,存储级存储器能够提供比硬盘更快速的读写速度,但运算速度上比DRAM慢,在成本上也比DRAM更为便宜。
然而,DRAM和SCM在本实施例中只是示例性的说明,内存还可以包括其他随机存取存储器,例如静态随机存取存储器(Static Random Access Memory,SRAM)等。而对于只读存储器,举例来说,可以是可编程只读存储器(Programmable Read Only Memory, PROM)、可抹除可编程只读存储器(Erasable Programmable Read Only Memory,EPROM)等。另外,内存还可以是双列直插式存储器模块或双线存储器模块(Dual In-line Memory Module,简称DIMM),即由动态随机存取存储器(DRAM)组成的模块。在图2以及后面的描述中,均以DRAM和SCM为例进行说明,但不代表存储节点20不包含其他类型的存储器。
本实施例中的存储器还可以是硬盘。与内存203不同的是,硬盘读写数据的速度比内存慢,通常用于持久性地存储数据。以存储节点20a为例,其内部可以设置一个或多个硬盘;或者,在存储节点20a的外部还可以挂载一个硬盘框(如图2所示),在硬盘框中设置多个硬盘。无论哪一种部署方式,这些硬盘都可以视作存储节点20a所包含的硬盘。硬盘类型为固态硬盘、机械硬盘,或者其他类型的硬盘。类似的,存储节点集群中的其他存储节点,如存储节点20b、存储节点20c也可以包含各种类型的硬盘。一个存储节点20中可以包含一个或多个同一种类型的存储器。
本实施例中的内存池中所包括的硬盘,也可以具有内存接口,处理器可以直接对其进行访问。
请参考图2,图2是存储节点20的内部结构示意图。在实际应用中,存储节点20可以是服务器也可以是存储阵列。如图2所示,除了处理器和存储器之外,存储节点20还包括IO控制器。由于内存的访问时延很低,操作***调度以及软件本身的开销就可能成为数据处理的瓶颈。为了减少软件开销,本实施例引入硬件组件IO控制器,用于将IO的访问进行硬件化,降低CPU调度和软件栈的影响。
首先,存储节点20有自己的IO控制器22,用于与计算节点100通信,还用于与其他存储节点通信。具体的,存储节点20可通过IO控制器22接收来自计算节点100的请求,或者通过IO控制器22向计算节点100发送请求,存储节点20也可以通过IO控制器22向存储节点30发送请求,或者通过IO控制器22接收来自存储节点30的请求。其次,存储节点20内的各个内存之间可以通过IO控制器22通信,也可以通过IO控制器22与计算节点100通信。最后,如果存储节点20所包含的各个硬盘位于存储节点20内部,那么这些硬盘之间可以通过IO控制器22通信,也可以通过IO控制器22与计算节点100通信。如果所述各个硬盘位于存储节点20外接的硬盘框中,那么硬盘框中设置有IO控制器24,IO控制器24用于与IO控制器22通信,硬盘可以通过IO控制器24向IO控制器22发送数据或指令,也可以通过IO控制器24接收IO控制器22发送的数据或指令。另外,存储节点20还可以包括总线(图2中未示出)用于存储节点20内部各组件之间的通信。
请参考图3,图3是IO控制器的结构示意图,以IO控制器22为例,它包括通信单元220和计算单元2211,其中,通信单元提供高效的网络传输能力,用于外部或内部通信,这里以网络接口控制器(network interface controller,NIC)为例。计算单元是一个可编程的电子部件,用于对数据进行计算处理等,本实施例以数据处理单元(data processing unit,DPU)为例予以说明。DPU具有CPU的通用性和可编程性,但更具有专用性,可以在网络数据包,存储请求或分析请求上高效运行。DPU通过较大程度的并行性(需要处理大量请求)与CPU区别开来。可选的,这里的DPU也可以替换成图形处理单元(graphics processing unit,GPU)、嵌入式神经网络处理器(neural-network processing units,NPU)等处理芯片。DPU用于提供针对内存池的数据卸载服务,例如地址索引或地址查询功能, 分区功能以及对数据进行过滤、扫描等操作。当IO请求通过NIC进行存储节点20之后,直接在计算单元221上进行处理,绕过了存储节点20内部的CPU和操作***,压薄了软件栈,减少了CPU调度的影响。以执行读IO请求为例,当NIC接收计算节点100发送的所述读IO请求后,DPU可直接在索引表中查询所述读IO请求所述对应的信息。另外IO控制器22还包括DRAM 222,DRAM 222在物理上和图2所描述的DRAM一致,只是这里的DRAM 222是属于IO控制器22自己的内存,用于临时存储经过IO控制器22的数据或指令,它不构成内存池的一部分。另外,IO控制器22还可以将DRAM 222映射至计算节点100,使得DRAM 222的空间对计算节点100可见,由此将IO访问转化为基于内存语义的访问。IO控制器24的结构与功能与IO控制器22类似,这里不再一一展开描述。
下面介绍本实施例提供的内存池,图4是内存池的架构示意图。所述内存池中包含有多种不同类型的存储器,每一种类型的存储器可以看作一个层级。每个层级的存储器的性能与其他层级的存储器的性能不同。本申请中的存储器的性能主要从运算速度和/或访问时延等方面进行考量。图5是本实施例提供的内存池所包含各层级的存储器的示意图。由图5所示,内存池由各个存储节点20中的存储器组成,其中,各存储节点中的DRAM位于所述内存池的第一层级,因为DRAM在各种类型的存储器性能最高。SCM的性能低于DRAM,因此各个存储节点中的SCM位于所述内存池的第二层级。进一步地,硬盘的性能低于SCM,所以各个存储节点中的硬盘位于所述内存池的第三层级。虽然图3中只示出了三种类型的存储器,但是按照前面的描述,在产品实践中存储节点20内部可部署多种不同类型的存储器,即各种类型的内存或硬盘均可以成为内存池的一部分,并且位于不同存储节点上的相同类型的存储器在所述内存池中属于同一层级。本申请不对内存池所包含的存储器的类型,以及层级的数量做任何限定。内存池的层级只是内部的划分,对上层应用来说是不感知的。需要注意的是,虽然各个存储节点上的同一种存储器处于同一层级,但是对某一个存储节点来说,使用其本地的DRAM的性能高于使用其他存储节点的DRAM的性能,同理,使用本地的SCM的性能高于使用其他存储节点的SCM的性能,等等。因此,对该存储节点来说,当需要分配某一层级的内存空间时,会优先分配其本地属于该层级的空间给用户,当本地空间不足时再从其他存储节点的同一层级中分配。
在图4或图5所示的内存池包含了存储节点中的所有类型的存储器,然而在其他实施方式中,如图6所示,内存池可以只包含部分类型的存储器,例如只包含较高性能的存储器,例如DRAM和SCM,而排除硬盘等性能相对较低的存储器。
本实施例提供的另一种内存池的网络架构可参考图7,如图7所示,在这种网络架构中,存储节点和计算节点集成在同一个物理设备中,本实施例中将所述集成的设备统一称为存储节点。应用部署在存储节点20内部,所以应用可直接通过存储节点20中的客户端触发写数据请求或读数据请求,由该存储节点20处理,或者发送给其他存储节点20处理。此时,客户端向本地存储节点20发送的读写数据请求具体是指客户端向处理器发送数据访问请求。除此之外,存储节点20所包含的组件及其功能与图6中的存储节点20类似,这里不再赘述。与图4至图6任一所示的内存池类似,这种网络架构下的内存池可以包含了存储节点中的所有类型的存储器,也可以只包含部分类型的存储器,例如只包含较高性能的存储器,例如DRAM和SCM,而排除硬盘等性能相对较低的存储器(如 图7所示)。
另外,在图4至图7所示的内存池中,并非存储节点集群中的每一个存储节点都必须为所述内存池贡献存储空间,所述内存池可以仅覆盖集群中的部分存储节点。在某些应用场景中,存储节点集群中还可以创建两个或两个以上的内存池,每个内存池覆盖多个存储节点,由这些存储节点为该内存池提供存储空间。不同内存池所述占用的存储节点可以重复也可以不重复。总而言之,本实施例中的内存池是在至少两个存储节点中建立的,其包含的存储空间来源于至少两种不同类型的存储器。
当内存池仅包含存储集群中的较高性能的存储器(例如DRAM和SCM)时,管理节点还可以将存储集群中的较低性能的存储器(例如硬盘)构建成一个存储池。图8以图6所示的网络架构为例对存储池进行了描述,与内存池类似,图8所示的存储池也跨越了至少两个存储节点,其存储空间由所述至少两个存储节点中的一种或多种类型的硬盘构成。当存储集群中既包含内存池又包含存储池时,存储池用于持久化地存储数据,特别是访问频率较低的数据,而内存池则用于临时存储数据,特别是访问频率较高的数据。具体的,当内存池中存储的数据的数据量到达设定的阈值时,所述内存池中的部分数据将会被写入存储池中存储。可以理解的是,存储池也可以建立在图7所示的网络架构中,其实现原理与上面的描述类似。然而,存储池并非本申请讨论的重点,接下来继续讨论内存池。
关于内存池的创建。各个存储节点20通过心跳通道定期向管理节点上报存储器的状态信息。管理节点可以有一个,也可以部署多个。它可以作为独立节点部署在存储节点集群中,也可以和存储节点20联合部署。换言之,由某一个或多个存储节点20承担管理节点的职能。存储器的状态信息包括但不限于:该存储节点所包含的各种存储器的类型、健康状态、每种存储器的总容量以及可用容量等等。管理节点根据采集的信息创建内存池,创建意味着将各个存储节点20所提供的存储空间集中起来作为内存池统一管理,因此内存池的物理空间来源于各个存储节点所包含的各种存储器。然而,在某些场景下,存储节点20可以根据自身情况,例如存储器的健康状态,选择性地提供存储器给内存池。换言之,有可能某些存储节点中的某些存储器并非内存池的一部分。
采集完信息之后,管理节点需要对被纳入所述内存池的存储空间进行统一编址。经过统一编址,所述内存池的每段空间都有一个唯一的全局地址。所谓全局地址,它所指示的空间在内存池中是唯一的,并且每个存储节点20都知道该地址的含义。在给内存池的一段空间分配了物理空间之后,该空间的全局地址就拥有了其对应的物理地址,所述物理地址指示了该全局地址所代表的空间实际位于哪个存储节点的哪个存储器上,以及在该存储器中的偏移量,即物理空间的位置。这里的每段空间是指“页”,后面会详细介绍。在实际应用中,为了保证数据的可靠性,往往采用纠删码(erasure coding,EC)校验机制或者多副本机制来实现数据冗余。EC校验机制是指将数据划分为至少两个数据分片,按照一定的校验算法计算所述至少两个数据分片的校验分片,当其中一个数据分片丢失时,可以利用另一个数据分片以及校验分片恢复数据。那么,对于所述数据而言,其全局地址是多个细粒度的全局地址的集合,每个细粒度的全局地址对应了一个数据分片/校验分片的物理地址。多副本机制是指存储至少两份相同的数据副本,并且,这至少两份数据副本存储在两个不同的物理地址中。当其中一份数据副本丢失时,可以使用其 他数据副本恢复。因此,对于所述数据而言,其全局地址也是多个更细粒度的全局地址的集合,每个细粒度的全局地址对应了一个数据副本的物理地址。
管理节点可以在创建内存池之后为各个全局地址分配物理空间,也可以在接收写数据请求时为所述写数据请求所对应的全局地址分配物理空间。各个全局地址与其物理地址之间的对应关系被记录在索引表中,管理节点将所述索引表同步给各个存储节点20。各个存储节点20存储所述索引表,以便后续读写数据时根据索引表查询全局地址对应的物理地址。
在某些应用场景中,内存池并非直接将其存储空间暴露给计算节点100,而是将存储空间虚拟化为逻辑单元(logical unit,LU)提供给计算节点100使用。每个逻辑单元具有唯一的逻辑单元号(logical unit number,LUN)。由于计算节点100能直接感知到逻辑单元号,本领域技术人员通常直接用LUN代指逻辑单元。每个LUN具有LUN ID,用于标识所述LUN。此时,内存池以页为粒度为LUN提供存储空间,换句话说,当存储节点20向内存池申请空间时,内存池以一个页或者页的整数倍为其分配空间。一个页的大小可以是4KB,也可以是8KB等等,本申请不对页的大小进行限定。数据位于一个LUN内的具***置可以由起始地址和该数据的长度(length)确定。对于起始地址,本领域技术人员通常称作逻辑块地址(logical block address,LBA)。可以理解的是,LUN ID、LBA和length这三个因素标识了一个确定的地址段,一个地址段可以索引到一个全局地址。为了保证数据均匀存储在各个存储节点20中,计算节点100通常采用分布式哈希表(Distributed Hash Table,DHT)方式进行路由,按照分布式哈希表方式,将哈希环均匀地划分为若干部分,每个部分称为一个分区,一个分区对应一个上面描述的地址段。计算节点100向存储节点20发送的数据访问请求,都会被定位到一个地址段上,例如从所述地址段上读取数据,或者往所述地址段上写入数据。
在上面描述的应用场景中,计算节点100与存储节点20之间利用LUN语义进行通信。在另一种应用场景中,计算节点100与存储节点20之间利用内存语义进行通信。此时,IO控制器22将其DRAM的空间映射给计算节点100,使得计算节点100可以感知到所述DRAM的空间(本实施例中将其称为虚拟空间),对所述虚拟空间进行访问。在这种场景中,计算节点100发送给存储节点20的读/写数据请求不再携带LUN ID、LBA和length,而是其他逻辑地址,例如虚拟空间ID、虚拟空间的起始地址以及长度。在另一种应用场景中,IO控制器22可以将它管理的内存池中的空间映射给计算节点100,使得计算节点100可以感知到这部分空间,并且获得这部分空间所对应的全局地址。例如,存储节点20a中的IO控制器22用于管理内存池中由存储节点20a提供的存储空间,存储节点20b中的IO控制器22用于管理内存池中由存储节点20b提供的存储空间,存储节点20c中的IO控制器22用于管理内存池中由存储节点20c提供的存储空间,等等。因此整个内存池对计算节点100来说是可见的,那么计算节点100在向存储节点发送待写入数据时,可以直接指定该数据的全局地址。
下面以应用向内存池申请存储空间为例说明空间分配流程。一种情况下,应用是指存储节点的内部服务,例如,存储节点20a内部生成一个内存申请指令,所述内存申请指令包括申请的空间大小以及存储器的类型。为了便于理解,这里假设申请的空间为16KB,存储器的类型为SCM。简而言之,申请的空间大小是由存储的数据的大小决定的,而申请的存储器的类型是由该数据的冷热信息决定的。存储节点20a从存储的索引表中 获取一段空闲的全局地址,例如地址区间为[000001-000004],其中,地址为000001的空间为一个页。所谓空闲的全局地址,是指该全局地址尚未被任何数据占用。然后,存储节点20a查询本地的SCM是否拥有16KB的空闲空间,若有,则从本地分配空间给所述全局地址,若否,则继续查询其他存储节点20的SCM是否拥有16KB的空闲空间,该步骤可以通过向其他存储节点20发送查询指令来实现。由于其他存储节点20与存储节点20a在距离上有远近之分,为了降低时延,存储节点20a在本地不能支撑分配16KB的空闲空间的情况下,可以优先向距离近的存储节点20查询。待获得物理地址之后,存储节点20a将所述全局地址与所述物理地址的对应关系记录在索引表中,并将所述对应关系同步到其他存储节点。在确定物理地址之后,存储节点20a就可以使用所述物理地址对应的空间存储数据了。另一种情况下,应用是指计算节点100中的应用101,这种情况下,内存申请指令则是计算节点100生成之后发送给存储节点20a的。那么,用户可以通过计算节点100指定申请的空间大小以及存储器的类型。
上述索引表的作用主要在于记录全局地址与分区ID的对应关系、以及全局地址与物理地址的对应关系,除此之外,还可以用于记录数据的属性信息。例如,全局地址为000001的数据所具有的冷热信息或数据常驻策略等等。后续可以根据这些属性信息实现数据在各种存储器之间的迁移,或者进行属性设置等。应理解,数据的属性信息只是索引表的一个可选项,并非必须记录。
有新的存储节点加入存储节点集群时,管理节点收集节点更新信息,将新的存储节点纳入内存池,对该存储节点所包含的存储空间进行编址,从而生成新的全局地址,再刷新分区与全局地址之间的对应关系(因为无论是扩容还是缩容,分区的总数是不变的)。扩容也同样适用于某些存储节点增加了内存或硬盘的情况,管理节点定期收集各个存储节点所包含的存储器的状态信息,如果的新的存储器加入,将其纳入内存池,并对新的存储空间进行编址,从而生成新的全局地址,再刷新分区与全局地址之间的对应关系。同理,本实施例提供的内存池也支持缩容,只要更新全局地址与分区的对应关系即可。
本实施例提供的内存池中的各个存储器均提供内存接口给处理器,使得处理器看到的是一段连续的空间,可以直接对内存池中的存储器进行读写操作。
在本实施例的存储***中,基于多种性能的存储器创建内存池,并且这些多种性能的存储器位于不同的存储节点上,从而实现了跨节点的,融合了不同性能的存储器的内存池,使得各种类型的存储器(无论是内存还是硬盘)都能够作为存储资源为上层提供存储服务,从而更好地发挥其性能优势。由于所述内存池中包含了不同性能的存储器,所以可以控制数据基于其访问频率在不同性能的存储器之间迁移,既能够在数据的访问频率较高时迁移至高性能的存储器以提高读数据的效率,也可以在数据的访问频率较低时迁移至低性能的存储器以节省高性能的存储器的存储空间。另外本申请中的内存池为计算节点或LUN提供存储空间,它改变内存资源以处理器为中心的架构。
下面介绍执行一种写数据方法的过程。请参考图9,图9是本实施例提供的执行该方法的流程示意图。如图9所示,该方法包括如下步骤。
S101,计算节点100发送给存储节点一个写数据请求,所述写数据请求中携带待写入数据以及所述待写入数据的逻辑地址。在LUN语义的应用场景中,所述逻辑地址包括LUN ID、LBA和length,在内存语义的应用场景中,所述逻辑地址包括虚拟空间ID、虚拟空间的起始地址和length。存储节点的通信单元220接收所述写数据请求后,将所述写 数据请求存储在DRAM 222中。
S102,计算单元221从DRAM 222中获取所述写数据请求,以所述逻辑地址为输入,按照一定算法输出key,所述key可以唯一定位到一个分区ID。
S103,计算单元221在索引表中查询所述分区ID对应的全局地址。
S104,计算单元221判断是否为所述全局地址分配物理地址,若否执行S105:为所述全局地址分配物理空间,并创建所述全局地址与物理地址之间的对应关系。具体的分配方式可参考前面的空间分配流程。若判断的结果为已经为所述全局地址分配了物理地址,则执行S106。
如果采用多副本机制保证数据可靠性,那么意味着所述待写入数据需要存储多份副本在存储集群中,每份副本存储在不同的物理地址。每份副本的写入过程类似,所以这里以写入一份副本为例予以说明。
S106,计算单元221将所述待写入数据写入所述物理地址指示的物理空间的位置中。所述物理地址指示了所述物理空间位于的存储节点,所述存储节点内的存储器,以及所述存储器内的偏移量,所以IO控制器22可以直接根据该地址进行存储。举例来说,如果所述物理地址指示的物理空间位于存储节点内的SCM中,由IO控制器22执行数据写入的动作。如果所述物理地址指示的物理空间位于存储节点内的硬盘中,那么计算单元221通知通信单元220,由通信单元220将所述写数据请求发送给IO控制器24,由IO控制器24执行数据写入的动作。如果所述物理地址所指示的物理空间位于其他存储节点上,则计算单元221通知通信单元220,由通信单元220将所述待写入数据发送给其他存储节点,并指示该节点将所述待写入数据写入所述物理地址指示的物理空间的位置中。
如果采用EC校验机制,那么在上述流程中,计算单元221从DRAM 222获取所述写数据请求中的所述待写入数据,将所述待写入数据划分成多个数据分片,计算生成所述多个数据分片的校验分片。每个数据分片或者校验分片都具有自己的逻辑地址,所述逻辑地址是所述写数据请求中携带的逻辑地址的子集。对于每个数据分片/校验分片,计算单元221以它的逻辑地址为输入,按照一定算法输出key,所述key可以唯一定位到一个分区ID。计算单元221在索引表中查询所述分区ID对应的全局地址,并进一步获取全局地址对应的物理地址,再将每个数据分片或校验分片存储在物理地址指示的空间的位置中。
本实施例提供的另一种写数据方法,在该方法中,各个存储节点20中的IO控制器22将其管理的内存池的全局地址提供给计算节点100,使得计算节点100可以感知到内存池的空间,并且通过全局地址来访问存储节点20。此时,计算节点100发送给存储节点20的写数据请求中携带的是全局地址,而非逻辑地址了。请参考图10,图10是本实施例提供的执行该方法的流程示意图。如图10所示,该方法包括如下步骤。
S301,计算节点100发送给存储节点20写数据请求,所述写数据请求中携带待写入数据以及所述待写入数据的全局地址。在计算节点100内部保存有一张关于内存池的全局地址的位图,所述位图记录了内存池中若干个页所对应的全局地址,以及该页的使用情况。例如某个页的全局地址对应的记录是“1”,那么代表该页已经存储了数据,如果某个页的全局地址对应的记录是“0”,那么代表该页尚未存储数据,是一个空闲页。因此,计算节点可以根据所述位图获知哪些全局地址所指示的存储空间已经存储有数据 了,以及哪些全局地址所指示的存储空间是空闲的,在发送写数据请求时可以选择空闲的页的全局地址,将其携带在所述写数据请求中。具体的,存储节点20在执行完毕一个写数据请求之后,会给计算节点100发送一个响应消息,计算节点100可以根据所述响应消息,在所述位图中对该请求所对应的页的全局地址进行标记(设置为“1”)。存储节点20的通信单元220所述写数据请求后,将所述写数据请求存储在DRAM 222中。
另外,由图1所示,存储节点集群中包含多个存储节点20,那么计算节点100在发送写数据请求时需要根据全局地址选择一个特定的存储节点20,由前面的描述可知,全局地址对应有物理地址,物理地址指明了所述全局地址指示的空间位于哪个存储节点的哪个存储器上。因此,对某一个特定的存储节点20来说,它只能管理它自己所拥有的存储器所对应的全局地址,对所述全局地址执行写入数据或读取数据的操作。如果该存储节点20接收到的待写入其他存储节点的数据,可以将其转发给其他存储节点,但这样处理时延会比较大。
为了降低访问时延,管理节点对内存池进行编址时,可以在全局地址中嵌入一个或多个字节,该字节用于指示全局地址所指示的空间位于哪个存储节点。或者根据一定算法进行编址,使得每个全局地址对应一个唯一的存储节点。由此,计算节点100可以识别出全局地址所对应的存储节点,直接将写请求发送给该存储节点处理。
S302,计算单元221从DRAM 222中获取所述写数据请求,判断是否为所述全局地址分配物理地址,若否执行S303:为所述全局地址分配物理空间,并创建所述全局地址与物理地址之间的对应关系。具体的分配方式可参考前面的空间分配流程。若判断的结果为已经为所述全局地址分配了物理地址,则执行S304。
S304,计算单元221将所述待写入数据写入所述物理地址指示的物理空间中。该步骤可参考图9中S106的描述,这里不再赘述。
另外,与图9描述的过程类似,在本实施例中同样可以采用多副本机制或EC校验机制来保证数据可靠性,该部分也可以参考图9的描述。当数据初次写入存储节点集群时,通常都存储在内存池的DRAM层。随着数据的访问频率逐渐变低,或者DRAM层的空间容量逐渐变小,存储节点集群会内部触发数据迁移,该过程对计算节点集群来说不感知的。数据迁移策略保存在管理节点中,由管理节点根据数据迁移策略控制数据迁移操作。数据迁移策略包括但不限于:执行数据迁移操作的触发条件,例如周期性地执行,或者当一定条件满足时执行。这里的一定条件可以是数据的访问频率高于与低于设定阈值,也可以是数据所位于的存储器的可用容量高于或低于设定阈值等等。所谓“控制”是指:管理节点指示各个存储节点20监控其存储的数据的访问频率,以及指示存储节点20对其存储的数据在各个存储器之间进行迁移。
除了定期触发数据迁移操作之外,计算节点100在向存储节点发送写数据请求,可以在所述写数据请求中携带该数据的冷热信息(用于指示其访问频率),存储节点在执行所述写数据请求时,一种执行方式是:首先将该数据写入DRAM,然后立即根据所述冷热信息执行数据迁移操作,以将该数据从DRAM迁入与其冷热信息匹配的存储器。或者,存储节点也可以根据该数据的元数据结构或者逻辑地址等获得该数据的冷热信息,再根据所述冷热信息执行数据迁移操作。另一种执行方式是,存储节点直接根据所述冷热信息确定与其冷热信息匹配的存储器,通过IO控制器直接将所述数据写入所述 存储器中。
另外,计算节点100也可以在写数据请求中指定该数据的常驻策略,常驻策略是指某一种类型的数据需要长期存储在某种类型的存储器中。对于这类数据,一旦存储在指定的存储器中了,无论其访问频率是升高还是降低,都不会对其执行数据迁移操作。
以位于DRAM层的目标数据为例,假设所述目标数据位于存储节点20a,存储节点20a将定期统计所述目标数据的访问频率,当所述访问频率低于DRAM层的访问阈值时,将所述目标数据迁移至SCM层或其他层。一种可选方案是,内存池的每层存储器具有一个访问阈值区间,当数据的访问频率高于该区间的最高值或数据的访问频率低于该区间的最低值时,则意味着该数据需要被迁移至与其访问频率匹配的层。另一种可选方案是,不设置每层存储器的访问阈值区间,仅将所述访问频率与设定的访问阈值进行比较,当低于所述访问阈值时则意味需要将性能更低的层。继续以所述目标数据为例,如果所述目标数据当前的访问频率跌落至硬盘层的访问频率区间内,则首先判断存储节点20a本地的硬盘是否具有空闲空间,若有,则将其迁移至存储节点20a本地的硬盘;否则将所述目标数据发送给其他存储节点,例如存储节点20b,并指示存储节点20b将所述目标数据写入其硬盘中。迁移前后,所述目标数据的全局地址不会发生变化,因为上层应用是不感知的,发生变化的是所述目标数据的物理地址。迁移完成后,各个存储节点20均在自己的索引表中更新所述目标数据的全局地址与物理地址的对应关系。
除了按照数据的访问频率(又称为冷热程度)实现数据在各层之间的迁移之外,另一种迁移策略为根据各层的可用容量来实现数据迁移。我们知道越高层级的存储器的性能越好,成本也越高,其存储空间比低层的存储器更为珍贵。例如,当DRAM层的可用容量低于设定的容量阈值时,则DRAM层需要将其存储的一部分数据迁移至SCM层或其他层,以便腾挪出更多的空间容纳新写入的数据。至于选择哪部分数据往低层级存储器迁移,可以参考现有的内存淘汰算法,这里不再展开描述。同理,SCM层或者其他层也具有自己的容量阈值,当该层的可用容量低于所述容量阈值,则将存储的一部分数据迁移至其他层。
前面提到,内存池以页为粒度对外提供存储空间。因此,在统计数据的访问频率时也可以以页为单位来统计,那么相应的,也是以页为单位实现各层级之间的数据迁移。然而,在产品实践中,应用往往需要在页之上再分配更小粒度的对象,比如一个数据项(item)。如果页的大小是4KB,那么一个数据项的大小是1KB、2KB或3KB(只要比页的尺寸更小即可)这样,以页粒度进行访问频率统计必然存在较大的不准确性,可能一个页中某些数据项是经常访问的,但是此页中其它的数据项却是几乎没有怎么访问过,如果按照页粒度的页面统计,将会使得此页会常驻到DRAM或者SCM介质上,导致大量空间的浪费。因此本实施例还提供了数据项粒度的访问频率统计,以数据项为粒度进行数据迁移,然后聚合冷热页面,这样可以实现更为高效的换入和换出性能。
接下来介绍执行读数据请求方法的过程。请参考图11,图11是本实施例提供的执行读数据请求方法的流程示意图。如图11所示,该方法包括如下步骤。
S201,计算节点100发送给存储节点一个读数据请求,所述读数据请求中携带待读取数据的逻辑地址,存储节点的IO控制器22接收所述读数据请求。在LUN语义的应用场景中,所述逻辑地址包括LUN ID、LBA和length,在内存语义的应用场景中,所述逻辑地址包括虚拟空间ID、虚拟空间的起始地址和length。存储节点的通信单元220接 收所述读数据请求后,将所述写数据请求存储在DRAM 222中。
S202,计算单元221从DRAM 222中获取所述读数据请求,以所述逻辑地址为输入,按照一定算法输出key,所述key可以唯一定位到一个分区ID。
S203,计算单元221在索引表中查询所述分区ID对应的全局地址。
S204,计算单元221在索引表中查询所述全局地址对应的物理地址。
S205,计算单元221从所述物理地址指示的物理空间中读取所述待读取数据,由通信单元220将所述待读取数据返回给计算节点100。所述物理地址指示了所述物理空间位于的存储节点,所述存储节点内的存储器,以及所述存储器内的偏移量,所以计算单元221可以直接根据该地址进行读取。如果所述物理地址所指示的物理空间位于其他存储节点上,则将所述读数据请求发送给其他存储节点,并指示该节点从所述物理地址指示的物理空间中读取所述数据。
如果采用多副本机制存储数据,那么存储节点可以按照上面的流程读取任意一个数据副本,将该数据副本发送给计算节点100。如果采用EC校验机制,那么存储节点需要按照上面的流程读取每个数据分片和校验分片,合并获得所述待读取数据,并对所述待读取数据进行验证,验证无误后再返回给计算节点100。可以理解的是,与图11所示的读数据方法是与图9所示的写数据方法对应的,所以该方法中的读数据请求携带的是待读取数据的逻辑地址,本实施例还提供了另一种读数据方法,该方法与图10所示的写数据方法对应,在该方法中,所述读数据请求携带的是待读取数据的全局地址,计算单元221可以直接根据所述全局地址,查询到物理地址,从而获取所述待读取数据。
此外,图1至图8所示的内存池还支持数据预取机制。本领域技术人员可以理解,从性能较高的存储器读取数据的速度高于从性能较低的存储器读取数据的速度。因此,如果待读取数据在性能较高的存储器中命中,那么就不必再从性能较低的存储器中读取了,因此读取数据的效率较高。为了提高缓存的数据命中率,通常的做法是从性能较低的存储器中预先读取一段数据,写入性能较高的存储器中。那么,当计算节点100发送读数据请求要求读取这段数据时,由于该数据已经被提前读取到性能较高的存储器中了,因此IO控制器可以直接从所述性能较高的存储器中读取该数据。对于一段逻辑地址连续的数据来说,它们被一起读取的可能性较大。因此,在实践中通常根据逻辑地址来预取数据。数据预取的方式包括同步预取和异步预取。同步预取是指,当执行读数据请求而待读取数据在高层级的存储器中未命中时,根据本次待读取数据的逻辑地址,将与所述待读取数据的逻辑地址连续的数据从低层级的存储器中读取出来,写入高层级的存储器中。异步预取是指,当执行读数据请求而待读取数据在高层级的存储器中被命中时,根据本次待读取数据的逻辑地址,将与所述待读取数据的逻辑地址连续的数据从从低层级的存储器中读取出来,写入高层级的存储器中。
结合图11,所述执行读数据请求的方法中还可以包含:
S206,计算单元221将与所述待读取数据的逻辑地址连续的其他数据迁移到高层级的存储器中。在S205中,计算单元221从物理地址指示的物理空间中读取了所述待读取数据,所述待读取数据可能存储在高层级的存储器(例如DRAM)中,也可能存储在低层级的存储器(例如SCM)中。如果所述待读取数据存储在DRAM中,那么计算单元221在DRAM中命中。如果所述待读取数据存储在SCM中,那么计算单元221在DRAM中未命中。无论是哪种情况,计算单元221都可以将与所述待读取数据的逻辑地址连续 的其他数据预取到DRAM中。
具体的,计算单元221首先获取与所述待读取数据的逻辑地址连续的逻辑地址。为了描述方便,姑且将所述待读取数据的逻辑地址称为逻辑地址1,与逻辑地址1连续的逻辑地址称为逻辑地址2。计算单元221以逻辑地址2为输入,按照一定算法输出key,所述key可以唯一定位到一个分区ID。然后,计算单元221在索引表中查询所述分区ID对应的全局地址,以及所述全局地址对应的物理地址。最后,计算单元221再从所述物理地址指示的物理空间中读取所述其他数据。所述其他数据可能位于计算单元221本地的存储节点中,也可能位于其他存储节点中。如果所述物理地址所指示的物理空间位于其他存储节点上,则该节点从所述物理地址指示的物理空间中读取所述数据。
同理,如果计算节点100发送给存储节点20的读数据携带的是全局地址,那么预取数据时则根据所述全局地址,将与所述全局地址连续的地址中存储的数据提前读取至高层级的存储器中。
图12是本实施例提供的管理节点的一种结构示意图,管理节点包括了处理器401和存储器402。所述存储器402中存储有程序403。处理器401、存储器402和接口404之间通过***总线405连接并完成相互间的通信。处理器401是单核或多核中央处理单元,或者为特定集成电路,或者为被配置成实施本发明实施例的一个或多个集成电路。存储器402可以为随机存取存储器(Random Access Memory,RAM),也可以为非易失性存储器(non-volatile memory),例如至少一个硬盘存储器。存储器402用于存储计算机执行指令。具体的,计算机执行指令中可以包括程序403。当管理节点运行时,处理器401运行所述程序403以执行下述方法。
例如,创建内存池以提供存储数据的服务,所述内存池包括所述第一存储器和所述至少两个第二存储器;所述控制模块用于控制所述数据从所述第一存储器迁移至所述第二存储器,或者从所述第二存储器迁移至所述第一存储器。
可选的,所述方法还包括获取存储器的状态信息,所述状态信息包括所述第一存储器的类型和容量,以及所述第二存储器的类型和容量。由此,管理节点在创建内存池时具体用于根据所述状态信息创建所述内存池。
可选的,所述管理节点在控制所述数据从第一存储器迁移至所述第二存储器时具体包括:指示所述第一存储节点获取所述数据的访问频率;以及指示所述第一存储节点在所述访问频率低于设定的频率阈值时将所述数据迁移至所述第二存储器。
图13是本实施例提供的管理节点的另一种结构示意图,所述管理节点包括创建模块501和控制模块502。其中,创建模块501用于创建内存池以提供存储数据的服务,所述内存池包括所述第一存储器和所述至少两个第二存储器。控制模块502用于控制所述数据从所述第一存储器迁移至所述第二存储器,或者从所述第二存储器迁移至所述第一存储器。
可选的,创建模块501还用于获取存储器的状态信息,所述状态信息包括所述第一存储器的类型和容量,以及所述第二存储器的类型和容量。创建模块501在创建内存池时具体用于根据所述状态信息创建所述内存池。
可选的,控制模块502在控制所述数据从第一存储器迁移至所述第二存储器时具体用于:指示所述第一存储节点获取所述数据的访问频率;以及指示所述第一存储节点在所述访问频率低于设定的频率阈值时将所述数据迁移至所述第二存储器。
在实践中,创建模块501和控制模块502的功能均可由图12所示的处理器401执行程序403来实现,也可由处理器401独立实现。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意结合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机指令时,全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如:同轴电缆、光纤、数据用户线(Digital Subscriber Line,DSL))或无线(例如:红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如:软盘、硬盘、磁带)、光介质(例如:数字通用光盘(Digital Versatile Disc,DVD))、或者半导体介质(例如:固态硬盘(Solid State Disk,SSD))等。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
在本申请的各个实施例中,如果没有特殊说明以及逻辑冲突,不同的实施例之间的术语和/或描述具有一致性、且可以相互引用,不同的实施例中的技术特征根据其内在的逻辑关系可以组合形成新的实施例。
本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A,B可以是单数或者复数。在本申请的文字描述中,字符“/”,一般表示前后关联对象是一种“或”的关系;在本申请的公式中,字符“/”,表示前后关联对象是一种“相除”的关系。
可以理解的是,在本申请的实施例中涉及的各种数字编号仅为描述方便进行的区分,并不用来限制本申请的实施例的范围。上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定。
以上所述为本申请提供的实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (34)

  1. 一种存储***,其特征在于,包括管理节点、一个或多个第一存储器和一个或多个第二存储器;
    所述管理节点用于:
    创建内存池以提供存储数据的服务,所述内存池包括所述一个或多个第一存储器以及所述一个或多个第二存储器,所述第一存储器的性能高于所述第二存储器,其中,所述一个或多个第一存储器中至少有一个第一存储器位于第一存储节点,所述一个或多个第二存储器中至少有一个第二存储器位于第二存储节点;
    控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移。
  2. 根据权1所述的存储***,其特征在于,所述管理节点还用于获取存储器的状态信息,所述状态信息包括所述第一存储器的类型和容量,以及所述第二存储器的类型和容量;
    所述管理节点具体用于基于所述状态信息创建所述内存池。
  3. 根据权1所述的存储***,其特征在于,所述内存池的存储空间包括若干个页,一个页在内存池中的全局地址映射到所述页的物理地址;其中,所述页的全局地址用于指示所述页在所述内存池中的位置,所述页的物理地址用于指示为所述全局地址分配的物理空间在存储节点内的存储器内的位置,所述页的全局地址与所述页的物理地址的映射关系记录在索引表中。
  4. 根据权3所述的存储***,其特征在于,所述第一存储节点中保存有所述索引表,所述多个第二存储器中有一个第二存储器位于所述第一存储节点,所述第一存储节点用于:
    接收内存分配请求,所述内存分配请求用于请求为所述若干个页中的第一页的第一全局地址从类型为第二存储器的存储器中分配物理空间;
    从所述第一存储节点的所述第二存储器中分配物理空间,或者向所述第二存储节点申请分配所述第二存储节点中的所述第二存储器的物理空间,所述第一存储节点或所述第二存储节点中的所述第二存储器的分配的所述物理空间的位置为所述第一页的第一物理地址;
    将所述第一页的第一全局地址与所述第一页的第一物理地址的映射关系写入所述索引表。
  5. 根据权4所述的存储***,其特征在于,所述第一存储节点包括IO控制器,所述IO控制器存储有所述索引表,所述IO控制器与计算节点通信;
    所述IO控制器用于:
    接收所述计算节点发送的第一数据和所述第一数据的第一逻辑地址;
    根据所述第一逻辑地址确定所述第一数据在所述内存池中的所述第一全局地址;
    根据所述索引表确定是否已经为所述第一全局地址分配有物理空间;
    当确定已经为所述第一全局地址分配有物理空间时,将所述第一数据写入所述第一 物理地址指示的物理空间中。
  6. 根据权3所述的存储***,其特征在于,所述第一存储节点包括IO控制器,所述IO控制器与计算节点通信,所述IO控制器中存储有所述索引表;
    所述IO控制器用于:
    接收所述计算节点发送的第二数据和所述第二数据的第二逻辑地址;
    根据所述第二逻辑地址确定所述第二数据在所述内存池中的第二全局地址;
    根据所述索引表确定是否已经为所述第二全局地址分配有物理空间;
    当确定尚未为所述第二全局地址分配物理空间时,从所述第一存储器中分配物理空间,将所述第二全局地址与第二物理地址的映射关系写入所述索引表,以及将所述第二数据写入所述第二物理地址指示的物理空间中,所述第二物理地址用于指示从所述第一存储器中为所述第二全局地址分配的物理空间的位置。
  7. 根据权3所述的存储***,其特征在于,所述第一存储节点包括IO控制器,所述IO控制器与计算节点通信,所述IO控制器中存储有所述索引表;
    所述IO控制器用于:
    接收所述计算节点发送的第三数据和所述第三数据在所述内存池中的第三全局地址;
    根据所述索引表确定所述第三全局地址对应的第三物理地址;
    将所述第三数据写入所述第三物理地址中,所述第三物理地址用于指示为所述第三全局地址分配的物理空间的位置。
  8. 根据权1所述的存储***,其特征在于,所述数据存储在所述第一存储器,所述管理节点具体用于:
    指示所述第一存储节点在所述数据的访问频率低于设定的频率阈值时将所述数据迁移至所述内存池中的第二存储器。
  9. 根据权5所述的存储***,其特征在于,
    所述IO控制器还用于:
    接收所述计算节点发送的读数据请求,所述读数据请求用于读取所述第一数据,所述读数据请求包括所述第一逻辑地址;
    根据所述第一逻辑地址确定所述第一数据在所述内存池中的第一全局地址;
    根据所述索引表确定所述第一全局地址对应的所述第一物理地址;
    从所述第一物理地址指示的物理空间的位置获取所述第一数据。
  10. 根据权9所述的存储***,其特征在于,所述IO控制器还用于:
    根据所述第一逻辑地址确定与所述第一数据相关联的其他数据,所述其他数据的逻辑地址与所述第一逻辑地址连续,所述其他数据位于所述内存池中的所述第二存储器中;
    获取所述其他数据,将所述其他数据迁移至所述内存池中的第一存储器中。
  11. 根据权1所述的存储***,其特征在于,所述存储***还包括计算节点,所述计算节点用于访问所述内存池中的所述数据,所述计算节点在物理上独立于所述第一存储节点和所述第二存储节点。
  12. 根据权1所述的存储***,其特征在于,所述存储***还包括计算节点,所述计算节点用于访问所述内存池中的所述数据,所述第一存储节点和所述第二存储节点中的一个存储节点和所述计算节点位于同一个物理设备中。
  13. 一种内存管理方法,其特征在于,所述方法应用于存储***中,所述存储***包括管理节点、一个或多个第一存储器和一个或多个第二存储器;
    所述管理节点创建内存池以提供存储数据的服务,所述内存池包括所述一个或多个第一存储器以及所述一个或多个第二存储器,所述第一存储器的性能高于所述第二存储器,其中,所述一个或多个第一存储器中至少有一个第一存储器位于第一存储节点,所述一个或多个第二存储器中至少有一个第二存储器位于第二存储节点;以及
    所述管理节点控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移。
  14. 根据权13所述的方法,其特征在于,所述方法还包括:
    所述管理节点获取存储器的状态信息,所述状态信息包括所述第一存储器的类型和容量,以及所述第二存储器的类型和容量;
    所述创建内存池具体包括:基于所述状态信息创建所述内存池。
  15. 根据权13所述的方法,其特征在于,所述内存池的存储空间包括若干个页,一个页在内存池中的全局地址映射到所述页的物理地址;其中,所述页的全局地址用于指示所述页在所述内存池中的位置,所述页的物理地址用于指示为所述页分配的物理空间在存储节点内的存储器内的位置,所述页的全局地址与所述页的物理地址的映射关系记录在索引表中。
  16. 根据权15所述的方法,其特征在于,所述第一存储节点中保存有所述索引表,所述多个第二存储器中有一个第二存储器位于所述第一存储节点,所述方法还包括:
    所述第一存储节点接收内存分配请求,所述内存分配请求用于请求为所述若干个页中的第一页的第一全局地址从类型为第二存储器的存储器中分配物理空间;
    所述第一存储节点从所述第一存储节点的所述第二存储器中分配物理空间,或者向所述第二存储节点申请分配所述第二存储节点中的所述第二存储器的物理空间,所述第一存储节点或所述第二存储节点中的所述第二存储器的分配的所述物理空间的地址为所述第一页的第一物理地址,将所述第一页的全局地址与所述第一页的第一物理地址的映射关系写入所述索引表。
  17. 根据权16所述的方法,其特征在于,所述第一存储节点包括IO控制器,所述IO控制器存储有所述索引表,所述IO控制器与计算节点通信;所述方法还包括:
    所述IO控制器接收所述计算节点发送的第一数据和所述第一数据的第一逻辑地址;
    所述IO控制器根据所述第一逻辑地址确定所述第一数据在所述内存池中的第一全局地址;
    所述IO控制器根据所述索引表确定是否已经为所述第一全局地址分配有物理空间;
    当确定已经为所述第一全局地址分配有物理空间时,所述计算单元将所述第一数据写入所述第一物理地址指示的物理空间中。
  18. 根据权15所述的方法,其特征在于,所述第一存储节点包括IO控制器,所述IO控制器与计算节点通信,,所述IO控制器中存储有所述索引表;所述方法还包括:
    所述IO控制器接收所述计算节点发送的第二数据和所述第二数据的第二逻辑地址;
    所述IO控制器根据所述第二逻辑地址确定所述第二数据的第二全局地址;
    所述IO控制器根据所述索引表确定是否已经为所述第二全局地址分配有物理空间;
    当确定尚未为所述第二全局地址分配物理空间时,所述IO控制器从所述第一存储器中分配物理空间,将所述第二全局地址与第二物理地址的映射关系写入所述索引表,以及将所述第二数据写入所述第二物理地址指示的物理空间中,所述第二物理地址用于指示从所述第一存储器中分配的物理空间。
  19. 根据权15所述的方法,其特征在于,所述第一存储节点包括IO控制器,所述IO控制器与计算节点通信,所述IO控制器中存储有所述索引表;所述方法还包括:
    所述IO控制器接收所述计算节点发送的第三数据和所述第三数据的第三全局地址;
    所述IO控制器根据所述索引表确定所述第三全局地址对应的第三物理地址;
    所述IO控制器将所述第三数据写入所述第三物理地址中,所述第二物理地址用于指示为所述第三全局地址分配的物理空间的位置。
  20. 根据权13所述的方法,其特征在于,所述数据存储在所述第一存储器,
    所述控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移具体包括:
    所述管理节点指示所述第一存储节点在所述数据的访问频率低于设定的频率阈值时将所述数据迁移至所述第二存储器。
  21. 根据权17所述的方法,其特征在于,所述方法还包括:
    所述IO控制器接收所述计算节点发送的读数据请求,所述读数据请求用于读取所述第一数据,所述读数据请求包括所述第一逻辑地址;
    所述IO控制器根据所述第一逻辑地址确定所述第一数据的第一全局地址;
    所述IO控制器根据所述索引表确定所述第一全局地址对应的所述第一物理地址;
    所述IO控制器从所述第一全局地址对应的所述第一物理地址指示的物理空间中获取所述第一数据。
  22. 根据权21所述的方法,其特征在于,所述方法还包括:
    所述IO控制器根据所述第一逻辑地址确定与所述第一数据相关联的其他数据,所述其他数据的逻辑地址与所述第一逻辑地址连续,所述其他数据位于所述内存池中的所述第二存储器中;
    所述计算单元获取所述其他数据,将所述其他数据迁移至所述内存池中的第一存储器中。
  23. 一种管理节点,其特征在于,所述管理节点位于存储***中,所述存储***包括一个或多个第一存储器和一个或多个第二存储器;
    所述管理节点包括创建模块和控制模块;
    所述创建模块用于创建内存池以提供存储数据的服务,所述内存池包括所述一个或多个第一存储器以及所述一个或多个第二存储器,所述第一存储器的性能高于所述第二存储器,其中,所述一个或多个第一存储器中至少有一个第一存储器位于第一存储节点,所述一个或多个第二存储器中至少有一个第二存储器位于第二存储节点;
    所述控制模块用于控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移。
  24. 根据权23所述的管理节点,其特征在于,所述创建模块还用于获取存储器的状态信息,所述状态信息包括所述第一存储器的类型和容量,以及所述第二存储器的类型和容量;
    所述创建模块具体用于根据所述状态信息创建所述内存池。
  25. 根据权23所述的管理节点,其特征在于,所述内存池的存储空间包括若干个页,一个页在内存池中的全局地址映射到所述页的物理地址;其中,所述页的全局地址用于指示所述页在所述内存池中的位置,所述页的物理地址用于指示为所述页分配的物理空间在存储节点内的存储器内的位置。
  26. 根据权23所述的管理节点,其特征在于,所述数据存储在所述第一存储器,所述控制模块具体用于:
    指示所述第一存储节点在所述数据的访问频率低于设定的频率阈值时将所述数据迁移至所述内存池中的第二存储器。
  27. 一种管理节点,其特征在于,所述管理节点位于存储***中,所述存储***包括一个或多个第一存储器和一个或多个第二存储器;
    所述管理节点包括接口和处理器;
    其中,
    所述处理器用于:
    创建内存池以提供存储数据的服务,所述内存池包括所述一个或多个第一存储器以及所述一个或多个第二存储器,所述第一存储器的性能高于所述第二存储器,其中,所述一个或多个第一存储器中至少有一个第一存储器位于第一存储节点,所述一个或多个第二存储器中至少有一个第二存储器位于第二存储节点;
    控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移;
    所述接口用于与所述第一存储节点以及所述第二存储节点通信。
  28. 根据权27所述的管理节点,其特征在于,所述处理器还用于通过所述接口获取存储器的状态信息,所述状态信息包括所述第一存储器的类型和容量,以及所述第二存储器的类型和容量;
    所述处理器具体用于根据所述状态信息创建所述内存池。
  29. 根据权27所述的管理节点,其特征在于,所述内存池的存储空间包括若干个页,一个页在内存池中的全局地址映射到所述页的物理地址;其中,所述页的全局地址用于指示所述页在所述内存池中的位置,所述页的物理地址用于指示为所述页分配的物理空间在存储节点内的存储器内的位置。
  30. 根据权27所述的管理节点,其特征在于,所述数据存储在所述第一存储器,所述处理器具体用于:
    指示所述第一存储节点在所述数据的访问频率低于设定的频率阈值时将所述数据迁移至所述内存池中的第二存储器。
  31. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有程序指令,所述程序指令用于执行以下方法:
    创建内存池以提供存储数据的服务,所述内存池包括一个或多个第一存储器以及一个或多个第二存储器,所述第一存储器的性能高于所述第二存储器,其中,所述一个或多个第一存储器中至少有一个第一存储器位于第一存储节点,所述一个或多个第二存储器中至少有一个第二存储器位于第二存储节点;以及
    控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移。
  32. 根据权31所述的计算机可读存储介质,其特征在于,还包括:
    获取存储器的状态信息,所述状态信息包括所述第一存储器的类型和容量,以及所述第二存储器的类型和容量;
    所述创建内存池具体包括:基于所述状态信息创建所述内存池。
  33. 根据权31所述的计算机可读存储介质,其特征在于,所述内存池的存储空间包括若干个页,一个页在内存池中的全局地址映射到所述页的物理地址;其中,所述页的全局地址用于指示所述页在所述内存池中的位置,所述页的物理地址用于指示为所述页 分配的物理空间在存储节点内的存储器内的位置。
  34. 根据权31所述的计算机可读存储介质,其特征在于,
    所述控制所述数据在所述内存池中的所述第一存储器与所述第二存储器之间迁移具体包括:
    指示所述第一存储节点在所述数据的访问频率低于设定的频率阈值时将所述数据迁移至所述第二存储器。
PCT/CN2020/119857 2020-04-28 2020-10-07 一种存储***、内存管理方法和管理节点 WO2021218038A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2021569533A JP7482905B2 (ja) 2020-04-28 2020-10-07 ストレージシステム、メモリ管理方法、および管理ノード
EP20933906.8A EP3958107A4 (en) 2020-04-28 2020-10-07 STORAGE SYSTEM, MEMORY MANAGEMENT METHOD AND MANAGEMENT NODE
US17/510,388 US11861204B2 (en) 2020-04-28 2021-10-26 Storage system, memory management method, and management node
US18/527,353 US20240094936A1 (en) 2020-04-28 2023-12-03 Storage system, memory management method, and management node

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202010348770.0 2020-04-28
CN202010348770 2020-04-28
CN202010625111.7A CN113568562A (zh) 2020-04-28 2020-07-01 一种存储***、内存管理方法和管理节点
CN202010625111.7 2020-07-01

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/510,388 Continuation US11861204B2 (en) 2020-04-28 2021-10-26 Storage system, memory management method, and management node

Publications (1)

Publication Number Publication Date
WO2021218038A1 true WO2021218038A1 (zh) 2021-11-04

Family

ID=78158704

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/119857 WO2021218038A1 (zh) 2020-04-28 2020-10-07 一种存储***、内存管理方法和管理节点

Country Status (5)

Country Link
US (2) US11861204B2 (zh)
EP (1) EP3958107A4 (zh)
JP (1) JP7482905B2 (zh)
CN (3) CN114610232A (zh)
WO (1) WO2021218038A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11687443B2 (en) * 2013-05-15 2023-06-27 EMC IP Holding Company LLC Tiered persistent memory allocation
CN114489475A (zh) * 2021-12-01 2022-05-13 阿里巴巴(中国)有限公司 分布式存储***及其数据存储方法
CN114153754B (zh) * 2022-02-08 2022-04-29 维塔科技(北京)有限公司 用于计算集群的数据传输方法、装置及存储介质
WO2023193814A1 (zh) * 2022-04-08 2023-10-12 华为技术有限公司 融合***的数据处理方法、装置、设备和***
CN117311593A (zh) * 2022-06-25 2023-12-29 华为技术有限公司 数据处理方法、装置及***
CN117312224A (zh) * 2022-06-27 2023-12-29 华为技术有限公司 数据处理***、方法、装置和控制器
CN116204137B (zh) * 2023-05-04 2023-08-04 苏州浪潮智能科技有限公司 基于dpu的分布式存储***、控制方法、装置及设备
CN117032040B (zh) * 2023-08-31 2024-06-07 中科驭数(北京)科技有限公司 传感器数据获取方法、装置、设备及可读存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060085626A1 (en) * 2004-10-20 2006-04-20 Seagate Technology Llc Updating system configuration information
CN104102460A (zh) * 2014-07-23 2014-10-15 浪潮(北京)电子信息产业有限公司 一种基于云计算的内存管理方法及装置
CN107229573A (zh) * 2017-05-22 2017-10-03 上海天玑数据技术有限公司 一种基于固态硬盘的弹性高可用缓存方法
US20170329541A1 (en) * 2016-05-11 2017-11-16 Hitachi, Ltd. Data storage system, process and computer program for such data storage system for reducing read and write amplifications
CN110134514A (zh) * 2019-04-18 2019-08-16 华中科技大学 基于异构内存的可扩展内存对象存储***

Family Cites Families (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5884313A (en) * 1997-06-30 1999-03-16 Sun Microsystems, Inc. System and method for efficient remote disk I/O
US6370585B1 (en) * 1997-09-05 2002-04-09 Sun Microsystems, Inc. Multiprocessing computer system employing a cluster communication launching and addressing mechanism
US7975109B2 (en) 2007-05-30 2011-07-05 Schooner Information Technology, Inc. System including a fine-grained memory and a less-fine-grained memory
CN102340530B (zh) * 2010-07-26 2015-10-14 杭州信核数据科技有限公司 一种存储空间接管和数据迁移的方法和***
US9811288B1 (en) * 2011-12-30 2017-11-07 EMC IP Holding Company LLC Managing data placement based on flash drive wear level
CN102622304A (zh) * 2012-02-07 2012-08-01 中山爱科数字科技股份有限公司 一种双层地址空间映射的内存优化方法
JP5981563B2 (ja) * 2012-04-26 2016-08-31 株式会社日立製作所 情報記憶システム及び情報記憶システムの制御方法
KR101665611B1 (ko) * 2012-05-08 2016-10-12 마벨 월드 트레이드 리미티드 컴퓨터 시스템 및 메모리 관리의 방법
US9430400B2 (en) * 2013-03-14 2016-08-30 Nvidia Corporation Migration directives in a unified virtual memory system architecture
KR20150132099A (ko) * 2013-03-20 2015-11-25 휴렛-팩커드 디벨롭먼트 컴퍼니, 엘.피. 서로 다른 계층 레벨의 메모리 노드를 가진 메모리 시스템에서의 데이터 캐싱
CN116301649A (zh) 2013-04-26 2023-06-23 株式会社日立制作所 存储***
CN103440208B (zh) * 2013-08-12 2016-02-03 华为技术有限公司 一种数据存储的方法及装置
CN105940386B (zh) * 2014-01-30 2019-12-17 慧与发展有限责任合伙企业 用于在存储器之间移动数据的方法、***和介质
KR102127116B1 (ko) * 2014-03-12 2020-06-26 삼성전자 주식회사 분산 데이터 저장 장치 및 분산 데이터 저장 방법
CN105095094B (zh) 2014-05-06 2018-11-30 华为技术有限公司 内存管理方法和设备
US20160011816A1 (en) * 2014-07-09 2016-01-14 Nexenta Systems, Inc. Method to optimize inline i/o processing in tiered distributed storage systems
JP6443170B2 (ja) * 2015-03-26 2018-12-26 富士通株式会社 階層ストレージ装置,階層ストレージ制御装置,階層ストレージ制御プログラム及び階層ストレージ制御方法
JP2016192170A (ja) * 2015-03-31 2016-11-10 富士通株式会社 ストレージ制御装置、ストレージシステムおよびストレージ制御プログラム
US10129361B2 (en) * 2015-07-01 2018-11-13 Oracle International Corporation System and method for multi-version remote function execution control in a distributed computing environment
WO2017028309A1 (zh) * 2015-08-20 2017-02-23 华为技术有限公司 文件数据访问方法和计算机***
US10042751B1 (en) * 2015-09-30 2018-08-07 EMC IP Holding Company LLC Method and system for multi-tier all-flash array
US11240334B2 (en) 2015-10-01 2022-02-01 TidalScale, Inc. Network attached memory using selective resource migration
US10055139B1 (en) * 2016-03-31 2018-08-21 EMC IP Holding Company LLC Optimized layout in a two tier storage
US10740016B2 (en) * 2016-11-11 2020-08-11 Scale Computing, Inc. Management of block storage devices based on access frequency wherein migration of block is based on maximum and minimum heat values of data structure that maps heat values to block identifiers, said block identifiers are also mapped to said heat values in first data structure
CN108804350B (zh) * 2017-04-27 2020-02-21 华为技术有限公司 一种内存访问方法及计算机***
US10534719B2 (en) 2017-07-14 2020-01-14 Arm Limited Memory system for a data processing network
CN110199270B (zh) * 2017-12-26 2022-09-02 华为技术有限公司 存储***中存储设备的管理方法及装置
US10534559B2 (en) * 2018-02-14 2020-01-14 International Business Machines Corporation Heat-tiered storage system having host awareness
CN110858124B (zh) * 2018-08-24 2021-06-01 华为技术有限公司 数据迁移方法及装置
JP6853227B2 (ja) * 2018-10-10 2021-03-31 株式会社日立製作所 ストレージシステム及びストレージ制御方法
US10795576B2 (en) * 2018-11-01 2020-10-06 Micron Technology, Inc. Data relocation in memory
US11016908B2 (en) * 2018-12-11 2021-05-25 International Business Machines Corporation Distributed directory of named data elements in coordination namespace
US10853252B2 (en) * 2019-01-30 2020-12-01 EMC IP Holding Company LLC Performance of read operations by coordinating read cache management and auto-tiering
US20230118994A1 (en) * 2022-11-01 2023-04-20 Intel Corporation Serverless function instance placement among storage tiers

Also Published As

Publication number Publication date
US20240094936A1 (en) 2024-03-21
US20220057954A1 (en) 2022-02-24
CN114860163A (zh) 2022-08-05
CN113568562A (zh) 2021-10-29
CN114860163B (zh) 2023-08-22
JP7482905B2 (ja) 2024-05-14
EP3958107A1 (en) 2022-02-23
EP3958107A4 (en) 2022-08-17
US11861204B2 (en) 2024-01-02
JP2022539950A (ja) 2022-09-14
CN114610232A (zh) 2022-06-10

Similar Documents

Publication Publication Date Title
WO2021218038A1 (zh) Storage system, memory management method, and management node
US10891055B2 (en) Methods, systems and devices relating to data storage interfaces for managing data address spaces in data storage devices
US10657101B2 (en) Techniques for implementing hybrid flash/HDD-based virtual disk files
KR101726824B1 (ko) Efficient use of hybrid media in cache architectures
US9141529B2 (en) Methods and apparatus for providing acceleration of virtual machines in virtual environments
US9984004B1 (en) Dynamic cache balancing
WO2020037986A1 (zh) Data migration method and apparatus
US8694563B1 (en) Space recovery for thin-provisioned storage volumes
EP4105770A1 (en) B+ tree access method and apparatus, and computer-readable storage medium
CN115794669A (zh) Memory expansion method and apparatus, and related device
WO2023045492A1 (zh) Data prefetching method, compute node, and storage system
US9189407B2 (en) Pre-fetching in a storage system
US8769196B1 (en) Configuring I/O cache
Guo et al. HP-mapper: A high performance storage driver for docker containers
US20240020014A1 (en) Method for Writing Data to Solid-State Drive
US10853257B1 (en) Zero detection within sub-track compression domains
CN116340203A (zh) Data prefetching method and apparatus, processor, and prefetcher
CN111796757B (zh) Method and apparatus for managing a solid-state drive cache area
US11144445B1 (en) Use of compression domains that are more granular than storage allocation units
US12008241B2 (en) Techniques for collecting and utilizing activity metrics
KR102149468B1 (ko) System and method for dynamically allocating a unified cache to one or more logical units
US20240176741A1 (en) Caching techniques using a two-level read cache
US20240231628A9 (en) Techniques for determining and using temperature classifications with adjustable classification boundaries
US20240134531A1 (en) Techniques for determining and using temperature classifications with adjustable classification boundaries
US20240134712A1 (en) Techniques for efficient flushing and providing optimal resource utilization

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021569533

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2020933906

Country of ref document: EP

Effective date: 20211118

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933906

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE