WO2017167106A1 - Storage system - Google Patents

Storage system

Info

Publication number
WO2017167106A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage
node
disk
nodes
network
Prior art date
Application number
PCT/CN2017/077755
Other languages
English (en)
French (fr)
Inventor
王东临
金友兵
齐宇
Original Assignee
北京书生国际信息技术有限公司
书生云公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京书生国际信息技术有限公司, 书生云公司
Publication of WO2017167106A1
Priority to US16/139,712 (US10782898B2)
Priority to US16/378,076 (US20190235777A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications

Definitions

  • The present invention relates to the technical field of data storage systems, and more particularly to a storage system.
  • A typical highly available distributed storage system connects multiple physical servers; when one storage server fails, its workload is taken over by the other storage servers.
  • To judge whether a server has failed, a heartbeat link is commonly used: the two servers are connected by a heartbeat line, and if one server cannot receive the heartbeat signal of the other, the other server is judged to have failed. This method has a problem: if the server is healthy and only the heartbeat line has failed, a misjudgment occurs, and it may even happen that both servers believe the other has failed and fight over taking each other's workload.
  • The quorum disk is used to solve this problem. The quorum disk is storage space shared by the master and slave servers; whether the corresponding server has failed is judged by whether a specific signal can be written to the quorum disk. In practice, however, this technique does not completely solve the problem: if only the channel to the quorum disk has failed while the server is still intact, the same problem remains.
  • Embodiments of the present invention provide a storage system that improves the availability of the quorum disk and improves the reliability of failure judgment in the storage system.
  • The storage system includes at least two storage nodes connected to the storage network.
  • Each storage device includes at least one storage medium, and all storage media included in the at least one storage device constitute a storage pool.
  • The storage network is configured so that every storage node can access all storage media without relying on other storage nodes.
  • The storage pool is divided into at least two storage areas, and one storage area is selected from the at least two storage areas as a global quorum disk.
  • The storage used by the compute nodes (virtual machines, containers, and the like) on each physical server is also in the shared storage; they are in the same shared storage pool as the quorum disk and use the same storage channel. Therefore, if a server cannot read and write the quorum disk, then whether the server itself or the related storage channel has failed, the compute nodes on that server cannot work normally either, and judging whether a server has failed in this way is particularly accurate.
  • FIG. 1 shows a block diagram of a storage system constructed in accordance with one embodiment of the present invention.
  • FIG. 2 shows a block diagram of a storage system in accordance with an embodiment of the present invention.
  • FIG. 1 shows a block diagram of a storage system in accordance with an embodiment of the present invention.
  • The storage system includes a storage network; storage nodes connected to the storage network, where a storage node is a software module that provides a storage service rather than, in the usual sense, a hardware server containing storage media; and storage devices, likewise connected to the storage network.
  • Each storage device includes at least one storage medium.
  • The storage network is configured so that every storage node can access all storage media without relying on other storage nodes.
  • Since every storage node can access all storage media without relying on other storage nodes, all storage media are in effect shared by all storage nodes, which achieves the effect of a global storage pool.
  • In the prior art, the storage node is located on the storage-medium side, or, strictly speaking, the storage media are internal disks of the physical machine where the storage node runs. In the embodiments of the present invention, the physical machine where the storage node runs is independent of the storage device, and the storage device serves mainly as a channel connecting the storage media to the storage network.
  • The storage-node side further includes compute nodes; a compute node and a storage node are placed in the same physical server, and the physical server is connected to the storage devices through the storage network.
  • The converged storage system constructed according to the embodiment of the present invention, with compute node and storage node in the same physical machine, reduces the number of physical devices required, thereby reducing cost.
  • The compute node can also access the storage resources it wishes to access locally.
  • Because compute node and storage node are aggregated on the same physical server, data exchange between the two can be as simple as shared memory, giving particularly good performance.
  • The I/O data path between a compute node and a storage medium consists of: (1) storage medium to storage node; and (2) storage node to the compute node aggregated in the same physical server (a CPU-bus path).
  • In the prior-art storage system shown in FIG. 1, the I/O data path between a compute node and a storage medium consists of: (1) storage medium to storage node; (2) storage node to the access switch of the storage network; (3) access switch of the storage network to the core switch; (4) core switch to the access switch of the compute network; and (5) access switch of the compute network to the compute node.
  • The total data path of the storage system of the embodiment of the present invention is therefore close to only item (1) of the conventional storage system. In other words, by compressing the I/O data path length to the extreme, the storage system provided by the embodiments of the present invention greatly improves I/O channel performance, and its actual behavior is very close to the I/O channel of a local hard disk.
  • The storage node may be a virtual machine of the physical server, a container, or a module running directly on the server's physical operating system, and the compute node may likewise be a virtual machine of the same physical server, a container, or a module running directly on the server's physical operating system.
  • Each storage node may correspond to one or more compute nodes.
  • One physical server may be divided into multiple virtual machines, one of which is used as a storage node while the other virtual machines are used as compute nodes; alternatively, a module on the physical OS may be used as the storage node in order to achieve better performance.
  • The virtualization technology forming the virtual machines may be KVM, Xen, VMware, or Hyper-V virtualization technology.
  • The container technology forming the containers may be Docker, Rocket, Odin, Chef, LXC, Vagrant, Ansible, Zone, Jail, or Hyper-V container technology.
  • Each storage node is, at any given time, responsible for managing only a fixed set of storage media, and one storage medium is not written by multiple storage nodes simultaneously, which avoids data conflicts. In this way, every storage node can access the storage media it manages without relying on other storage nodes, and the integrity of the data stored in the storage system can be guaranteed.
  • All the storage media in the system may be divided according to storage logic: the storage pool of the entire system may be divided into a logical storage hierarchy of storage areas, storage groups, and storage blocks, in which the storage block is the smallest storage unit.
  • The storage pool may be divided into at least two storage areas.
  • Each storage area may be divided into at least one storage group. In a preferred embodiment, each storage area is divided into at least two storage groups.
  • The storage area and the storage group can be merged, so that one level can be omitted from the storage hierarchy.
  • Each storage area may be composed of at least one storage block, where a storage block may be a complete storage medium or a part of a storage medium.
  • Each storage area may be composed of at least two storage blocks, so that when any one of the storage blocks fails, the complete stored data can be reconstructed from the remaining storage blocks in the group.
  • The redundant storage mode can be a multi-copy mode, a redundant array of independent disks (RAID) mode, or an erasure-code mode.
  • The redundant storage can be established through the ZFS file system.
  • The multiple storage blocks included in each storage area (or storage group) are not located in the same storage medium, or even in the same storage device. In an embodiment of the invention, no two storage blocks included in the same storage area (or storage group) are located in the same storage medium/storage device. In another embodiment of the present invention, the number of storage blocks of the same storage area (or storage group) located in the same storage medium/storage device is preferably less than or equal to the redundancy of the redundant storage.
  • For RAID 5, the redundancy of the redundant storage is 1, and the number of storage blocks of the same storage group on the same storage device is at most 1; for RAID 6, the redundancy of the redundant storage is 2, and the number of storage blocks of the same storage group on the same storage device is at most 2.
  • Each storage node can read and write only the storage areas it manages itself. Since read operations by multiple storage nodes on the same storage block do not conflict with each other, whereas simultaneous writes to one storage block by multiple storage nodes easily conflict, in another embodiment each storage node can write only the storage areas it manages but can read both the storage areas it manages and the storage areas managed by other storage nodes; that is, write operations are local, but read operations can be global.
  • The storage system may further include a storage control node, coupled to the storage network, for determining the storage areas managed by each storage node.
  • Alternatively, each storage node may include a storage allocation module for determining the storage areas managed by that storage node; this may be implemented by a communication and coordination algorithm between the storage allocation modules included in the storage nodes, which may, for example, be based on the principle of load balancing between the storage nodes.
  • When a storage node fails, some or all of the other storage nodes may be configured so that they take over the storage areas previously managed by the failed storage node.
  • One of the storage nodes may take over the storage areas managed by the failed storage node, or they may be taken over by at least two other storage nodes, each taking over a portion of the storage areas managed by the failed storage node, for example with the at least two other storage nodes taking over different storage groups in those storage areas.
  • The storage medium may include, but is not limited to, a hard disk, flash memory, SRAM, DRAM, an NVMe device, or the like.
  • The access interface of the storage medium may include, but is not limited to, a SAS interface, SATA interface, PCI/e interface, DIMM interface, NVMe interface, SCSI interface, or AHCI interface.
  • The storage network may include at least one storage switching device, and storage nodes access the storage media through data exchange across the storage switching devices it contains. Specifically, the storage nodes and the storage media are each connected to the storage switching devices through storage channels.
  • This provides a storage system supporting multipoint control, in which a single storage space can be accessed through multiple channels, for example by a compute node.
  • The storage switching device may be a SAS switch or a PCI/e switch.
  • The storage channel may correspondingly be a SAS (Serial Attached SCSI) channel or a PCI/e channel.
  • Compared with traditional IP-based storage solutions, a SAS-switched solution has the advantages of high performance, large bandwidth, and a large number of disks in a single device.
  • Used together with an HBA (host bus adapter) or the SAS interface on a server motherboard, the storage provided by the SAS fabric can easily be accessed by multiple connected servers simultaneously.
  • The SAS switch is connected to the storage device through a SAS cable, and the storage device and the storage media are also connected through SAS interfaces.
  • The storage device internally routes the SAS channel to each storage medium (a SAS switch chip may be set inside the storage device). The bandwidth of a SAS network can reach 24 Gb or 48 Gb, which is tens of times that of Gigabit Ethernet and several times that of expensive 10 Gigabit Ethernet; at the link layer, SAS has roughly an order-of-magnitude improvement over an IP network, and at the transport layer, the TCP three-way handshake and four-way close carry high overhead, while the TCP delayed-acknowledgement mechanism and slow start sometimes cause delays on the order of 100 milliseconds.
  • SAS networks therefore offer significant advantages in bandwidth and latency over Ethernet-based TCP/IP. Those skilled in the art will appreciate that the performance of the PCI/e channel can also meet the needs of the system.
  • The storage network may include at least two storage switching devices, and each storage node can be connected to any one of the storage devices, and hence to the storage media, through any one of the storage switching devices.
  • When any storage switching device, or a storage channel connected to it, fails, the storage nodes read and write data on the storage devices through the other storage switching devices.
  • The storage devices in the storage system 30 are constructed as a plurality of JBODs 307-310, which are connected through SAS data cables to the two SAS switches 305 and 306 that constitute the switching core of the storage network included in the storage system.
  • At the front end are at least two servers 301 and 302, each of which is connected to the two SAS switches 305 and 306 via an HBA device (not shown) or the SAS interface on its motherboard.
  • Each server runs a storage node that manages some or all of the disks among all the JBOD disks using information obtained from the SAS links.
  • The storage areas, storage groups, and storage blocks described above are used to divide the JBOD disks into different storage groups.
  • Each storage node manages one or more such storage groups.
  • When redundant storage is used inside each storage group, the metadata of the redundant storage can be kept on the disks, so that the redundant storage can be recognized directly from the disks by other storage nodes.
  • A monitoring and management module can be installed on the storage nodes to monitor the status of local storage and of the other servers.
  • When a JBOD is abnormal overall, or a disk on a JBOD is abnormal, data reliability is ensured by the redundant storage.
  • When a server fails, the management module in the storage node on another, pre-designated server identifies locally, from the data on the disks, the disks managed by the storage node of the failed server and takes them over.
  • The storage services originally provided by the storage node of the faulty server are then continued by the storage node on the new server. In this way, a new, highly available global storage pool structure is achieved.
  • The exemplary storage system 30 so constructed provides a multipoint-controllable, globally accessible storage pool.
  • On the hardware side, multiple servers provide the external services, and JBODs hold the disks.
  • Multiple JBODs are each connected to the two SAS switches, and the two switches are respectively connected to the servers' HBA cards, thereby ensuring that all disks in the JBODs can be accessed by all servers.
  • The redundant SAS links also ensure high availability at the link level.
  • Each server uses redundant storage technology to select disks from each JBOD to form redundant storage, so that the loss of a single JBOD does not make data unavailable.
  • When a server fails, the module that monitors the overall state schedules another server to access, through the SAS channel, the disks managed by the storage node of the failed server and to take over those disks quickly, achieving highly available global storage.
  • Although a JBOD holding the disks is illustrated in FIG. 2 as an example, it should be understood that the embodiment of the present invention shown in FIG. 2 also supports storage devices other than JBODs.
  • The above takes one (whole) storage medium as one storage block as an example; the same applies when a part of one storage medium is used as one storage block.
  • Whether each server has failed may be monitored by dividing the global storage pool into at least two storage areas and selecting one of the at least two storage areas as a global quorum disk.
  • Each storage node is capable of reading and writing the global quorum disk, but at the same time is responsible for managing only zero or more storage areas among the remaining storage areas (excluding the storage area where the global quorum disk resides).
  • The global quorum disk is used by the upper-layer application of the servers, namely the storage nodes; that is, each storage node can directly read and write the global quorum disk. Because of the multipoint control of storage access, each storage node can see the content updated by the other storage nodes.
  • The storage space of the global quorum disk is divided into a plurality of fixed partitions, each of which is respectively assigned to one of the one or more storage nodes.
  • In this way the storage nodes can avoid concurrent read/write conflicts on the quorum disk between multiple controlling nodes.
  • The global quorum disk may be configured so that each of the one or more storage nodes, when using the global quorum disk, can perform write operations only on the fixed partition assigned to it, and performs read operations on the fixed partitions assigned to the other storage nodes. This enables a storage node to update its own state while following the state changes of the other storage nodes.
  • An election lock may be provided on the global quorum disk.
  • When a storage node fails, the remaining storage nodes use the election-lock mechanism to elect the takeover node.
  • The election-lock mechanism is all the more valuable when the failed storage node had a special role.
  • The global quorum disk, being itself a storage area, may also have the characteristics of a storage area as discussed above.
  • The global quorum disk includes one or more storage media, or includes some or all of one or more storage media.
  • The storage media included in the global quorum disk may be located in the same storage device or in different storage devices.
  • The global quorum disk may be composed of one complete storage medium, of two complete storage media, of parts of two storage media, or of a part of one storage medium plus one or several further complete storage media.
  • The global quorum disk may be built, in redundant-storage fashion, from all or parts of at least two storage media on at least two storage devices.
  • Taking JBODs as the storage devices as an example:
  • Since each storage node server can access all the storage resources on the JBODs, some storage space can be extracted from one or more disks of each JBOD and combined for use as the global quorum disk.
  • By controlling the distribution of the quorum disk, the reliability of the quorum disk can easily be improved. In the most severe case, the quorum disk still works when only one JBOD in the system survives.
  • The storage used by the compute nodes (virtual machines, containers, and the like) on each physical server is also in the global storage pool, specifically in the same shared storage pool as the quorum disk.
  • The normal reads and writes of the global storage pool by the compute nodes and storage nodes travel over the same storage channel as the storage nodes' reads and writes of the quorum disk.
  • If a server cannot read and write the quorum disk, the compute nodes on that server are definitely not working properly, that is, they cannot access their normal storage resources. Therefore, it is very reliable to judge whether the corresponding compute nodes work effectively through such a quorum-disk structure.
  • Each storage node continuously writes data to the quorum disk.
  • Each storage node continuously monitors (by reading) whether the other storage nodes periodically write data to the quorum disk. If a storage node does not write data to the quorum disk on time, it can be determined that the compute nodes corresponding to that storage node are not working properly.
  • The manner in which a storage node continuously writes heartbeat data to the quorum disk is that the storage node periodically writes heartbeat data to the quorum disk at a preset time interval of the system, for example writing data into the quorum disk every five seconds.
Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A storage system is provided with a highly available global quorum disk. The storage system includes a storage network, at least two storage nodes, and at least one storage device, where each storage device includes at least one storage medium and all the storage media included in the at least one storage device constitute a storage pool. The storage network is configured so that every storage node can access all storage media without relying on other storage nodes; the storage pool is divided into at least two storage areas, and one storage area is selected from the at least two storage areas as a global quorum disk.

Description

Storage System
Technical Field
The present invention relates to the technical field of data storage systems, and more particularly to a storage system.
Background Art
As the scale of computer applications keeps growing, so does the demand for storage space. A typical highly available distributed storage system connects multiple physical servers; when one storage server fails, its workload is taken over by the other storage servers. To judge whether a server has failed, a heartbeat link is commonly used: the two servers are connected by a heartbeat line, and if one server cannot receive the heartbeat signal of the other server, the other server is judged to have failed. This method has a certain problem: when the server itself is healthy and only the heartbeat line has failed, a misjudgment occurs, and it may even happen that both servers believe the other has failed and fight over taking each other's workload.
The quorum disk is used to solve this problem. The quorum disk is storage space shared by the master and slave servers; whether the corresponding server has failed is judged by whether a specific signal can be written to the quorum disk. In practice, however, this technique does not completely solve the problem: if only the channel to the quorum disk has failed while the server is still intact, the same problem remains.
Summary of the Invention
In view of this, embodiments of the present invention provide a storage system that improves the availability of the quorum disk and improves the reliability of failure judgment in the storage system.
The storage system provided by an embodiment of the present invention includes:
a storage network;
at least two storage nodes connected to the storage network; and
at least one storage device connected to the storage network, where each storage device includes at least one storage medium, and all the storage media included in the at least one storage device constitute a storage pool;
the storage network is configured so that every storage node can access all storage media without relying on other storage nodes; and
the storage pool is divided into at least two storage areas, and one storage area is selected from the at least two storage areas as a global quorum disk.
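As a minimal editorial illustration of this layout (not part of the claimed subject matter), the sketch below models a storage pool carved into storage areas, with one area reserved as the global quorum disk; every name in it (StoragePool, jbod-1, the disk sizes, and so on) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StorageBlock:
    device_id: str   # storage device (e.g. a JBOD enclosure) holding the block
    medium_id: str   # disk or flash device inside that enclosure
    size_gb: int

@dataclass
class StorageArea:
    name: str
    blocks: List[StorageBlock] = field(default_factory=list)

@dataclass
class StoragePool:
    areas: List[StorageArea] = field(default_factory=list)
    quorum_area: str = "quorum"   # one area of the pool serves as the global quorum disk

    def area(self, name: str) -> StorageArea:
        return next(a for a in self.areas if a.name == name)

# Two data areas plus a small quorum area spread over two enclosures.
pool = StoragePool(areas=[
    StorageArea("area-1", [StorageBlock("jbod-1", "disk-0", 4000)]),
    StorageArea("area-2", [StorageBlock("jbod-2", "disk-0", 4000)]),
    StorageArea("quorum", [StorageBlock("jbod-1", "disk-7", 10),
                           StorageBlock("jbod-2", "disk-7", 10)]),
])
assert pool.area(pool.quorum_area).blocks   # the quorum area exists and holds blocks
```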
With the storage system provided by the embodiments of the present invention, the storage used by the compute nodes (virtual machines, containers, etc.) on each physical server is itself in the shared storage, in the same shared storage pool as the quorum disk, and is reached over the same storage channel. Therefore, if a server cannot read and write the quorum disk, then whether the server itself has failed or the related storage channel has failed, the compute nodes on that server certainly cannot work normally either, and in this situation using the quorum disk to judge whether the server has failed is particularly accurate.
Brief Description of the Drawings
FIG. 1 is a schematic architectural diagram of a storage system constructed according to an embodiment of the present invention.
FIG. 2 is a schematic architectural diagram of a storage system according to an embodiment of the present invention.
Detailed Description
The present disclosure will be described more fully below with reference to the accompanying drawings, in which embodiments of the disclosure are shown. These embodiments may, however, be implemented in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these examples are provided so that this disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
Various embodiments of the present invention are described in detail below by way of example with reference to the accompanying drawings.
FIG. 1 shows the architecture of a storage system according to an embodiment of the present invention. As shown in FIG. 1, the storage system includes a storage network; storage nodes connected to the storage network, where a storage node is a software module that provides a storage service rather than, in the usual sense, a hardware server containing storage media; and storage devices, likewise connected to the storage network. Each storage device includes at least one storage medium. The storage network is configured so that every storage node can access all storage media without relying on other storage nodes.
With the storage system provided by this embodiment of the present invention, every storage node can access all storage media without relying on other storage nodes, so all the storage media are in effect shared by all the storage nodes, which achieves the effect of a global storage pool.
Meanwhile, as can be seen from the above description, in the prior art the storage node is located on the storage-medium side, or, strictly speaking, the storage media are internal disks of the physical machine where the storage node runs; in the embodiments of the present invention, the physical machine where the storage node runs is independent of the storage device, and the storage device serves mainly as a channel connecting the storage media to the storage network.
According to the embodiments of the present invention, when dynamic rebalancing is needed, there is no need to migrate physical data between different storage media; it is sufficient to rebalance, by configuration, which storage media are managed by which storage nodes.
In another embodiment of the present invention, the storage-node side further includes compute nodes, and a compute node and a storage node are placed in the same physical server, which is connected to the storage devices through the storage network. The converged storage system constructed according to this embodiment, with compute node and storage node on the same physical machine, reduces the number of physical devices required overall and therefore reduces cost. At the same time, a compute node can locally access the storage resources it wishes to access. In addition, because compute node and storage node are aggregated on the same physical server, data exchange between the two can be as simple as shared memory, giving particularly good performance.
In the storage system provided by the embodiments of the present invention, the I/O data path between a compute node and a storage medium consists of: (1) storage medium to storage node; and (2) storage node to the compute node aggregated in the same physical server (a CPU-bus path). By contrast, in the prior-art storage system shown in FIG. 1, the I/O data path between a compute node and a storage medium consists of: (1) storage medium to storage node; (2) storage node to the access switch of the storage network; (3) access switch of the storage network to the core switch; (4) core switch to the access switch of the compute network; and (5) access switch of the compute network to the compute node. Clearly, the total data path of the storage system of the embodiment of the present invention is close to only item (1) of the conventional storage system. In other words, by compressing the I/O data path length to the extreme, the storage system provided by the embodiments of the present invention greatly improves the I/O channel performance of the storage system, and its actual running behavior is very close to the I/O channel of a local hard disk.
In an embodiment of the present invention, the storage node may be a virtual machine of the physical server, a container, or a module running directly on the server's physical operating system, and the compute node may likewise be a virtual machine of the same physical server, a container, or a module running directly on the server's physical operating system. In one embodiment, each storage node may correspond to one or more compute nodes.
Specifically, a physical server may be divided into multiple virtual machines, one of which is used as the storage node while the other virtual machines are used as compute nodes; alternatively, a module on the physical OS may be used as the storage node in order to achieve better performance.
In an embodiment of the present invention, the virtualization technology forming the virtual machines may be KVM, Xen, VMware, or Hyper-V virtualization technology, and the container technology forming the containers may be Docker, Rocket, Odin, Chef, LXC, Vagrant, Ansible, Zone, Jail, or Hyper-V container technology.
In an embodiment of the present invention, each storage node is, at any given time, responsible for managing only a fixed set of storage media, and one storage medium is not written by multiple storage nodes simultaneously, so as to avoid data conflicts. In this way, every storage node can access the storage media it manages without relying on other storage nodes, and the integrity of the data stored in the storage system can be guaranteed.
In an embodiment of the present invention, all the storage media in the system may be divided according to storage logic. Specifically, the storage pool of the entire system may be divided into a logical storage hierarchy of storage areas, storage groups, and storage blocks, in which the storage block is the smallest storage unit. In an embodiment of the present invention, the storage pool may be divided into at least two storage areas.
In an embodiment of the present invention, each storage area may be divided into at least one storage group. In a preferred embodiment, each storage area is divided into at least two storage groups.
In some embodiments, the storage area and the storage group can be merged, so that one level can be omitted from this storage hierarchy.
In an embodiment of the present invention, each storage area (or storage group) may be composed of at least one storage block, where a storage block may be a complete storage medium or a part of a storage medium. To build redundant storage inside a storage area, each storage area (or storage group) may be composed of at least two storage blocks, so that when any one of the storage blocks fails, the complete stored data can be reconstructed from the remaining storage blocks in the group. The redundant storage mode may be a multi-copy mode, a redundant array of independent disks (RAID) mode, or an erasure-code mode. In an embodiment of the present invention, the redundant storage may be established through the ZFS file system. In an embodiment of the present invention, in order to withstand hardware failures of storage devices/storage media, the multiple storage blocks included in one storage area (or storage group) are not placed in the same storage medium, or even in the same storage device. In an embodiment of the present invention, no two storage blocks included in the same storage area (or storage group) are located in the same storage medium/storage device. In another embodiment of the present invention, the number of storage blocks of the same storage area (or storage group) located in the same storage medium/storage device is preferably less than or equal to the redundancy of the redundant storage. For example, when RAID 5 is used for storage redundancy, the redundancy of the redundant storage is 1, so the number of storage blocks of the same storage group located on the same storage device is at most 1; for RAID 6, the redundancy of the redundant storage is 2, so the number of storage blocks of the same storage group located on the same storage device is at most 2.
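The placement rule in the preceding paragraph can be stated compactly in code. The helper below is an editorial sketch under assumed names, not text from the patent: it checks that no storage device carries more blocks of one storage group than the redundancy of the chosen scheme allows.

```python
from collections import Counter
from typing import Iterable, Tuple

def placement_ok(blocks: Iterable[Tuple[str, str]], redundancy: int) -> bool:
    """blocks: (device_id, medium_id) pairs that make up one storage group."""
    per_device = Counter(device for device, _medium in blocks)
    return max(per_device.values(), default=0) <= redundancy

# A RAID 6 group (redundancy 2) tolerates up to two blocks per device;
# a third block on the same JBOD would break the rule.
group = [("jbod-1", "d0"), ("jbod-1", "d1"), ("jbod-2", "d0"), ("jbod-3", "d0")]
assert placement_ok(group, redundancy=2)
assert not placement_ok(group + [("jbod-1", "d2")], redundancy=2)
```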
In an embodiment of the present invention, each storage node can read and write only the storage areas it manages itself. Since read operations by multiple storage nodes on the same storage block do not conflict with each other, whereas simultaneous writes to one storage block by multiple storage nodes easily conflict, in another embodiment each storage node can write only the storage areas it manages but can read both the storage areas it manages and the storage areas managed by other storage nodes; that is, write operations are local, but read operations can be global.
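The "writes are local, reads are global" policy can be pictured with a tiny access check. This is only an illustrative sketch; the ownership map and node names are assumptions.

```python
OWNERSHIP = {"area-1": "node-A", "area-2": "node-B"}   # assumed area-to-node assignment

def may_access(node: str, area: str, op: str) -> bool:
    if op == "read":
        return True                          # reads never conflict, so any node may read
    if op == "write":
        return OWNERSHIP.get(area) == node   # writes stay with the managing node
    raise ValueError(f"unknown operation: {op}")

assert may_access("node-A", "area-2", "read")        # global read
assert not may_access("node-A", "area-2", "write")   # only node-B may write area-2
```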
In one embodiment, the storage system may further include a storage control node, connected to the storage network, for determining the storage areas managed by each storage node. In another embodiment, each storage node may include a storage allocation module for determining the storage areas managed by that storage node; this may be implemented by a communication and coordination algorithm between the storage allocation modules included in the storage nodes, which may, for example, be based on the principle of load balancing between the storage nodes.
In one embodiment, when a storage node is detected to have failed, some or all of the other storage nodes may be configured so that they take over the storage areas previously managed by the failed storage node. For example, one of the storage nodes may take over the storage areas managed by the failed storage node, or they may be taken over by at least two other storage nodes, each of which takes over a portion of the storage areas managed by the failed storage node, for example with the at least two other storage nodes taking over different storage groups in those storage areas.
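One way to picture the takeover step is the round-robin redistribution sketched below. The patent leaves the takeover policy open (one node or several nodes may take over), so the policy, names, and data structures here are assumptions for illustration only.

```python
from itertools import cycle
from typing import Dict, List

def take_over(assignment: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Spread the failed node's storage groups over the surviving nodes."""
    survivors = [n for n in assignment if n != failed]
    orphaned = assignment.pop(failed, [])
    for group, heir in zip(orphaned, cycle(survivors)):
        assignment[heir].append(group)   # the heir re-imports the group from the shared disks
    return assignment

state = {"node-A": ["g1", "g2"], "node-B": ["g3"], "node-C": ["g4"]}
print(take_over(state, "node-A"))   # g1 and g2 end up on node-B and node-C
```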
In one embodiment, the storage medium may include, but is not limited to, a hard disk, flash memory, SRAM, DRAM, an NVMe device, or other forms, and the access interface of the storage medium may include, but is not limited to, a SAS interface, SATA interface, PCI/e interface, DIMM interface, NVMe interface, SCSI interface, or AHCI interface.
In an embodiment of the present invention, the storage network may include at least one storage switching device, and the storage nodes access the storage media through data exchange across the storage switching devices it contains. Specifically, the storage nodes and the storage media are each connected to the storage switching devices through storage channels. According to the embodiments of the present invention, this provides a storage system supporting multipoint control, in which a single storage space can be accessed through multiple channels, for example by the compute nodes.
In an embodiment of the present invention, the storage switching device may be a SAS switch or a PCI/e switch, and correspondingly the storage channel may be a SAS (Serial Attached SCSI) channel or a PCI/e channel.
Taking the SAS channel as an example, compared with traditional IP-based storage solutions, a SAS-switched solution has the advantages of high performance, large bandwidth, and a large number of disks in a single device. Used together with a host bus adapter (HBA) or the SAS interface on a server motherboard, the storage provided by the SAS fabric can easily be accessed by multiple connected servers simultaneously.
Specifically, the SAS switch is connected to a storage device through a SAS cable, and the storage device and the storage media are also connected through SAS interfaces; for example, the storage device internally routes the SAS channel to each storage medium (a SAS switch chip may be set inside the storage device). The bandwidth of a SAS network can reach 24 Gb or 48 Gb, which is tens of times that of Gigabit Ethernet and several times that of expensive 10 Gigabit Ethernet. At the link layer, SAS improves on an IP network by roughly an order of magnitude; at the transport layer, the TCP three-way handshake and four-way close carry high overhead, and the TCP delayed-acknowledgement mechanism and slow start sometimes cause delays on the order of 100 milliseconds, whereas the latency of the SAS protocol is only a small fraction (a few percent) of that of TCP, giving an even larger improvement. In short, a SAS network has enormous advantages over Ethernet-based TCP/IP in bandwidth and latency. Those skilled in the art will appreciate that the performance of a PCI/e channel can likewise meet the needs of the system.
In an embodiment of the present invention, the storage network may include at least two storage switching devices, and each storage node can be connected to any one of the storage devices, and hence to the storage media, through any one of the storage switching devices. When any storage switching device, or a storage channel connected to one of the storage switching devices, fails, the storage nodes read and write the data on the storage devices through the other storage switching devices.
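The dual-switch redundancy described above can be checked mechanically: every server should still reach every storage device when either switch is down. The sketch below is an editorial illustration; the device names mirror the reference numerals of FIG. 2, but the cabling table itself is an assumption.

```python
LINKS = {  # switch -> devices cabled to it (assumed full dual cabling)
    "sas-switch-305": {"server-301", "server-302", "jbod-307", "jbod-308", "jbod-309", "jbod-310"},
    "sas-switch-306": {"server-301", "server-302", "jbod-307", "jbod-308", "jbod-309", "jbod-310"},
}

def reachable(server: str, device: str, failed_switch: str = "") -> bool:
    return any(server in members and device in members
               for switch, members in LINKS.items() if switch != failed_switch)

# With either switch failed, every server still reaches every JBOD.
assert all(reachable(s, j, failed_switch=dead)
           for dead in LINKS
           for s in ("server-301", "server-302")
           for j in ("jbod-307", "jbod-308", "jbod-309", "jbod-310"))
```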
Referring to FIG. 2, a specific storage system 30 constructed according to one embodiment of the present invention is shown. The storage devices in storage system 30 are built as a plurality of JBODs 307-310, which are connected through SAS data cables to two SAS switches 305 and 306; these two SAS switches constitute the switching core of the storage network included in the storage system. At the front end are at least two servers 301 and 302, each of which is connected to the two SAS switches 305 and 306 through an HBA device (not shown) or the SAS interface on its motherboard. A basic network connection exists between the servers for monitoring and communication. Each server runs a storage node that, using information obtained from the SAS links, manages some or all of the disks among all the JBOD disks. Specifically, the storage areas, storage groups, and storage blocks described above in this document can be used to divide the JBOD disks into different storage groups, and each storage node manages one or more such storage groups. When redundant storage is used inside each storage group, the metadata of the redundant storage can be kept on the disks, so that the redundant storage can be recognized directly from the disks by other storage nodes.
In the exemplary storage system 30 shown, a monitoring and management module can be installed on the storage nodes to monitor the status of local storage and of the other servers. When a whole JBOD is abnormal, or a disk on a JBOD is abnormal, data reliability is ensured by the redundant storage. When a server fails, the management module in the storage node on another, pre-designated server identifies locally, from the data on the disks, the disks originally managed by the storage node of the failed server and takes them over. The storage services originally provided by the storage node of the failed server are then continued by the storage node on the new server. In this way, a new, highly available global storage pool structure is achieved.
It can be seen that the exemplary storage system 30 so constructed provides a multipoint-controllable, globally accessible storage pool. On the hardware side, multiple servers provide the external services, and JBODs hold the disks. Each of the JBODs is connected to the two SAS switches, and the two switches are respectively connected to the servers' HBA cards, thereby ensuring that all disks in the JBODs can be accessed by all servers. The redundant SAS links also ensure high availability at the link level.
Locally on each server, redundant storage technology is used to select disks from each JBOD to form redundant storage, so that the loss of a single JBOD does not make data unavailable. When a server fails, the module that monitors the overall state schedules another server to access, through the SAS channel, the disks managed by the storage node of the failed server and to take over those disks quickly, achieving highly available global storage.
Although FIG. 2 is described taking JBODs holding the disks as an example, it should be understood that the embodiment of the present invention shown in FIG. 2 also supports storage devices other than JBODs. In addition, the above takes one (whole) storage medium as one storage block as an example; the same applies when a part of one storage medium is used as one storage block.
In an embodiment of the present invention, whether each server has failed can be monitored as follows: the global storage pool is divided into at least two storage areas, and one storage area is selected from the at least two storage areas as a global quorum disk. Every storage node can read and write the global quorum disk, but at the same time is responsible for managing only zero or more storage areas among the remaining storage areas (excluding the storage area where the global quorum disk resides).
According to the embodiments of the present invention, the global quorum disk is used by the upper-layer application on the servers, namely the storage nodes; that is, every storage node can directly read and write the global quorum disk. Because of the multipoint-control property of storage access, every storage node can see, in step, the content updated by the other storage nodes.
In an embodiment of the present invention, the storage space of the global quorum disk is divided into a plurality of fixed partitions, each of which is respectively assigned to one of the one or more storage nodes, so that concurrent read/write conflicts on the quorum disk between multiple controlling nodes can be avoided.
In an embodiment of the present invention, the global quorum disk may be configured so that each of the one or more storage nodes, when using the global quorum disk, can perform write operations only on the fixed partition assigned to it, and performs read operations on the fixed partitions assigned to the other storage nodes. This enables a storage node to update its own state while following the state changes of the other storage nodes.
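A minimal sketch of such a fixed-partition layout follows. It is an editorial illustration, not the patented implementation: the slot size, the node list, and the use of pread/pwrite on an already opened descriptor of the quorum device are all assumptions.

```python
import os

SLOT_SIZE = 4096                      # assumed fixed partition (slot) size per node
NODES = ["node-A", "node-B", "node-C"]

def slot_offset(node: str) -> int:
    return NODES.index(node) * SLOT_SIZE

def write_own_slot(fd: int, node: str, payload: bytes) -> None:
    # A node writes only the slot assigned to it.
    os.pwrite(fd, payload.ljust(SLOT_SIZE, b"\0"), slot_offset(node))

def read_peer_slots(fd: int, node: str) -> dict:
    # ...and reads the slots assigned to every other node.
    return {peer: os.pread(fd, SLOT_SIZE, slot_offset(peer))
            for peer in NODES if peer != node}
```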
In an embodiment of the present invention, an election lock may be provided on the global quorum disk. When a storage node fails, the remaining storage nodes use the election-lock mechanism to elect the node that takes over. The election-lock mechanism is all the more valuable when a storage node with a special function fails.
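The patent does not spell out how the election lock works internally; the stand-in below only shows the intended effect, namely that exactly one surviving node wins the right to take over. The compare-and-set style API and all names are assumptions.

```python
import threading
from typing import Optional

class ElectionLock:
    """Toy in-memory stand-in for an election lock kept on the quorum disk."""

    def __init__(self) -> None:
        self._holder: Optional[str] = None
        self._guard = threading.Lock()

    def try_acquire(self, node: str) -> bool:
        with self._guard:
            if self._holder is None:
                self._holder = node          # first claimant becomes the takeover node
            return self._holder == node

lock = ElectionLock()
winners = [n for n in ("node-B", "node-C") if lock.try_acquire(n)]
assert winners == ["node-B"]                 # exactly one node wins the election
```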
Specifically, the global quorum disk, being itself a storage area, may also have the characteristics of a storage area discussed above. In an embodiment of the present invention, the global quorum disk includes one or more storage media, or includes some or all of one or more storage media. The storage media included in the global quorum disk may be located in the same storage device or in different storage devices.
For example, the global quorum disk may be composed of one complete storage medium, of two complete storage media, of parts of two storage media, or of a part of one storage medium plus one or several further complete storage media.
In an embodiment of the present invention, the global quorum disk may be built, in redundant-storage fashion, from all or parts of at least two storage media located on at least two storage devices.
Taking JBODs as the storage devices as an example, since every storage node server can access all the storage resources on the JBODs, some storage space can be extracted from one or more disks of each JBOD and combined for use as the global quorum disk. By controlling the distribution of the quorum disk, the reliability of the quorum disk can easily be improved; in the most severe case, the quorum disk still works when only one JBOD in the system survives.
With the storage system provided by the embodiments of the present invention, the storage used by the compute nodes (virtual machines, containers, etc.) on each physical server is itself in the global storage pool, specifically in the same shared storage pool as the quorum disk. The normal reads and writes of the global storage pool by the compute nodes and storage nodes travel over the same storage channel as the storage nodes' reads and writes of the quorum disk. In this situation, if a server cannot read and write the quorum disk, then whether the server itself has failed or the related storage channel has failed, the compute nodes on that server certainly cannot work normally either, that is, they cannot access their normal storage resources. It is therefore very reliable to judge whether the corresponding compute nodes work effectively through this quorum-disk structure.
Specifically, each storage node continuously writes data to the quorum disk, and at the same time each storage node continuously monitors (by reading) whether the other storage nodes periodically write data to the quorum disk; once it finds that a storage node has not written data to the quorum disk on time, it can determine that the compute nodes corresponding to that storage node are not working properly.
The manner in which a storage node continuously writes heartbeat data to the quorum disk is that the storage node periodically writes heartbeat data to the quorum disk at a preset time interval of the system, for example writing data into the quorum disk every five seconds.
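A sketch of the heartbeat scheme in the two preceding paragraphs is given below. The five-second interval comes from the example in the text; the miss threshold, the in-memory slot map, and the function names are assumptions, and a real deployment would write through the quorum-disk slots rather than a local dictionary.

```python
import time

INTERVAL = 5       # seconds between heartbeats ("every five seconds" in the example above)
MISS_LIMIT = 3     # assumed number of missed intervals before a node is presumed failed

slots = {}         # node -> timestamp of its last heartbeat write

def write_heartbeat(node: str) -> None:
    slots[node] = time.time()

def failed_nodes(now: float) -> list:
    return [n for n, ts in slots.items() if now - ts > MISS_LIMIT * INTERVAL]

write_heartbeat("node-A")
slots["node-B"] = time.time() - 4 * INTERVAL   # node-B stopped writing on time
print(failed_nodes(time.time()))               # -> ['node-B']
```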
It should be understood that, in order not to obscure the embodiments of the present invention, the specification describes only some key, and not necessarily essential, techniques and features, and may not describe some features that those skilled in the art are able to implement.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent replacement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. A storage system, characterized by comprising:
    a storage network;
    at least two storage nodes connected to the storage network; and
    at least one storage device connected to the storage network, where each storage device includes at least one storage medium, and all the storage media included in the at least one storage device constitute a shared storage pool;
    wherein the storage network is configured so that every storage node can access all storage media without relying on other storage nodes, and
    the shared storage pool is divided into at least two storage areas, and one storage area is selected from the at least two storage areas as a global quorum disk.
  2. The storage system according to claim 1, characterized in that the storage space of the global quorum disk is divided into a plurality of fixed partitions, each of which is respectively assigned to one of the one or more storage nodes; each storage node can write the fixed partition assigned to it and can read the fixed partitions assigned to the other storage nodes.
  3. The storage system according to claim 2, characterized in that the global quorum disk includes one or more storage media, or includes some or all of one or more storage media; or the storage media included in the global quorum disk may be located in the same storage device or in different storage devices.
  4. The storage system according to claim 3, characterized in that the global quorum disk is built, in redundant-storage fashion, from all or parts of at least two storage media on at least two storage devices.
  5. The storage system according to claim 1, characterized in that an election lock is provided on the global quorum disk; when one storage node fails, the remaining storage nodes use the election-lock mechanism to elect a takeover node.
  6. The storage system according to claim 1, characterized in that the storage network is a SAS network or a PCI/e network.
  7. The storage system according to claim 1, characterized in that the location where the storage nodes persistently store data is the shared storage pool.
  8. The storage system according to claim 7, characterized in that the storage system further includes one or more compute nodes, each of which reads and writes persistently stored data through its corresponding storage node; wherein each compute node and its corresponding storage node are located in the same physical server.
  9. The storage system according to any one of claims 1 to 8, characterized in that each storage node continuously writes heartbeat data to the quorum disk, and at the same time each storage node continuously monitors whether the other storage nodes keep writing heartbeat data to the quorum disk; once it is found that a storage node has not written heartbeat data to the quorum disk on time, that storage node can be judged to have failed.
  10. The storage system according to claim 9, characterized in that the manner in which a storage node continuously writes heartbeat data to the quorum disk is that the storage node periodically writes heartbeat data to the quorum disk at a preset time interval of the system.
PCT/CN2017/077755 2011-10-11 2017-03-22 Storage system WO2017167106A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US16/139,712 US10782898B2 (en) 2016-02-03 2018-09-24 Data storage system, load rebalancing method thereof and access control method thereof
US16/378,076 US20190235777A1 (en) 2011-10-11 2019-04-08 Redundant storage system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610181228.4 2016-03-26
CN201610181228.4A CN105872031B (zh) 2016-03-26 Storage system

Related Parent Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2017/077754 Continuation-In-Part WO2017162177A1 (zh) 2011-10-11 2017-03-22 Redundant storage system, redundant storage method and redundant storage device
PCT/CN2017/077757 Continuation-In-Part WO2017162178A1 (zh) 2011-10-11 2017-03-22 Access control method and device for a storage system

Related Child Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2017/077757 Continuation-In-Part WO2017162178A1 (zh) 2011-10-11 2017-03-22 Access control method and device for a storage system
US16/054,536 Continuation-In-Part US20180341419A1 (en) 2011-10-11 2018-08-03 Storage System

Publications (1)

Publication Number Publication Date
WO2017167106A1 true WO2017167106A1 (zh) 2017-10-05

Family

ID=56625057

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/077755 WO2017167106A1 (zh) 2011-10-11 2017-03-22 Storage system

Country Status (2)

Country Link
CN (1) CN105872031B (zh)
WO (1) WO2017167106A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105872031B (zh) 2016-03-26 2019-06-14 天津书生云科技有限公司 Storage system
CN105472047B (zh) 2016-02-03 2019-05-14 天津书生云科技有限公司 Storage system
CN110244904B (zh) 2018-03-09 2020-08-28 杭州海康威视***技术有限公司 Data storage system, method and device
CN109840247B (zh) 2018-12-18 2020-12-18 深圳先进技术研究院 File system and data layout method
CN109951331B (zh) 2019-03-15 2021-08-20 北京百度网讯科技有限公司 Method, device and computing cluster for sending information
CN111212141A (zh) 2020-01-02 2020-05-29 中国科学院计算技术研究所 Shared storage system
CN115359834B (zh) 2022-10-18 2023-03-24 苏州浪潮智能科技有限公司 Disk arbitration area detection method, apparatus, device and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101582013A (zh) 2009-06-10 2009-11-18 成都市华为赛门铁克科技有限公司 Method, device and system for handling storage hotspots in distributed storage
CN103503414A (zh) 2012-12-31 2014-01-08 华为技术有限公司 Cluster system integrating computing and storage
CN203982354U (zh) 2014-06-19 2014-12-03 天津书生投资有限公司 Redundant storage system
US9110591B2 (en) 2011-04-22 2015-08-18 Hewlett-Packard Development Company, L.P. Memory resource provisioning using SAS zoning
CN105872031A (zh) 2016-03-26 2016-08-17 天津书生云科技有限公司 Storage system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9690818B2 (en) * 2009-12-01 2017-06-27 Sybase, Inc. On demand locking of retained resources in a distributed shared disk cluster environment
US8443231B2 (en) * 2010-04-12 2013-05-14 Symantec Corporation Updating a list of quorum disks
CN104219318B (zh) 2014-09-15 2018-02-13 北京联创信安科技股份有限公司 Distributed file storage system and method
CN104657316B (zh) 2015-03-06 2018-01-19 北京百度网讯科技有限公司 Server

Also Published As

Publication number Publication date
CN105872031A (zh) 2016-08-17
CN105872031B (zh) 2019-06-14

Similar Documents

Publication Publication Date Title
WO2017133483A1 (zh) Storage system
WO2017167106A1 (zh) Storage system
WO2017162179A1 (zh) Load rebalancing method and device for a storage system
US10642704B2 (en) Storage controller failover system
WO2017162177A1 (zh) Redundant storage system, redundant storage method and redundant storage device
US8898385B2 (en) Methods and structure for load balancing of background tasks between storage controllers in a clustered storage environment
US11137940B2 (en) Storage system and control method thereof
US8010829B1 (en) Distributed hot-spare storage in a storage cluster
WO2017162176A1 (zh) Storage system, storage system access method and storage system access device
US9542320B2 (en) Multi-node cache coherency with input output virtualization
US7996608B1 (en) Providing redundancy in a storage system
US8041987B2 (en) Dynamic physical and virtual multipath I/O
JP5523468B2 (ja) Active-active failover for direct-attached storage systems
US20190235777A1 (en) Redundant storage system
US10782898B2 (en) Data storage system, load rebalancing method thereof and access control method thereof
US20110145452A1 (en) Methods and apparatus for distribution of raid storage management over a sas domain
US7434107B2 (en) Cluster network having multiple server nodes
WO2017162178A1 (zh) Access control method and device for a storage system
US8788753B2 (en) Systems configured for improved storage system communication for N-way interconnectivity
US10901626B1 (en) Storage device
US10318393B2 (en) Hyperconverged infrastructure supporting storage and compute capabilities
US10782989B2 (en) Method and device for virtual machine to access storage device in cloud computing management platform
US11201788B2 (en) Distributed computing system and resource allocation method
US11188425B1 (en) Snapshot metadata deduplication
Dell

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17773136

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17773136

Country of ref document: EP

Kind code of ref document: A1