US20190196969A1 - Method and apparatus for adaptive cache load balancing for ssd-based cloud computing storage system - Google Patents


Info

Publication number
US20190196969A1
US20190196969A1 (Application No. US 15/971,349)
Authority
US
United States
Prior art keywords
workload
ssds
degree
range
spike
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/971,349
Inventor
Zhengyu Yang
Morteza HOSEINZADEH
Thomas David Evans
Clay Mayers
Thomas Bolt
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to US15/971,349 priority Critical patent/US20190196969A1/en
Assigned to SAMSUNG ELECTRONICS CO., LTD. reassignment SAMSUNG ELECTRONICS CO., LTD. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: EVANS, THOMAS DAVID, MAYERS, CLAY, HOSEINZADEH, MORTEZA, YANG, Zhengyu
Priority to KR1020180143031A priority patent/KR20190084203A/en
Priority to CN201811441364.8A priority patent/CN109962969B/en
Publication of US20190196969A1 publication Critical patent/US20190196969A1/en
Priority to US17/006,285 priority patent/US11403220B2/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629Configuration or reconfiguration of storage systems
    • G06F3/0631Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0616Improving the reliability of storage systems in relation to life time, e.g. increasing Mean Time Between Failures [MTBF]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629Configuration or reconfiguration of storage systems
    • G06F3/0635Configuration or reconfiguration of storage systems by changing the path, e.g. traffic rerouting, path reconfiguration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0653Monitoring storage devices or systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0662Virtualisation aspects
    • G06F3/0664Virtualisation aspects at device level, e.g. emulation of a storage device or system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0685Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0688Non-volatile semiconductor memory arrays
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/142Network analysis or design using statistical or mathematical methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/12Avoiding congestion; Recovering from congestion
    • H04L47/125Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing
    • H04L67/1008Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing
    • H04L67/1019Random or heuristic server selection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/104Peer-to-peer [P2P] networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2206/00Indexing scheme related to dedicated interfaces for computers
    • G06F2206/10Indexing scheme related to storage interfaces for computers, indexing schema related to group G06F3/06
    • G06F2206/1012Load balancing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • G06F2212/1024Latency reduction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/22Employing cache memory using specific memory technology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60Details of cache memory
    • G06F2212/608Details relating to cache mapping
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0876Network utilisation, e.g. volume of load or congestion level
    • H04L43/0882Utilisation of link capacity

Definitions

  • the present disclosure relates generally to load balancing of a cloud computing storage system, and more particularly, to adaptive load balancing in a cache solid state device (SSD) tier of a multi-tier storage system based on the degree of a workload spike.
  • SSDs generally have an input/output (I/O) performance that is orders of magnitude faster than that of traditional hard disk drives (HDDs).
  • SSDs are utilized as a performance cache tier in modern datacenter storage systems. These SSDs are intended to absorb hot datasets to reduce the long I/O response time that would occur if these I/Os were forwarded to low-end SSDs or even slower HDDs.
  • an apparatus includes a memory and a processor.
  • the processor is configured to detect a degree of a change in a workload in an I/O stream received through a network from one or more user devices.
  • the processor is also configured to determine a degree range, from a plurality of preset degree ranges, that the degree of the change in the workload is within.
  • the processor is further configured to determine a distribution strategy, from among a plurality of distribution strategies, to distribute the workload across one or more of a plurality of SSDs in a performance cache tier of a centralized multi-tier storage pool, based on the determined degree range.
  • the processor is configured to distribute the workload across the one or more of the plurality of solid state devices based on the determined distribution strategy.
  • a method in which a processor of an application server layer detects a degree of a change in a workload in an I/O stream received through a network from one or more user devices.
  • the processor determines a degree range, from a plurality of preset degree ranges, that the degree of the change in the workload is within.
  • the processor determines a distribution strategy, from among a plurality of distribution strategies, to distribute the workload across one or more of a plurality of SSDs in a performance cache tier of a centralized multi-tier storage pool, based on the determined degree range.
  • the processor distributes the workload across the one or more of the plurality of solid state devices based on the determined distribution strategy.
  • FIG. 1 is a diagram illustrating a system architecture that includes an adaptive cache load balancer (ACLB), according to an embodiment of the present disclosure
  • FIG. 2 is a diagram illustrating an example of workload spike detection, according to an embodiment of the present disclosure
  • FIG. 3 is a flowchart illustrating a method for detecting and compensating for a workload spike, according to an embodiment of the present disclosure
  • FIG. 4 is a diagram illustrating the dispatching of jobs of an I/O stream based on the detected workload spike, according to an embodiment of the present disclosure
  • FIG. 5 is a flowchart illustrating a method for adaptive cache load balancing, according to an embodiment of the present disclosure
  • FIG. 6 is a flowchart illustrating a method for selection of SSDs when a strong workload spike is detected, according to an embodiment of the present disclosure
  • FIG. 7 is a flowchart illustrating a method for selection of SSDs when no workload spike is detected, according to an embodiment of the present disclosure.
  • FIG. 8 is a block diagram illustrating an illustrative hardware implementation of a computing system, according to an embodiment of the present disclosure.
  • Although terms such as first, second, etc. may be used for describing various elements, the structural elements are not restricted by these terms. The terms are only used to distinguish one element from another element. For example, without departing from the scope of the present disclosure, a first structural element may be referred to as a second structural element. Similarly, the second structural element may also be referred to as the first structural element. As used herein, the term “and/or” includes any and all combinations of one or more associated items.
  • An adaptive cache load balancer (ACLB) is described herein that utilizes a spike-aware algorithm to detect a workload change (i.e., I/O spike).
  • the ACLB adaptively uses different strategies to distribute the workload across SSDs in the performance cache tier to improve performance and extend the lifetimes of the SSDs.
  • the distribution is performed by balancing usage of one or more types of SSD resources, such as, for example, throughput, bandwidth, storage space, and a worn-out level (i.e., program/erase (P/E) cycle usage).
  • FIG. 1 is a diagram illustrating a system architecture that includes the ACLB, according to an embodiment of the present disclosure.
  • the system architecture includes a cloud user layer 102 , an application server layer 104 , and a centralized multi-tier storage pool 106 .
  • the cloud user layer 102 includes heterogeneous application users that send requests through their devices to the cloud through a network 103 . Each user may have a different request pattern resulting in differing temporal and spatial distributions of data.
  • the application server layer 104 includes multiple physical virtual machine servers 108 - 1 , 108 - 2 , 108 - n , each of which includes one or more virtual machines (VMs) 110 - 1 , 110 - 2 , 110 - n.
  • VMs run the guest operating system (OS) and applications, and are isolated from each other. Cloud service vendors may “rent” these VMs to application service users. VMs may have different workload patterns based on user applications, and thus, they will have different levels of sensitivity to storage device speeds.
  • a VM hypervisor 112 hosts the one or more VMs 110 - 1 , 110 - 2 , 110 - n for a given physical server 108 - 1 .
  • the VM hypervisor 112 is responsible for scheduling, resource management, system software application programming interface (API), and hardware virtualization.
  • An ACLB daemon 114 is installed on the VM hypervisor 112 of each physical server 108 - 1 , and receives input from respective ACLB I/O filters 116 - 1 , 116 - 2 , 116 - n that correspond to each of the VMs 110 - 1 , 110 - 2 , 110 - n.
  • the ACLB I/O filters 116 - 1 , 116 - 2 , 116 - n are responsible for collecting I/O-related statistics of every VM.
  • the data may be collected at a sample period rate and the results may be sent to the ACLB Daemon 114 on the host system responsible for collecting all data from all VMs.
  • the ACLB Daemon 114 tracks the workload change (e.g., the I/O access pattern change) of the physical server 108 - 1 , and collects device runtime performance information from the ACLB controller 118 and the ACLB I/O filters 116 - 1 , 116 - 2 , 116 - n.
  • the ACLB may consist of the ACLB I/O filters 116 - 1 , 116 - 2 , 116 - n , the ACLB daemons 114 , and the ACLB controller 118 .
  • the ACLB controller (or processor) 118 may be running on a dedicated server or on an embedded system in the storage system.
  • the ACLB controller 118 is responsible for making I/O allocation decisions.
  • the physical servers 108 - 1 , 108 - 2 , 108 - n are connected to the centralized multi-tier storage pool layer 106 through fiber channels, or some other means, to share the backend SSD-HDD hybrid storage systems or all flash multiple tier SSD storage systems, which include, for example, non-volatile memory express (NVMe) SSDs, 3D cross point (XPoint) NVM SSDs, multi-level cell (MLC)/triple-level cell (TLC)/quad-level cell (QLC) SSDs, and traditional spinning HDDs.
  • Each tier of the storage pool 106 has different specialties, such as, for example, fast speed, large storage capacity, etc.
  • the tiers include a cache SSD tier 120 , a capacity SSD tier 122 , and a capacity HDD tier 124 .
  • Embodiments of the present disclosure focus on load balancing of the workload in a top tier, which is usually the cache SSD tier 120 (or performance cache tier) consisting of the fastest and most expensive SSDs in the centralized multi-tier storage pool layer 106 .
  • a multi-strategy load balancer provides different strategies under different workload spike scenarios.
  • Performance cache tier SSD information is periodically retrieved, and based on workload spike detection results, different strategies are utilized for I/O stream allocation.
  • the performance cache tier SSD information is retrieved by calling a workload spike detection component.
  • the ACLB categorizes a current I/O stream into one of three different ranges, and one of three different corresponding strategies is used to allocate the I/O stream.
  • each individual I/O request is not considered since it is too fine-grained and expensive for conducting an optimization calculation.
  • an I/O stream is considered, which is defined as a batch of I/O requests having the following properties:
  • I/O streams may be assigned by associating each application write thread operation with a stream, so that an I/O stream provides better endurance and improved performance.
  • an index of dispersion I is used to detect a workload spike in the incoming traffic.
  • the advantage of utilizing the index of dispersion I is that it can qualitatively capture spikes in a single number, and thus, provide a simple yet powerful way to promptly identify the start and the end of a spike period.
  • The mathematical definition of the index of dispersion I of a stochastic process is provided in Equation (1) below:
  • I = SCV · ( 1 + 2γ · Σ_{k=1..∞} ACF(k) )  (1)
  • where SCV is the squared coefficient of variation, ACF(k) is the autocorrelation function at lag k, and γ is a knob to adjust the weight of the autocorrelation function ACF(k).
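As a concrete illustration, the definition above can be evaluated over a window of per-interval request counts. This is a minimal sketch, not the patent's exact formulation: the weighting knob `gamma`, the lag cutoff `max_lag`, and the estimator details are assumptions.

```python
import statistics

def index_of_dispersion(counts, gamma=1.0, max_lag=10):
    """Estimate the index of dispersion I of a window of per-interval
    arrival counts: I = SCV * (1 + 2*gamma * sum_k ACF(k)).
    The knob `gamma` and the finite lag cutoff are assumptions."""
    n = len(counts)
    mean = statistics.fmean(counts)
    var = statistics.pvariance(counts)
    if mean == 0 or var == 0:
        return 0.0                      # a flat window carries no spike signal
    scv = var / (mean * mean)           # squared coefficient of variation

    def acf(k):                         # autocorrelation at lag k
        return sum((counts[i] - mean) * (counts[i + k] - mean)
                   for i in range(n - k)) / (n * var)

    acf_sum = sum(acf(k) for k in range(1, min(max_lag, n - 1) + 1))
    return scv * (1 + 2 * gamma * acf_sum)
```

A bursty window (long runs of high and low counts) yields a noticeably larger I than a steady window with the same mean, which is what makes a single-number spike detector feasible.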
  • the index of dispersion I may only be based on one aspect, such as, for example, working volume, or may be based on multiple aspects, such as, for example, a comprehensive equation based on:
  • FIG. 2 is a diagram illustrating an example of workload spike detection, according to an embodiment of the present disclosure. As shown, the diagram has an x-axis of time and a y-axis of index of dispersion I, which provides a degree of a detected workload spike. However, the y-axis may be any type of spike measurement value.
  • the ACLB categorizes workload spike degrees into three ranges: strong spike ( ⁇ S ), weak spike ( ⁇ W ), and non-spike ( ⁇ N ), based on corresponding preset range thresholds. For example, [0, 30] may be set as a non-spike range, [30, 70] may be set as a weak spike range, and [70, 100] may be set as a strong spike range.
  • the ACLB uses a join shortest queue (JSQ)-based runtime random-greedy algorithm. Specifically, during runtime, ACLB selects the top K number of SSDs with the shortest queues, and conducts random allocation among these top K SSDs. For cases involving no spike detection, an optimization framework calculation is performed.
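The JSQ-based runtime random-greedy selection described above can be sketched as follows; the function name and its input (a list of per-SSD queue lengths) are illustrative assumptions.

```python
import random

def jsq_random_greedy(queue_lengths, k, rng=random):
    """Pick one SSD for the next I/O stream: rank SSDs by queue length,
    keep the top K with the shortest queues, then choose uniformly at
    random among those K. Returns the index of the chosen SSD."""
    ranked = sorted(range(len(queue_lengths)), key=lambda i: queue_lengths[i])
    top_k = ranked[:k]                  # the K idlest (shortest-queue) SSDs
    return rng.choice(top_k)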
  • a workload spike detector 302 is used to detect the existence and magnitude of a workload spike.
  • the workload spike detector may detect a strong spike in block 304 , a weak spike in block 306 , or may not detect a spike in block 308 . If a strong spike is detected in block 304 , the JSQ-based runtime random-greedy algorithm 310 is used in accordance with a large range (K H ) of the idlest, or least busy SSDs, in block 312 .
  • the JSQ-based runtime random-greedy algorithm 310 is used in accordance with a small range (K L ) of the idlest SSDs, in block 314 . If no spike is detected in block 308 , the optimization framework calculation is performed, in block 316 .
  • FIG. 4 is a diagram illustrating the dispatching of jobs of an I/O stream based on the workload spike, according to an embodiment of the present disclosure.
  • a dispatcher 402 dispatches I/O streams from a dispatcher queue 404 to SSDs 406 .
  • the ACLB selects the disk for the I/O based on the optimization framework.
  • the ACLB selects a large range (K H ) of the idlest SSDs and conducts random allocation among these selected SSDs for the I/O job.
  • If a weak spike is detected, the ACLB selects a small range (K L ) of the idlest SSDs and conducts random allocation among these selected SSDs for the I/O job.
  • FIG. 5 is a flowchart illustrating a method for adaptive cache load balancing, according to an embodiment of the present disclosure.
  • Upon determining to begin the methodology in steps 502 and 504, it is determined, for each I/O stream, whether the current time is the beginning of an epoch, in step 506. If the current time is the beginning of an epoch, load information from all SSDs is updated, in step 508. If the time divided by the epoch length results in a remainder of 0, it is an epoch boundary and new load data is obtained. Specifically, if it is the beginning of a new epoch, the ACLB collects queue information from all SSDs by calling the function updateLoadInfo( ), in step 508. If the current time is not the beginning of an epoch, the methodology proceeds directly to step 510 using the load data that was already obtained at the start of the current epoch.
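The epoch-boundary test described above (time modulo epoch length leaving a zero remainder) reduces to a one-line check; the names here are illustrative.

```python
def is_epoch_boundary(t, epoch_len):
    """True when time t falls on an epoch boundary, i.e., t modulo the
    epoch length leaves a zero remainder, so fresh SSD load data is due."""
    return t % epoch_len == 0
```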
  • a degree of a detected workload spike is determined.
  • FIG. 6 is a flowchart illustrating a method for selection of SSDs when there is a strong workload spike, according to an embodiment of the present disclosure.
  • Since K H is close to the total number of SSDs, the ACLB has behavior similar to a random assignment method, which allows the spike workload to be shared among a large number of the top idlest SSDs, thereby alleviating load imbalance.
  • If it is determined in step 512 that the spike degree is not within the strong spike range ⁇ S , it is determined whether the spike degree is within the weak spike range ⁇ W , in step 516. If it is determined that the spike degree is within the weak spike range ⁇ W , the JSQ-based runtime random-greedy algorithm is used in accordance with a small range (K L ) of the idlest SSDs, in step 518. This algorithm functions in the same manner as described in FIG. 6 , only using K L SSDs instead of K H SSDs. When it is not the beginning of an epoch, the ACLB allocates the head I/O request from the dispatcher queue randomly to the top K idlest SSDs. Once complete, the ACLB removes the I/O request from the dispatcher queue.
  • If it is determined in step 516 that the spike degree is not within the weak spike range ⁇ W , it is determined whether the spike degree is within the non-spike range ( ⁇ N ), in step 520. If it is determined that the spike degree is in the non-spike range ( ⁇ N ), a runtime optimization framework is performed over all K SSDs, in step 522. If it is determined that the spike degree is not within the non-spike range ( ⁇ N ), the methodology returns to step 502 and repeats.
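The three-way dispatch of FIG. 5 can be summarized in a short sketch. The threshold values and the K H / K L defaults below are illustrative assumptions drawn from the example ranges given earlier, not values prescribed by the patent.

```python
def dispatch(spike_degree, n_ssds, k_high=None, k_low=2):
    """Choose a load-balancing strategy and candidate-SSD count for the
    three spike ranges of FIG. 5. Returns (strategy, candidate_count).
    The thresholds 30/70 and the K_H/K_L defaults are assumptions."""
    if k_high is None:
        k_high = max(1, n_ssds - 1)          # K_H close to the total SSD count
    if spike_degree >= 70:                   # strong spike range
        return ("jsq-random-greedy", k_high)
    if spike_degree >= 30:                   # weak spike range
        return ("jsq-random-greedy", k_low)
    return ("optimization-framework", n_ssds)  # no spike: full optimization
```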
  • FIG. 7 is a flowchart illustrating a method for selection of SSDs when no workload spike is detected, according to an embodiment of the present disclosure.
  • the ACLB has time to conduct full optimization, in which it strives to balance resources, such as, for example, throughput, bandwidth, space, and a worn-out level.
  • In step 702 of FIG. 7 , it is determined whether the queue is empty and whether the current time is the beginning of an epoch.
  • job j is set to the head of the dispatcher queue, in step 704 .
  • In step 706 , it is determined whether the methodology has reached the end of the SSD list. If the methodology has reached the end of the SSD list, the methodology proceeds to step 716 .
  • the ACLB calculates a coefficient of variation (CV) for an SSD based on each type of resource (not limited to the four types described below), in step 708.
  • CV is used to score a degree of balance for each type of average resource utilization rate, as set forth below in Equations (2) through (6).
  • Each type of resource may have a different weight based on the environment preference.
  • P max , B max , S max , L max are preset upper bounds of a utilization ratio for each type of resource.
  • Average utilization can be in a monitoring window, or an exponentially weighted moving average (EWMA) window that averages the data in a way that gives less and less weight to data as they are further removed in time.
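An EWMA over utilization samples might look like the following sketch; the smoothing factor `alpha` is an assumption, not a value given in the disclosure.

```python
def ewma(samples, alpha=0.3):
    """Exponentially weighted moving average of utilization samples: each
    new sample gets weight alpha, and older data's influence decays
    geometrically the further removed it is in time."""
    avg = samples[0]
    for x in samples[1:]:
        avg = alpha * x + (1 - alpha) * avg
    return avg
```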
  • the ACLB is required to know the runtime usage information of the storage pool, such as, for example, throughput, bandwidth, storage, and P/E cycle.
  • Δ P = Δ L · WA( s, j )  (7)
  • where Δ P and Δ L are the physical and logical write amounts (in bytes), respectively.
  • the logical write is known from the application side.
  • WA(s, j) is the write amplification function, which takes the SSD and the new I/O stream to be assigned to this SSD, and then returns the write amplification factor.
  • In step 710, it is determined whether CV returns a value of −1, which indicates that any one of the resources exceeds its corresponding resource upper bound. If a value of −1 is returned for the CV, the SSD is skipped and a next SSD is selected in step 712. If the resources do not exceed the resource upper bounds, the result of the CV equation is added to a CV_Rec vector that stores the results of the equation, in step 714, before selecting a next SSD, in step 712, and returning to step 706.
  • the CV_Rec vector is defined as ⁇ CV(Util(ResourceType1)), CV(Util(ResourceType2)), CV(Util(ResourceType3)) . . . >.
  • step 706 the ACLB picks the minimal CV result (as also shown in Equation (2)) and corresponding SSD for assignment of the job j, and also removes the job from the dispatcher queue, in step 716 .
  • the methodology then returns to step 702 to repeat.
  • FIG. 8 a block diagram illustrates an illustrative hardware implementation of a computing system in accordance with which one or more components/methodologies of the disclosure (e.g., components/methodologies described in the context of FIGS. 1-7 ) may be implemented.
  • the computer system may be implemented in accordance with a processor 810 , a memory 812 , input/output (I/O) devices 814 , and a network interface 816 , coupled via a computer bus 818 or alternate connection arrangement.
  • processor is intended to include any processing device, such as, for example, one that includes, but is not limited to, a central processing unit (CPU) and/or other processing circuitry. It is also to be understood that the term “processor” may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices.
  • CPU central processing unit
  • processor may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices.
  • memory is intended to include memory associated with a processor or CPU, such as, for example, random access memory (RAM), read only memory (ROM), a fixed memory device (e.g., hard drive), a removable memory device, and flash memory.
  • RAM random access memory
  • ROM read only memory
  • fixed memory device e.g., hard drive
  • removable memory device e.g., hard disk
  • flash memory e.g., hard disk drives
  • input/output devices or “I/O devices”, as used herein, is intended to include, for example, one or more input devices for entering information into the processor or processing unit, and/or one or more output devices for outputting information associated with the processing unit.
  • network interface is intended to include, for example, one or more transceivers to permit the computer system to communicate with another computer system via an appropriate communications protocol. This may provide access to other computer systems.
  • Software components, including instructions or code, for performing the methodologies described herein may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and executed by a CPU.
  • ROM read-only memory
  • RAM random access memory
  • the present disclosure may be utilized in conjunction with the manufacture of integrated circuits, which are considered part of the methods and apparatuses described herein.
  • Embodiments of the present disclosure detect and predict workload change (i.e., I/O spike prediction), and provide three different strategies for different spike degrees to better balance the loads across SSDs to improve performance and extend lifetime. Workload is balanced transparently to the user, VM I/O performance is improved, and SSD lifetime is extended.
  • workload change i.e., I/O spike prediction

Abstract

An apparatus, a method, a method of manufacturing an apparatus, and a method of constructing an integrated circuit are provided. A processor of an application server layer detects a degree of a change in a workload in an input/output stream received through a network from one or more user devices. The processor determines a degree range, from a plurality of preset degree ranges, that the degree of the change in the workload is within. The processor determines a distribution strategy, from among a plurality of distribution strategies, to distribute the workload across one or more of a plurality of solid state devices (SSDs) in a performance cache tier of a centralized multi-tier storage pool, based on the determined degree range. The processor distributes the workload across the one or more of the plurality of solid state devices based on the determined distribution strategy.

Description

    PRIORITY
  • This application claims priority under 35 U.S.C. § 119(e) to a U.S. Provisional Patent Application filed on Dec. 22, 2017 in the United States Patent and Trademark Office and assigned Ser. No. 62/609,871, the entire contents of which are incorporated herein by reference.
  • FIELD
  • The present disclosure relates generally to load balancing of a cloud computing storage system, and more particularly, to adaptive load balancing in a cache solid state device (SSD) tier of a multi-tier storage system based on the degree of a workload spike.
  • BACKGROUND
  • SSDs generally have an input/output (I/O) performance that is orders of magnitude faster than that of traditional hard disk drives (HDDs). In view of this difference in performance, SSDs are utilized as a performance cache tier in modern datacenter storage systems. These SSDs are intended to absorb hot datasets to reduce the long I/O response time that would occur if these I/Os were forwarded to low-end SSDs or even slower HDDs.
  • Different applications have different workload characteristics. When such applications are used in coordination with modern datacenter storage systems (e.g., cloud computing storage systems), in order to satisfy peak use, efficient resource allocation schemes are required. However, traditional load balancers, such as, for example, random and greedy algorithms, neglect cases that involve a spike in workload, and experience significant performance degradation.
  • SUMMARY
  • According to one embodiment, an apparatus is provided that includes a memory and a processor. The processor is configured to detect a degree of a change in a workload in an I/O stream received through a network from one or more user devices. The processor is also configured to determine a degree range, from a plurality of preset degree ranges, that the degree of the change in the workload is within. The processor is further configured to determine a distribution strategy, from among a plurality of distribution strategies, to distribute the workload across one or more of a plurality of SSDs in a performance cache tier of a centralized multi-tier storage pool, based on the determined degree range. Finally, the processor is configured to distribute the workload across the one or more of the plurality of solid state devices based on the determined distribution strategy.
  • According to one embodiment, a method is provided in which a processor of an application server layer detects a degree of a change in a workload in an I/O stream received through a network from one or more user devices. The processor determines a degree range, from a plurality of preset degree ranges, that the degree of the change in the workload is within. The processor determines a distribution strategy, from among a plurality of distribution strategies, to distribute the workload across one or more of a plurality of SSDs in a performance cache tier of a centralized multi-tier storage pool, based on the determined degree range. The processor distributes the workload across the one or more of the plurality of solid state devices based on the determined distribution strategy.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The above and other aspects, features, and advantages of certain embodiments of the present disclosure will be more apparent from the following detailed description, taken in conjunction with the accompanying drawings, in which:
  • FIG. 1 is a diagram illustrating a system architecture that includes an adaptive cache load balancer (ACLB), according to an embodiment of the present disclosure;
  • FIG. 2 is a diagram illustrating an example of workload spike detection, according to an embodiment of the present disclosure;
  • FIG. 3 is a flowchart illustrating a method for detecting and compensating for a workload spike, according to an embodiment of the present disclosure;
  • FIG. 4 is a diagram illustrating the dispatching of jobs of an I/O stream based on the detected workload spike, according to an embodiment of the present disclosure;
  • FIG. 5 is a flowchart illustrating a method for adaptive cache load balancing, according to an embodiment of the present disclosure;
  • FIG. 6 is a flowchart illustrating a method for selection of SSDs when a strong workload spike is detected, according to an embodiment of the present disclosure;
  • FIG. 7 is a flowchart illustrating a method for selection of SSDs when no workload spike is detected, according to an embodiment of the present disclosure; and
  • FIG. 8 is a block diagram illustrating an illustrative hardware implementation of a computing system, according to an embodiment of the present disclosure.
  • DETAILED DESCRIPTION
  • Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings. It should be noted that the same elements will be designated by the same reference numerals although they are shown in different drawings. In the following description, specific details such as detailed configurations and components are merely provided to assist with the overall understanding of the embodiments of the present disclosure. Therefore, it should be apparent to those skilled in the art that various changes and modifications of the embodiments described herein may be made without departing from the scope of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. The terms described below are terms defined in consideration of the functions in the present disclosure, and may be different according to users, intentions of the users, or customs. Therefore, the definitions of the terms should be determined based on the contents described herein.
  • The present disclosure may have various modifications and various embodiments, among which embodiments are described below in detail with reference to the accompanying drawings. However, it should be understood that the present disclosure is not limited to the embodiments, but includes all modifications, equivalents, and alternatives within the scope of the present disclosure.
  • Although the terms including an ordinal number such as first, second, etc. may be used for describing various elements, the structural elements are not restricted by the terms. The terms are only used to distinguish one element from another element. For example, without departing from the scope of the present disclosure, a first structural element may be referred to as a second structural element. Similarly, the second structural element may also be referred to as the first structural element. As used herein, the term “and/or” includes any and all combinations of one or more associated items.
  • The terms used herein are merely used to describe various embodiments of the present disclosure but are not intended to limit the present disclosure. Singular forms are intended to include plural forms unless the context clearly indicates otherwise. In the present disclosure, it should be understood that the terms “include” or “have” indicate existence of a feature, a number, a step, an operation, a structural element, parts, or a combination thereof, and do not exclude the existence or probability of the addition of one or more other features, numerals, steps, operations, structural elements, parts, or combinations thereof.
  • Unless defined differently, all terms used herein have the same meanings as those understood by a person skilled in the art to which the present disclosure belongs. Terms such as those defined in a generally used dictionary are to be interpreted to have the same meanings as the contextual meanings in the relevant field of art, and are not to be interpreted to have ideal or excessively formal meanings unless clearly defined in the present disclosure.
  • An adaptive cache load balancer (ACLB) is described herein that utilizes a spike-aware algorithm to detect a workload change (i.e., I/O spike). In response to detecting the workload change, the ACLB adaptively uses different strategies to distribute the workload across SSDs in the performance cache tier to improve performance and extend the lifetimes of the SSDs. The distribution is performed by balancing usage of one or more types of SSD resources, such as, for example, throughput, bandwidth, storage space, and a worn-out level (i.e., program/erase (P/E) cycle usage).
  • Referring initially to FIG. 1, a diagram illustrates a system architecture that includes the ACLB, according to an embodiment of the present disclosure. Specifically, the system architecture includes a cloud user layer 102, an application server layer 104, and a centralized multi-tier storage pool 106.
  • The cloud user layer 102 includes heterogeneous application users that send requests through their devices to the cloud through a network 103. Each user may have a different request pattern resulting in differing temporal and spatial distributions of data.
  • The application server layer 104 includes multiple physical virtual machine servers 108-1, 108-2, 108-n, each of which includes one or more virtual machines (VMs) 110-1, 110-2, 110-n.
  • VMs run the guest operating system (OS) and applications, and are isolated from each other. Cloud service vendors may “rent” these VMs to application service users. VMs may have different workload patterns based on user applications, and thus, they will have different levels of sensitivity to storage device speeds. A VM hypervisor 112 hosts the one or more VMs 110-1, 110-2, 110-n for a given physical server 108-1. The VM hypervisor 112 is responsible for scheduling, resource management, system software application programming interface (API), and hardware virtualization. An ACLB daemon 114 is installed on the VM hypervisor 112 of each physical server 108-1, and receives input from respective ACLB I/O filters 116-1, 116-2, 116-n that correspond to each of the VMs 110-1, 110-2, 110-n.
  • The ACLB I/O filters 116-1, 116-2, 116-n are responsible for collecting I/O-related statistics of every VM. The data may be collected at a sample period rate and the results may be sent to the ACLB Daemon 114 on the host system responsible for collecting all data from all VMs.
  • The ACLB daemon 114 from each of the physical servers 108-1, 108-2, 108-n communicates with a single ACLB controller 118 within the application server layer 104. The ACLB Daemon 114 tracks the workload change (e.g., the I/O access pattern change) of the physical server 108-1, and collects device runtime performance information from the ACLB controller 118 and the ACLB I/O filters 116-1, 116-2, 116-n.
  • The ACLB may consist of the ACLB I/O filters 116-1, 116-2, 116-n, the ACLB daemons 114, and the ACLB controller 118.
  • The ACLB controller (or processor) 118 may be running on a dedicated server or on an embedded system in the storage system. The ACLB controller 118 is responsible for making I/O allocation decisions.
  • The physical servers 108-1, 108-2, 108-n are connected to the centralized multi-tier storage pool layer 106 through fiber channels, or some other means, to share the backend SSD-HDD hybrid storage systems or all flash multiple tier SSD storage systems, which include, for example, non-volatile memory express (NVMe) SSDs, 3D cross point (XPoint) NVM SSDs, multi-level cell (MLC)/triple-level cell (TLC)/quad-level cell (QLC) SSDs, and traditional spinning HDDs. Each tier of the storage pool 106 has different specialties, such as, for example, fast speed, large storage capacity, etc. The tiers include a cache SSD tier 120, a capacity SSD tier 122, and a capacity HDD tier 124. Embodiments of the present disclosure focus on load balancing of the workload in a top tier, which is usually the cache SSD tier 120 (or performance cache tier) consisting of the fastest and most expensive SSDs in the centralized multi-tier storage pool layer 106.
  • Herein, a multi-strategy load balancer provides different strategies under different workload spike scenarios. Performance cache tier SSD information is periodically retrieved, and based on workload spike detection results, different strategies are utilized for I/O stream allocation. Specifically, the performance cache tier SSD information is retrieved by calling a workload spike detection component. Based on the spike degree results, the ACLB categorizes a current I/O stream into one of three different ranges, and one of three different corresponding strategies is used to allocate the I/O stream.
  • Herein, each individual I/O request is not considered since it is too fine-grained and expensive for conducting an optimization calculation. Instead, an I/O stream is considered, which is defined as a batch of I/O requests having the following properties:
      • 1. correlated with applications (i.e., from the same application);
      • 2. exhibiting locality patterns, such that data with similar lifetimes are stored in the same erase block, reducing write amplification (garbage collection overhead); and
      • 3. having all associated data invalidated at the same time (e.g., updated, trimmed, unmapped, deallocated, etc.).
  • With multi-stream SSDs, I/O streams may be assigned by associating each application write thread operation with a stream, so that an I/O stream provides better endurance and improved performance.
  • Workload spike detection methods are not limited strictly to those disclosed herein. However, in one example, an index of dispersion I is used to detect a workload spike in the incoming traffic. The advantage of utilizing the index of dispersion I is that it can qualitatively capture spikes in a single number, and thus, provide a simple yet powerful way to promptly identify the start and the end of a spike period. The mathematical definition of the index of dispersion I of a stochastic process is provided in Equation (1) below:
  • I = SCV · (1 + α · Σk∈[1, Kmax] ACF(k))   (1)
  • where SCV is the squared coefficient of variation and ACF(k) is the autocorrelation function at lag k. α is a knob to adjust the weight of the autocorrelation term ACF(k). The joint presence of SCV and autocorrelations in the index of dispersion I is sufficient to discriminate traces with different spike intensities, and thus, to capture changes in user demands.
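As an illustrative sketch (not the patent's implementation), Equation (1) can be computed from a trace of per-interval request counts. The sample ACF estimator and the defaults α = 1 and Kmax = 5 are assumptions:

```python
import statistics

def autocorrelation(xs, k):
    # Sample autocorrelation of the series xs at lag k (a standard estimator;
    # the disclosure does not fix a particular ACF estimator).
    n = len(xs)
    mean = statistics.fmean(xs)
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    return sum((xs[i] - mean) * (xs[i + k] - mean) for i in range(n - k)) / var

def index_of_dispersion(xs, alpha=1.0, k_max=5):
    # Equation (1): I = SCV * (1 + alpha * sum of ACF(k) for k in [1, k_max]).
    mean = statistics.fmean(xs)
    scv = statistics.pvariance(xs) / (mean ** 2)  # squared coefficient of variation
    return scv * (1 + alpha * sum(autocorrelation(xs, k)
                                  for k in range(1, k_max + 1)))
```

A constant trace yields I = 0, while a bursty trace with correlated runs of high demand yields a large I, which is why a single number suffices to flag the start and end of a spike period.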
  • The index of dispersion I may only be based on one aspect, such as, for example, working volume, or may be based on multiple aspects, such as, for example, a comprehensive equation based on:
      • 1. working volume size (WVS): the total amount of data (in bytes) accessed in the disk; or
      • 2. working set size (WSS): the total address range (in bytes) of accessed data, which is the unique set of WVS. A large working set covers more disk space.
  • Additionally, the following factors may also be included in consideration of workload spike detection:
      • 1. read/write ratio (RWR): the number of write I/Os divided by the total number of I/Os.
      • 2. sequential/random ratio (SRR): the amount (in bytes) of total sequential I/Os (both read and write) divided by the total I/O amount (in bytes). In general, SSDs have better performance under sequential I/Os than under random I/Os.
  • FIG. 2 is a diagram illustrating an example of workload spike detection, according to an embodiment of the present disclosure. As shown, the diagram has an x-axis of time and a y-axis of index of dispersion I, which provides a degree of a detected workload spike. However, the y-axis may be any type of spike measurement value. The ACLB categorizes workload spike degrees into three ranges: strong spike (ρS), weak spike (ρW), and non-spike (ρN), based on corresponding preset range thresholds. For example, [0, 30] may be set as a non-spike range, [30, 70] may be set as a weak spike range, and [70, 100] may be set as a strong spike range.
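The three-range categorization can be sketched as a simple threshold check. The thresholds come from the example in the text; treating each boundary as exclusive of the lower range is an assumption, since the example ranges share their endpoints:

```python
# Hypothetical thresholds from the example: [0, 30) non-spike,
# [30, 70) weak spike, [70, 100] strong spike.
NON_SPIKE_MAX = 30
WEAK_SPIKE_MAX = 70

def classify_spike(degree):
    # Map a spike measurement (e.g., the index of dispersion I) to one of the
    # three ACLB ranges: non-spike, weak spike, or strong spike.
    if degree < NON_SPIKE_MAX:
        return "non-spike"
    if degree < WEAK_SPIKE_MAX:
        return "weak"
    return "strong"
```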
  • For cases involving the detection of both strong and weak spikes, the ACLB uses a join-shortest-queue (JSQ)-based runtime random-greedy algorithm. Specifically, during runtime, the ACLB selects the top K number of SSDs with the shortest queues, and conducts random allocation among these top K SSDs. For cases involving no spike detection, an optimization framework calculation is performed.
  • Referring now to FIG. 3, a flowchart illustrates a method for detecting and compensating for a workload spike, according to an embodiment of the present disclosure. A workload spike detector 302 is used to detect the existence and magnitude of a workload spike. The workload spike detector may detect a strong spike in block 304, a weak spike in block 306, or may not detect a spike in block 308. If a strong spike is detected in block 304, the JSQ-based runtime random-greedy algorithm 310 is used in accordance with a large range (KH) of the idlest, or least busy SSDs, in block 312. If a weak spike is detected in block 306, the JSQ-based runtime random-greedy algorithm 310 is used in accordance with a small range (KL) of the idlest SSDs, in block 314. If no spike is detected in block 308, the optimization framework calculation is performed, in block 316.
  • FIG. 4 is a diagram illustrating the dispatching of jobs of an I/O stream based on the workload spike, according to an embodiment of the present disclosure. A dispatcher 402 dispatches I/O streams from a dispatcher queue 404 to SSDs 406. In detail, under the non-spike scenario, the ACLB selects the disk for the I/O based on the optimization framework. Under the strong spike scenario, the ACLB selects a large range (KH) of the idlest SSDs and conducts random allocation among these selected SSDs for the I/O job. Under the weak spike scenario, the ACLB selects a small range (KL) of the idlest SSDs and conducts random allocation among these selected SSDs for the I/O job. Full optimization takes a greater amount of time and is therefore not suitable for spike cases.
  • FIG. 5 is a flowchart illustrating a method for adaptive cache load balancing, according to an embodiment of the present disclosure. Upon determining to begin the methodology in steps 502 and 504, it is determined, for each I/O stream, whether the current time is the beginning of an epoch, in step 506. If the current time is the beginning of an epoch, load information from all SSDs is updated, in step 508. If time divided by epoch length results in a 0 remainder, it is the epoch boundary and new load data is obtained. Specifically, if it is the beginning of a new epoch, the ACLB will collect queue information from all SSDs by calling the function updateLoadInfo( ), in step 508. If the current time is not the beginning of an epoch, the methodology proceeds directly to step 510 using the load data that was already obtained at the start of the current epoch.
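The epoch-boundary test in step 506 reduces to a modulo check; the time units (e.g., seconds versus sample ticks) are an assumption:

```python
def is_epoch_boundary(now, epoch_length):
    # Step 506: a zero remainder of time divided by epoch length marks the
    # boundary at which updateLoadInfo() refreshes per-SSD queue data.
    return now % epoch_length == 0
```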
  • In step 510, a degree of a detected workload spike is determined. In step 512, it is determined whether the degree of the workload spike is within the strong spike range ρS. If the spike degree is within the strong spike range ρS, a JSQ-based runtime random-greedy algorithm is used in accordance with a large range (KH) of the idlest SSDs, in step 514. Referring back to the embodiment of FIG. 4, the total number of available SSDs is 10 (K=10). In cases involving the detection of a strong spike, the number of best SSD candidates KH should be set close to (or equal to) the total number of available SSDs. As shown in FIG. 4, KH=8.
  • Specifically, in the JSQ-based runtime random-greedy algorithm, the ACLB conducts random allocation among the top KH SSDs. FIG. 6 is a flowchart illustrating a method for selection of SSDs when there is a strong workload spike, according to an embodiment of the present disclosure. After determining that the dispatcher queue is not empty and the current time is not the beginning of an epoch, in step 602, the ACLB ascendingly sorts SSDs by the number of active I/O streams that are queued, randomly picks an SSD in the KH idlest SSDs to assign the current job, and removes the job from the dispatcher queue 404, in step 604.
  • Since KH is close to the total number of SSDs, ACLB has a behavior similar to a random assignment method, which allows the spike workload to be shared among a large number of top idlest SSDs, thereby alleviating the imbalance of load.
  • Referring back to FIG. 5, if it is determined in step 512 that the spike degree is not within the strong spike range ρS, it is determined whether the spike degree is within the weak spike range ρW, in step 516. If it is determined that the spike degree is within the weak spike range ρW, the JSQ-based runtime random-greedy algorithm is used in accordance with a small range (KL) of the idlest SSDs, in step 518. This algorithm functions in the same manner as described in FIG. 6, only using KL SSDs instead of KH SSDs. When it is not the beginning of each epoch, the ACLB allocates the head I/O request from the dispatcher queue randomly to the top K idlest SSDs. Once complete, the ACLB removes the I/O request from the dispatcher queue.
  • In cases involving the detection of a weak spike, the number of best SSD candidates KL is set to a lower bound of the range. For example, as shown in FIG. 4, KL=3. Also of note, if KL=1, ACLB performs exactly the same as the “greedy” load balancer, which always selects the idlest SSD with the shortest queue length, and thus, achieves good performance in terms of waiting time.
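The JSQ-based runtime random-greedy selection described above (used with K = KH for strong spikes and K = KL for weak spikes) can be sketched as follows; the function name and queue-length representation are illustrative:

```python
import random

def jsq_random_greedy(ssd_queue_lengths, k):
    # Sort SSD indices ascending by the number of active queued I/O streams,
    # then pick one uniformly at random among the k idlest (FIG. 6, step 604).
    order = sorted(range(len(ssd_queue_lengths)),
                   key=lambda i: ssd_queue_lengths[i])
    return random.choice(order[:k])
```

With k = 1 this degenerates to the pure greedy balancer (always the shortest queue); with k close to the total number of SSDs it behaves like random assignment, spreading a spike across many idle devices.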
  • Referring back to FIG. 5, if it is determined in step 516 that the spike degree is not within the weak spike range ρW, it is determined whether the spike degree is within the non-spike range (ρN), in step 520. If it is determined that the spike degree is in the non-spike range (ρN), a runtime optimization framework is performed over all K SSDs, in step 522. If it is determined that the spike degree is not within the non-spike range (ρN), the methodology returns to step 502 and repeats.
  • FIG. 7 is a flowchart illustrating a method for selection of SSDs when no workload spike is detected, according to an embodiment of the present disclosure. Specifically, in cases involving no workload spike detection, the ACLB has time to conduct full optimization, in which it strives to balance resources, such as, for example, throughput, bandwidth, space, and a worn-out level.
  • In step 702 of FIG. 7, it is determined whether the queue is empty and whether the current time is the beginning of an epoch. When the queue is not empty and the current time is not the beginning of an epoch, job j is set to the head of the dispatcher queue, in step 704. In step 706 it is determined whether the methodology has reached the end of the SSD list. If the methodology has reached the end of the SSD list, the methodology proceeds to step 716.
  • If the methodology has not reached the end of the SSD list, the ACLB calculates a coefficient of variation (CV) for an SSD based on each type of resource (not limited to the four types described below), in step 708.
  • First, resources upon which load balance is conducted are specified. Specifically, there are multiple types of resources that SSDs provide, such as:
      • 1. throughput (P): unit in I/O request per second (IOPS);
      • 2. bandwidth (B): unit in bytes per second (BPS);
      • 3. storage space (S): unit in bytes, related to workload address space, or working set size; and
      • 4. program/erase (P/E) life cycles (L): each cell has limited lifetime, different workload patterns (such as sequential ratio, read write ratio) will have different write amplification on the SSD.
  • Ideally usage of all these resources would be balanced among the SSDs. Furthermore, to support both homogeneous and heterogeneous SSDs, a percentage format is utilized. More factors may also be included in the consideration of load balancing.
  • Second, it must be determined how to balance the utilization rates of all resource types. The CV is used to score a degree of balance for each type of average resource utilization rate, as set forth below in Equations (2) through (6).
  • min:

  • ωP·CV(Util(P))+ωB·CV(Util(B))+ωS·CV(Util(S))+ωL·CV(Util(L))   (2)
  • subject to:

  • Util(P)∈[0, Pmax]  (3)

  • Util(B)∈[0, Bmax]  (4)

  • Util(S)∈[0, Smax]  (5)

  • Util(L)∈[0, Lmax]  (6)
  • Each type of resource may have a different weight based on the environment preference, as reflected by ωS. Pmax, Bmax, Smax, Lmax are preset upper bounds of a utilization ratio for each type of resource. Average utilization can be in a monitoring window, or an exponentially weighted moving average (EWMA) window that averages the data in a way that gives less and less weight to data as they are further removed in time.
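A sketch of the weighted-CV objective of Equations (2) through (6) follows. The resource names, weights, and bounds are illustrative, and returning −1 on a bound violation anticipates the skip behavior of step 710:

```python
import statistics

def cv(values):
    # Coefficient of variation: stddev / mean, over per-SSD utilization rates.
    mean = statistics.fmean(values)
    return statistics.pstdev(values) / mean if mean else 0.0

def balance_score(utils, weights, upper_bounds):
    # utils maps a resource type (e.g., "P", "B", "S", "L") to the list of
    # per-SSD utilization ratios as they would be if the candidate job were
    # placed on the SSD under evaluation.
    # Returns -1 when any utilization exceeds its preset upper bound
    # (constraints (3)-(6)); otherwise the weighted CV sum of Equation (2).
    score = 0.0
    for rtype, values in utils.items():
        if max(values) > upper_bounds[rtype]:
            return -1
        score += weights[rtype] * cv(values)
    return score
```

A perfectly balanced pool scores 0; larger scores mean a more skewed distribution of resource usage, so the ACLB prefers the placement that minimizes this value.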
  • Accordingly, the ACLB is required to know the runtime usage information of the storage pool, such as, for example, throughput, bandwidth, storage, and P/E cycle.
  • Throughput, bandwidth, and storage can be obtained by calling APIs of the hypervisor (112 of FIG. 1). However, for the P/E cycle, SSD vendors usually do not provide APIs to allow users to check real physical write amount. Thus, the actual physical write amount is estimated to calculate the usage of the P/E cycle based on a history record of jobs dispatched from the job scheduler. This can be estimated by using write amplification function (WAF) models, as set forth in Equation (7).

  • λP = λL · WA(s, j)   (7)
  • where λP and λL are the physical and logical write amounts (in bytes), respectively. The logical write amount is known from the application side. WA(s, j) is the write amplification function, which takes the SSD and the new I/O stream to be assigned to this SSD, and returns the write amplification factor.
  • It may be costly to continuously pull the throughput, bandwidth, and storage from a storage pool with a large number of SSDs. Therefore, information is periodically pulled.
  • Referring back to FIG. 7, in step 710, it is determined whether CV returns a value of −1, which indicates that any one of the resources exceeds its corresponding resource upper bound. If a value of −1 is returned for the CV, the SSD is skipped and a next SSD is selected in step 712. If the resources do not exceed the resource upper bound, a result of the CV equation is added into a CV_Rec vector that stores results of the equation, in step 714, before selecting a next SSD, in step 712, and returning to step 706. The CV_Rec vector is defined as <CV(Util(ResourceType1)), CV(Util(ResourceType2)), CV(Util(ResourceType3)) . . . >. Upon determining the end of the SSD list, in step 706, the ACLB picks the minimal CV result (as also shown in Equation (2)) and corresponding SSD for assignment of the job j, and also removes the job from the dispatcher queue, in step 716. The methodology then returns to step 702 to repeat.
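The selection loop of FIG. 7 (steps 706 through 716) can be sketched as follows; score_fn is a stand-in for the weighted-CV evaluation of Equation (2), and the return of None when no SSD is feasible is an assumption not spelled out in the flowchart:

```python
def select_ssd(ssds, job, score_fn):
    # Steps 706-716: score each SSD for the candidate job, skipping any SSD
    # whose score is -1 (a resource exceeds its upper bound, step 710), then
    # assign the job to the SSD with the minimal recorded score (step 716).
    cv_rec = {}                     # the CV_Rec record of per-SSD scores
    for s in ssds:
        score = score_fn(s, job)
        if score == -1:             # bound violation: skip this SSD (step 712)
            continue
        cv_rec[s] = score           # step 714
    if not cv_rec:
        return None                 # no feasible SSD in this pass
    return min(cv_rec, key=cv_rec.get)
```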
  • Referring now to FIG. 8, a block diagram illustrates an illustrative hardware implementation of a computing system in accordance with which one or more components/methodologies of the disclosure (e.g., components/methodologies described in the context of FIGS. 1-7) may be implemented. As shown, the computer system may be implemented in accordance with a processor 810, a memory 812, input/output (I/O) devices 814, and a network interface 816, coupled via a computer bus 818 or alternate connection arrangement.
  • It is to be appreciated that the term “processor”, as used herein, is intended to include any processing device, such as, for example, one that includes, but is not limited to, a central processing unit (CPU) and/or other processing circuitry. It is also to be understood that the term “processor” may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices.
  • The term “memory”, as used herein, is intended to include memory associated with a processor or CPU, such as, for example, random access memory (RAM), read only memory (ROM), a fixed memory device (e.g., hard drive), a removable memory device, and flash memory.
  • In addition, the phrase “input/output devices” or “I/O devices”, as used herein, is intended to include, for example, one or more input devices for entering information into the processor or processing unit, and/or one or more output devices for outputting information associated with the processing unit.
  • Still further, the phrase “network interface”, as used herein, is intended to include, for example, one or more transceivers to permit the computer system to communicate with another computer system via an appropriate communications protocol. This may provide access to other computer systems.
  • Software components, including instructions or code, for performing the methodologies described herein may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and executed by a CPU.
  • The present disclosure may be utilized in conjunction with the manufacture of integrated circuits, which are considered part of the methods and apparatuses described herein.
  • Embodiments of the present disclosure detect and predict workload change (i.e., I/O spike prediction), and provide three different strategies for different spike degrees to better balance the loads across SSDs to improve performance and extend lifetime. Workload is balanced transparently to the user, VM I/O performance is improved, and SSD lifetime is extended.
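The three-strategy selection described above can be sketched as follows. The dispersion thresholds and the fractions defining the "large" and "small" ranges of idlest SSDs are illustrative assumptions, not values from the disclosure.

```python
import random

# Hypothetical spike-degree thresholds on the index of dispersion.
STRONG_SPIKE, WEAK_SPIKE = 2.0, 1.2

def jsq_random_greedy(queue_depths, fraction):
    """JSQ-based runtime random-greedy: rank SSDs by the number of
    queued active I/O streams, then pick one at random from the
    idlest `fraction` of the ranking."""
    ranked = sorted(queue_depths, key=queue_depths.get)
    k = max(1, int(len(ranked) * fraction))
    return random.choice(ranked[:k])

def pick_strategy(dispersion):
    """Map the detected spike degree to a distribution strategy."""
    if dispersion >= STRONG_SPIKE:
        return ("jsq", 0.5)            # strong spike: large idlest range
    if dispersion >= WEAK_SPIKE:
        return ("jsq", 0.1)            # weak spike: small idlest range
    return ("optimization", None)      # non-spike: CV optimization framework
```

Using a wider random range during strong spikes spreads the burst across more SSDs, while the optimization framework handles steady traffic more precisely.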
  • Although certain embodiments of the present disclosure have been described in the detailed description of the present disclosure, the present disclosure may be modified in various forms without departing from the scope of the present disclosure. Thus, the scope of the present disclosure shall not be determined merely based on the described embodiments, but rather determined based on the accompanying claims and equivalents thereto.

Claims (18)

What is claimed is:
1. An apparatus, comprising:
a memory; and
a processor configured to:
detect a degree of a change in a workload in an input/output stream received through a network from one or more user devices;
determine a degree range, from a plurality of preset degree ranges, that the degree of the change in the workload is within;
determine a distribution strategy, from among a plurality of distribution strategies, to distribute the workload across one or more of a plurality of solid state devices (SSDs) in a performance cache tier of a centralized multi-tier storage pool, based on the determined degree range; and
distribute the workload across the one or more of the plurality of solid state devices based on the determined distribution strategy.
2. The apparatus of claim 1, wherein each of the plurality of distribution strategies corresponds to a respective one of the plurality of preset degree ranges.
3. The apparatus of claim 2, wherein the plurality of preset degree ranges comprises a strong workload spike range, a weak workload spike range, and a non-spike range.
4. The apparatus of claim 3, wherein the plurality of distribution strategies comprises:
a join shortest queue (JSQ)-based runtime random-greedy algorithm used in accordance with a large range of idlest SSDs from the plurality of SSDs, which corresponds to the strong workload spike range;
a JSQ-based runtime random-greedy algorithm used in accordance with a small range of idlest SSDs from the plurality of SSDs, which corresponds to the weak workload spike range; and
an optimization framework calculation, which corresponds to the non-spike range.
5. The apparatus of claim 4, wherein, when the determined distribution strategy comprises the JSQ-based runtime random-greedy algorithm, the processor is further configured to:
sort the plurality of SSDs by a number of active I/O streams that are queued; and
randomly select an SSD from the large range of idlest SSDs or the small range of idlest SSDs for assignment of a job of the workload.
6. The apparatus of claim 4, wherein, when the determined distribution strategy comprises the optimization framework calculation, the processor is further configured to:
calculate a coefficient variation for each of the plurality of SSDs using a plurality of resources;
determine whether any one of the plurality of resources exceeds a respective upper bound for each of the plurality of SSDs based on the respective coefficient variation;
skip assignment to a given SSD, when any one of the plurality of resources exceeds the respective upper bound; and
choose an SSD with a minimal coefficient variation result for assignment of a job of the workload.
7. The apparatus of claim 1, wherein the degree of the change of the workload is calculated as an index of dispersion I:
$$ I = \mathrm{SCV}\left(1 + \alpha \cdot \sum_{k \in [k,\, K_{\max}]} \mathrm{ACF}(k)\right) $$
where SCV is a squared coefficient of variation and ACF(k) is an autocorrelation function at lag k.
8. The apparatus of claim 1, wherein the degree of the change of the workload is determined based on working volume, working volume size, or working set size.
9. The apparatus of claim 8, wherein the degree of the change of the workload is further determined based on at least one of a read/write ratio and a sequential/random ratio.
10. A method, comprising:
detecting, by a processor of an application server layer, a degree of a change in a workload in an input/output stream received through a network from one or more user devices;
determining, by the processor, a degree range, from a plurality of preset degree ranges, that the degree of the change in the workload is within;
determining, by the processor, a distribution strategy, from among a plurality of distribution strategies, to distribute the workload across one or more of a plurality of solid state devices (SSDs) in a performance cache tier of a centralized multi-tier storage pool, based on the determined degree range; and
distributing, by the processor, the workload across the one or more of the plurality of solid state devices based on the determined distribution strategy.
11. The method of claim 10, wherein each of the plurality of distribution strategies corresponds to a respective one of the plurality of preset degree ranges.
12. The method of claim 11, wherein the plurality of preset degree ranges comprises a strong workload spike range, a weak workload spike range, and a non-spike range.
13. The method of claim 12, wherein the plurality of distribution strategies comprises:
a join shortest queue (JSQ)-based runtime random-greedy algorithm used in accordance with a large range of idlest SSDs from the plurality of SSDs, which corresponds to the strong workload spike range;
a JSQ-based runtime random-greedy algorithm used in accordance with a small range of idlest SSDs from the plurality of SSDs, which corresponds to the weak workload spike range; and
an optimization framework calculation, which corresponds to the non-spike range.
14. The method of claim 13, wherein, when the determined distribution strategy comprises the JSQ-based runtime random-greedy algorithm, the method further comprises:
sorting the plurality of SSDs by a number of active I/O streams that are queued; and
randomly selecting an SSD from the large range of idlest SSDs or the small range of idlest SSDs for assignment of a job of the workload.
15. The method of claim 13, wherein, when the determined distribution strategy comprises the optimization framework calculation, the method further comprises:
calculating a coefficient variation for each of the plurality of SSDs using a plurality of resources;
determining whether any one of the plurality of resources exceeds a respective upper bound for each of the plurality of SSDs based on the respective coefficient variation;
skipping assignment to a given SSD, when any one of the plurality of resources exceeds the respective upper bound; and
choosing an SSD with a minimal coefficient variation result for assignment of a job of the workload.
16. The method of claim 10, wherein the degree of the change of the workload is calculated as an index of dispersion I:
$$ I = \mathrm{SCV}\left(1 + \alpha \cdot \sum_{k \in [k,\, K_{\max}]} \mathrm{ACF}(k)\right) $$
where SCV is a squared coefficient of variation and ACF(k) is an autocorrelation function at lag k.
17. The method of claim 10, wherein the degree of the change of the workload is determined based on working volume, working volume size, or working set size.
18. The method of claim 17, wherein the degree of the change of the workload is further determined based on at least one of a read/write ratio and a sequential/random ratio.
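The index of dispersion recited in claims 7 and 16 can be computed over a window of observations roughly as follows. The helper names, the default lag range, and the weight α are assumptions for illustration; the disclosure does not fix these values.

```python
import statistics

def acf(xs, k):
    """Sample autocorrelation of the series xs at lag k."""
    n = len(xs)
    mean = statistics.mean(xs)
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    return sum((xs[i] - mean) * (xs[i + k] - mean) for i in range(n - k)) / var

def index_of_dispersion(xs, alpha=1.0, k_min=1, k_max=10):
    """I = SCV * (1 + alpha * sum of ACF(k) over the lag range),
    where SCV is the squared coefficient of variation."""
    mean = statistics.mean(xs)
    scv = statistics.pvariance(xs) / (mean ** 2)
    return scv * (1 + alpha * sum(acf(xs, k) for k in range(k_min, k_max + 1)))
```

A larger index indicates burstier arrivals (a stronger workload spike); a constant series yields zero.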
US15/971,349 2017-12-22 2018-05-04 Method and apparatus for adaptive cache load balancing for ssd-based cloud computing storage system Abandoned US20190196969A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US15/971,349 US20190196969A1 (en) 2017-12-22 2018-05-04 Method and apparatus for adaptive cache load balancing for ssd-based cloud computing storage system
KR1020180143031A KR20190084203A (en) 2017-12-22 2018-11-19 Method and apparatus for adaptive cache load balancing for ssd-based cloud computing storage system
CN201811441364.8A CN109962969B (en) 2017-12-22 2018-11-29 Method and device for self-adaptive cache load balancing of cloud computing storage system
US17/006,285 US11403220B2 (en) 2017-12-22 2020-08-28 Method and apparatus for adaptive cache load balancing for SSD-based cloud computing storage system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762609871P 2017-12-22 2017-12-22
US15/971,349 US20190196969A1 (en) 2017-12-22 2018-05-04 Method and apparatus for adaptive cache load balancing for ssd-based cloud computing storage system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/006,285 Continuation US11403220B2 (en) 2017-12-22 2020-08-28 Method and apparatus for adaptive cache load balancing for SSD-based cloud computing storage system

Publications (1)

Publication Number Publication Date
US20190196969A1 true US20190196969A1 (en) 2019-06-27

Family

ID=66951163

Family Applications (2)

Application Number Title Priority Date Filing Date
US15/971,349 Abandoned US20190196969A1 (en) 2017-12-22 2018-05-04 Method and apparatus for adaptive cache load balancing for ssd-based cloud computing storage system
US17/006,285 Active 2038-05-10 US11403220B2 (en) 2017-12-22 2020-08-28 Method and apparatus for adaptive cache load balancing for SSD-based cloud computing storage system

Family Applications After (1)

Application Number Title Priority Date Filing Date
US17/006,285 Active 2038-05-10 US11403220B2 (en) 2017-12-22 2020-08-28 Method and apparatus for adaptive cache load balancing for SSD-based cloud computing storage system

Country Status (3)

Country Link
US (2) US20190196969A1 (en)
KR (1) KR20190084203A (en)
CN (1) CN109962969B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200174695A1 (en) * 2018-12-03 2020-06-04 Western Digital Technologies, Inc. Storage System and Method for Stream Management in a Multi-Host Virtualized Storage System
US11016889B1 (en) 2019-12-13 2021-05-25 Seagate Technology Llc Storage device with enhanced time to ready performance
US11061611B2 (en) * 2019-02-21 2021-07-13 International Business Machines Corporation Dynamically altered data distribution workload on a storage system
US11144226B2 (en) 2019-04-11 2021-10-12 Samsung Electronics Co., Ltd. Intelligent path selection and load balancing
US11216190B2 (en) 2019-06-10 2022-01-04 Samsung Electronics Co., Ltd. Systems and methods for I/O transmissions in queue pair-based NVMeoF initiator-target system
US11240294B2 (en) 2019-08-23 2022-02-01 Samsung Electronics Co., Ltd. Systems and methods for spike detection and load balancing resource management
US20220329478A1 (en) * 2021-04-09 2022-10-13 At&T Intellectual Property I, L.P. Adaptive spare equipment allocation techniques
US20220374149A1 (en) * 2021-05-21 2022-11-24 Samsung Electronics Co., Ltd. Low latency multiple storage device system
US11782624B2 (en) 2020-10-06 2023-10-10 Samsung Electronics Co., Ltd. Worflow-based partition allocation

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111221477B (en) * 2020-01-10 2023-08-22 烽火云科技有限公司 OSD (on Screen display) disc distribution method and system
CN111708492B (en) * 2020-06-10 2023-04-25 深圳证券通信有限公司 Data distribution uneven adjustment method based on ceph
KR20230116470A (en) 2022-01-28 2023-08-04 경북대학교 산학협력단 Operating method of hybird index system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080092143A1 (en) * 2006-09-29 2008-04-17 Hideyuki Koseki Storage apparatus and load balancing method
US20090271515A1 (en) * 2008-04-28 2009-10-29 Arun Kwangil Iyengar Method and Apparatus for Load Balancing in Network Based Telephony Application
US20150212840A1 (en) * 2014-01-30 2015-07-30 International Business Machines Corporation Optimized Global Capacity Management in a Virtualized Computing Environment
US20160269501A1 (en) * 2015-03-11 2016-09-15 Netapp, Inc. Using a cache cluster of a cloud computing service as a victim cache
US20170024316A1 (en) * 2015-07-23 2017-01-26 Qualcomm Incorporated Systems and methods for scheduling tasks in a heterogeneous processor cluster architecture using cache demand monitoring
US20170091032A1 (en) * 2015-09-24 2017-03-30 International Business Machines Corporation Generating additional slices based on data access frequency

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7321926B1 (en) 2002-02-11 2008-01-22 Extreme Networks Method of and system for allocating resources to resource requests
US8296772B2 (en) * 2007-07-02 2012-10-23 International Business Machines Corporation Customer information control system workload management based upon target processors requesting work from routers
US20110173626A1 (en) * 2010-01-12 2011-07-14 Nec Laboratories America, Inc. Efficient maintenance of job prioritization for profit maximization in cloud service delivery infrastructures
CN102300330B (en) * 2011-08-05 2014-05-14 郑侃 Downlink resource scheduling method
US8930592B2 (en) 2013-02-13 2015-01-06 Vmware, Inc. Multipath load balancing optimizations for alua storage systems
EP3079060B1 (en) 2015-04-08 2018-03-28 Huawei Technologies Co., Ltd. Load balancing for large in-memory databases
US10599352B2 (en) 2015-08-14 2020-03-24 Samsung Electronics Co., Ltd. Online flash resource allocation manager based on a TCO model
US10120817B2 (en) 2015-09-30 2018-11-06 Toshiba Memory Corporation Device and method for scheduling commands in a solid state drive to reduce peak power consumption levels

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200174695A1 (en) * 2018-12-03 2020-06-04 Western Digital Technologies, Inc. Storage System and Method for Stream Management in a Multi-Host Virtualized Storage System
US11182101B2 (en) * 2018-12-03 2021-11-23 Western Digital Technologies, Inc. Storage system and method for stream management in a multi-host virtualized storage system
US11061611B2 (en) * 2019-02-21 2021-07-13 International Business Machines Corporation Dynamically altered data distribution workload on a storage system
US11144226B2 (en) 2019-04-11 2021-10-12 Samsung Electronics Co., Ltd. Intelligent path selection and load balancing
US11740815B2 (en) 2019-04-11 2023-08-29 Samsung Electronics Co., Ltd. Intelligent path selection and load balancing
US11216190B2 (en) 2019-06-10 2022-01-04 Samsung Electronics Co., Ltd. Systems and methods for I/O transmissions in queue pair-based NVMeoF initiator-target system
US11240294B2 (en) 2019-08-23 2022-02-01 Samsung Electronics Co., Ltd. Systems and methods for spike detection and load balancing resource management
US11016889B1 (en) 2019-12-13 2021-05-25 Seagate Technology Llc Storage device with enhanced time to ready performance
US11782624B2 (en) 2020-10-06 2023-10-10 Samsung Electronics Co., Ltd. Worflow-based partition allocation
US20220329478A1 (en) * 2021-04-09 2022-10-13 At&T Intellectual Property I, L.P. Adaptive spare equipment allocation techniques
US20220374149A1 (en) * 2021-05-21 2022-11-24 Samsung Electronics Co., Ltd. Low latency multiple storage device system

Also Published As

Publication number Publication date
US11403220B2 (en) 2022-08-02
CN109962969B (en) 2023-05-02
US20200394137A1 (en) 2020-12-17
CN109962969A (en) 2019-07-02
KR20190084203A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
US11403220B2 (en) Method and apparatus for adaptive cache load balancing for SSD-based cloud computing storage system
JP6219512B2 (en) Virtual hadoop manager
US8935500B1 (en) Distributed storage resource scheduler and load balancer
US8782322B2 (en) Ranking of target server partitions for virtual server mobility operations
US8874744B2 (en) System and method for automatically optimizing capacity between server clusters
US7685251B2 (en) Method and apparatus for management of virtualized process collections
JP5744909B2 (en) Method, information processing system, and computer program for dynamically managing accelerator resources
KR102028096B1 (en) Apparatus and method for isolation of virtual machine based on hypervisor
US20170017524A1 (en) Quality of service implementation in a networked storage system with hierarchical schedulers
US10884779B2 (en) Systems and methods for selecting virtual machines to be migrated
US20150026306A1 (en) Method and apparatus for providing virtual desktop service
US10496541B2 (en) Dynamic cache partition manager in heterogeneous virtualization cloud cache environment
US20140196054A1 (en) Ensuring performance of a computing system
CN106471473B (en) Mechanism for controlling server over-allocation in a data center
US10394608B2 (en) Prioritization of low active thread count virtual machines in virtualized computing environment
US20210273996A1 (en) Distributed resource management by improving cluster diversity
US10474383B1 (en) Using overload correlations between units of managed storage objects to apply performance controls in a data storage system
CN109558216B (en) Single root I/O virtualization optimization method and system based on online migration
US20240036756A1 (en) Systems, methods, and devices for partition management of storage resources
JP5879117B2 (en) Information processing system and operation management method
US9021499B2 (en) Moving a logical device between processor modules in response to identifying a varying load pattern
KR101691578B1 (en) Apparatus and method for collecting monitoring data of virtualized apparatus in virtualized environment and computer readable recording medium
Das et al. Mapreduce scheduler: A 360-degree view
WO2023039711A1 (en) Efficiency engine in a cloud computing architecture
CN115858150A (en) Network function energy-saving method based on Openstack virtualization

Legal Events

Date Code Title Description
AS Assignment

Owner name: SAMSUNG ELECTRONICS CO., LTD., KOREA, REPUBLIC OF

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:YANG, ZHENGYU;HOSEINZADEH, MORTEZA;EVANS, THOMAS DAVID;AND OTHERS;SIGNING DATES FROM 20180504 TO 20180507;REEL/FRAME:045997/0476

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: ADVISORY ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO PAY ISSUE FEE