WO2008149348A2 - Method architecture circuit & system for providing caching - Google Patents

Method architecture circuit & system for providing caching

Info

Publication number
WO2008149348A2
Authority
WO
WIPO (PCT)
Prior art keywords
cache
data
caching
requested
control logic
Prior art date
Application number
PCT/IL2008/000750
Other languages
French (fr)
Other versions
WO2008149348A3 (en)
Inventor
Yoav Etsion
Dror Feitelson
Original Assignee
Yissum Research Development Company Of The Hebrew University Of Jerusalem
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yissum Research Development Company Of The Hebrew University Of Jerusalem filed Critical Yissum Research Development Company Of The Hebrew University Of Jerusalem
Publication of WO2008149348A2 publication Critical patent/WO2008149348A2/en
Publication of WO2008149348A3 publication Critical patent/WO2008149348A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • G06F12/0895Caches characterised by their organisation or structure of parts of caches, e.g. directory or tag array

Definitions

  • the present invention generally relates to the fields of data processing. More specifically, the present invention relates to methods, architectures, circuits and systems for providing caching.
  • a cache is made up of a pool of entries. Each entry has a datum (a nugget of data) which is a copy of the datum in some backing store. Each entry also has a tag, which specifies the identity of the datum in the backing store of which the entry is a copy.
  • the cache client (a CPU, web browser, operating system) wishes to access a datum presumably in the backing store, it first checks the cache. If an entry can be found with a tag matching that of the desired datum, the datum in the entry is used instead. This situation is known as a cache hit. So, for example, a web browser program might check its local cache on disk to see if it has a local copy of the contents of a web page at a particular URL. In this example, the URL is the tag, and the content of the web page is the datum. The percentage of accesses that result in cache hits is known as the hit rate or hit ratio of the cache.
  • a datum When a datum is written to the cache, it must at some point be written to the backing store as well. The timing of this write is controlled by what is known as the write policy. In a write-through cache, every write to the cache causes a synchronous write to the backing store. [008] Alternatively, in a write-back (or write-behind) cache, writes are not immediately mirrored to the store. Instead, the cache tracks which of its locations have been written over (these locations are marked dirty). The data in these locations is written back to the backing store when those data are evicted from the cache. For this reason, a miss in a write-back cache (which requires a block to be replaced by another) may require two memory accesses to service: one to retrieve the needed datum, and one to write replaced data from the cache to the store.
  • Data write-back may be triggered by other policies as well.
  • the client may make many changes to a datum in the cache, and then explicitly notify the cache to write back the datum.
  • No-write allocation is a cache policy where only processor reads are cached, thus avoiding the need for write-back or write-through when the old value of the datum was absent from the cache prior to the write.
  • the data in the backing store may be changed by entities other than the cache, in which case the copy in the cache may become out-of-date or stale.
  • the client updates the data in the cache
  • copies of that data in other caches will become stale.
  • Communication protocols between the cache controllers which keep the data consistent are known as coherency protocols.
  • FIG. 1 shows two memories. Each location in each memory has a datum (a cache line), which in different designs ranges in size from 8 to 512 bytes. The size of the cache line is usually larger than the size of the usual access requested by a CPU instruction, which ranges from 1 to 16 bytes. Each location in each memory also has an index, which is a unique number used to refer to that location. The index for a location in main memory is called an address. Each location in the cache has a tag which contains the index of the datum in main memory which has been cached. In a
  • CPU's data cache these entries are called cache lines or cache blocks.
  • Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation look aside buffer used to speed up virtual-to-physical address translation for both executable instructions and data.
  • FIG. 2 there are shown two possible caching methodologies relating to the cache's replacement policy.
  • the replacement policy decides where in the cache a copy of a particular entry of main memory will go. If the replacement policy is free to choose any entry in the cache to hold the copy, the cache is called fully associative. At the other extreme, if each entry in main memory can go in just one place in the cache, the cache is direct mapped.
  • Many caches implement a compromise, and are described as set associative. For example, the level-1 data cache in an AMD Athlon is 2-way set associative, which means that any particular location in main memory can be cached in either of 2 locations in the level-1 data cache.
  • Associativity is a trade-off.
  • Direct mapped cache: the best (fastest) hit times, and so the best tradeoff for "large" caches
  • One of the advantages of a direct mapped cache is that it allows simple and fast speculation. Once the address has been computed, the one cache index which might have a copy of that datum is known. That cache entry can be read, and the processor can continue to work with that data before it finishes checking that the tag actually matches the requested address.
  • Direct-mapped caches are faster and consume less energy than set-associative caches typically used in L1 caches. However, they are more susceptible to conflict misses than set-associative caches, thus suffering higher miss-rates and achieving lower performance. This deficiency led to abandoning direct-mapped L1 caches in favor of set-associative ones in practically all but embedded processors.
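  • By way of a non-limiting illustration only (the block size, line count and function names below are assumptions of this sketch, not taken from the patent), the mapping arithmetic behind the direct-mapped and set-associative organizations described above might be modeled in C as follows: a direct-mapped lookup checks exactly one line, while an N-way lookup must compare N tags per access.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 32u                       /* bytes per cache line (assumed)        */
#define NUM_LINES  512u                      /* 16KB direct-mapped cache -> 512 lines */
#define NUM_WAYS   2u                        /* ways in the set-associative variant   */
#define NUM_SETS   (NUM_LINES / NUM_WAYS)

typedef struct { uint32_t tag; bool valid; } line_t;

static line_t dm_cache[NUM_LINES];           /* direct-mapped                         */
static line_t sa_cache[NUM_SETS][NUM_WAYS];  /* 2-way set-associative                 */

/* Direct-mapped: exactly one candidate line per address, so speculation
 * can start as soon as the index is known.                              */
bool dm_hit(uint32_t addr)
{
    uint32_t block = addr / BLOCK_SIZE;
    uint32_t index = block % NUM_LINES;      /* the single possible slot */
    uint32_t tag   = block / NUM_LINES;
    return dm_cache[index].valid && dm_cache[index].tag == tag;
}

/* 2-way set-associative: two candidate lines must be checked per access. */
bool sa_hit(uint32_t addr)
{
    uint32_t block = addr / BLOCK_SIZE;
    uint32_t set   = block % NUM_SETS;
    uint32_t tag   = block / NUM_SETS;
    for (uint32_t way = 0; way < NUM_WAYS; way++)
        if (sa_cache[set][way].valid && sa_cache[set][way].tag == tag)
            return true;
    return false;
}
```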
  • the present invention is a method, architecture, circuit and system for providing caching of data.
  • a first cache which first cache may be part of a cache architecture including a second cache and cache control logic.
  • the first cache and cache architecture may be integrally associated with a processor (e.g. Central Processing Unit - CPU) and in such cases may be referred to as processor cache.
  • features of the first cache and of the overall caching architecture described in the present application may be functionally associated with any data caching application or any data caching client (e.g. CPU, disk cache, etc.) known today or to be devised in the future.
  • the first cache may be referred to as a bypass cache or bypass filter, and may buffer data that is retrieved from an external memory source such as a computing platform's main random access memory (“RAM”) or from a nonvolatile memory (“NVM”) device functionally associated with the computing platform.
  • the first cache may include a set of memory cells adapted to store both retrieved data and mapping information (e.g. addresses of the received data in the main memory) relating to the retrieved data.
  • the first cache may be operated according to any suitable caching technique or algorithm, known today or to be devised in the future, including:
  • 2-way set associative: for high-speed CPU caches where even PLRU is too slow.
  • the address of a new item is used to calculate one of two possible locations in the cache where it is allowed to go.
  • the LRU of the two is discarded. This requires one bit per pair of cache lines, to indicate which of the two was the least recently used.
  • LRU (Least Recently Used): discards the least recently used items first.
  • MRU (Most Recently Used): discards, in contrast to LRU, the most recently used items first.
  • Pseudo-LRU (PLRU): for caches with large associativity (generally >4 ways), the implementation cost of LRU becomes prohibitive. If a probabilistic scheme that almost always discards one of the least recently used items is sufficient, the PLRU algorithm can be used, which only needs one bit per cache item to work.
  • Direct-mapped cache: for the highest-speed CPU caches where even 2-way set associative caches are too slow. The address of the new item is used to calculate the one location in the cache where it is allowed to go. Whatever was there before is discarded.
  • LFU (Least Frequently Used): counts how often an item is needed. Those that are used least often are discarded first.
  • ARC (Adaptive Replacement Cache): constantly balances between LRU and LFU, to improve combined results.
  • the second cache which may also be referred to as long-term cache, may also be operated according to any caching technique or algorithm known today or to be devised in the future, with direct mapping being a preferred mode of operation.
  • the second cache may receive and store, either from the first cache or directly from main memory, data requested by a caching client such as a processor.
  • the second cache may include a set of memory cells adapted to store both received data and mapping information (e.g. addresses of the received data in the main memory) relating to the received data.
  • the data retrieved into the first cache may be retrieved from an external memory (RAM or NVM), which external memory may either be part of the main memory or main storage of a computing platform, or in the cases where the computing platform has a multilevel cache architecture, the external memory may be an external cache.
  • Cache control logic may coordinate the movement of data between: (1) the first and second caches, (2) the caches and the caching client (e.g. processor), and (3) between external data sources or memory (e.g. main memory or NVM) and either of the caches.
  • the cache control logic may use a probabilistic method/process or a periodic sampling method/process to determine when data requested by a functionally associated caching client should be stored in the second cache.
  • the method/process by which the caching logic determines when to store data in a given cache may be referred to as an insertion policy.
  • the caching logic may cause the requested data to be stored in the second cache, for example, by first determining whether to store the data in the second cache and then using a direct mapping algorithm which maps specific locations in the main memory with locations in the second cache.
  • the step of determining performed by the caching logic, or by any circuit functionally associated with the caching logic may be based on a probabilistic model, a simple example of which is: "one in every 10 blocks of requested data will be requested again and thus should be cached.”
  • the caching controller can use a random number generator, or the like, to determine (i.e. guess) whether the requested block will be requested again.
  • the cache control logic may use a predetermined sampling pattern when determining which requested block of data is to be cached in the second cache. For example, the control logic may use a counter and may store to long term cache each Nth (e.g. 5th, 7th, 10th, etc.) block of requested data.
  • the control logic may use a lookup table which may indicate which of one or more requested blocks of data in a set of requested data blocks is to be placed in the second cache.
  • the lookup table may include entries indicating insertion into the second cache of the 2nd, 14th, 17th and 25th requested data block from each group/set of 30 data block requests by a caching client.
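  • As a rough sketch of the insertion policies just described (the function names, constants and the software model itself are illustrative assumptions, not the patent's implementation), each variant answers the same question: should the currently requested block be promoted into the second (long-term) cache?

```c
#include <stdbool.h>
#include <stdlib.h>

/* Probabilistic insertion: promote with a fixed probability P,
 * e.g. "one in every 10 blocks" corresponds to P = 0.10.
 * rand() stands in for whatever random number source the controller uses. */
bool insert_probabilistic(double p)
{
    return ((double)rand() / (double)RAND_MAX) < p;
}

/* Periodic insertion: promote every Nth requested block. */
bool insert_every_nth(unsigned n)
{
    static unsigned counter = 0;
    return (++counter % n) == 0;
}

/* Pattern-based insertion: a lookup table marks which requests in each
 * group of 30 are promoted (here the 2nd, 14th, 17th and 25th, 0-based). */
#define REQUEST_GROUP 30
static const bool promote_pattern[REQUEST_GROUP] = {
    [1] = true, [13] = true, [16] = true, [24] = true
};

bool insert_by_pattern(void)
{
    static unsigned position = 0;
    bool promote = promote_pattern[position];
    position = (position + 1) % REQUEST_GROUP;
    return promote;
}
```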
  • the first cache may be operated using an LRU algorithm and may include a buffer portion or sub-buffer (hereinafter referred to as a "sub-buffer") adapted to store information relating to recent activity in the first cache, for example - recent memory writes to the first cache and recent memory reads from the first cache.
  • the buffer portion or sub-buffer may be updated each time the first cache is operated.
  • the sub-buffer may be checked first for the given data, or for any information relating to the given data which may reduce the time, power, or any other overheads associated with searching for the data in the entire first cache.
  • the sub-buffer which may also be referred to as a recent activity sub-buffer may include information relating to recent data read/write activity in the first cache, and may identify the recently operated upon data (written or read) and its location (e.g. pointer) in the first cache. If the given data, or information about its location in the first cache, is found in the sub-buffer, further examination of the first cache may be avoided. If, however, information about the given data is not found in the sub-buffer, further examination or scanning of the first cache may be required for the given data.
  • the caching logic may first scan the first and/or the second caches so as to determine whether a copy of any of the one or more requested data blocks may be present in the first or second caches. According to some embodiments of the present invention, the caching logic may first scan the second (long-term) cache. If the requested data is found in the second cache, no further scanning may be required. If the data is not found in the second cache, the caching logic may proceed to scan the first cache. Scanning of the second and first caches may be performed by any scanning methodology known today or to be devised in the future.
  • the first and/or the second caches may be scanned using a recent activity sub-buffer, as described above. If the requested data is found in the first cache, the data may be provided to the caching client and the data may also be moved from the first cache to the second (long-term) cache based on an outcome of a probabilistic process such as the one described above.
  • the caching logic may cause the requested data to be retrieved from the main memory, through an interface to external memory, and stored either in the first or second caches.
  • the caching logic may determine whether to store the data retrieved from main memory using a probabilistic process, such as the ones described above, or by using any other probabilistically based process known today or to be devised in the future, or by any sampling processes known today or to be devised in the future.
  • the caching logic may determine whether to store the data retrieved from main memory using a predefined sampling pattern, such as the ones described above, or by using any other predefined sampling pattern process known today or to be devised in the future.
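  • Tying these steps together, one possible (purely illustrative) rendering of the lookup order described above is sketched below; the helper functions are assumed to exist and are not defined by the patent. The long-term cache is scanned first, then the filter (via its recent-activity sub-buffer), and only on a miss in both is the block fetched from external memory and placed, per the insertion policy, in either cache.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct block block_t;                     /* opaque cached data block          */

extern block_t *longterm_lookup(uint32_t addr);   /* second, long-term (direct-mapped) cache                    */
extern block_t *filter_lookup(uint32_t addr);     /* first cache, checking its recent-activity sub-buffer first */
extern block_t *memory_fetch(uint32_t addr);      /* interface to external memory      */
extern void     longterm_insert(uint32_t addr, block_t *b);
extern void     filter_insert(uint32_t addr, block_t *b);
extern bool     insert_probabilistic(double p);   /* e.g. the coin-toss sketch above   */

block_t *cache_read(uint32_t addr)
{
    block_t *b = longterm_lookup(addr);           /* 1. scan the long-term cache        */
    if (b)
        return b;

    b = filter_lookup(addr);                      /* 2. scan the bypass filter          */
    if (b) {
        if (insert_probabilistic(0.05))           /* coin toss decides promotion;       */
            longterm_insert(addr, b);             /*    removal from the filter omitted */
        return b;
    }

    b = memory_fetch(addr);                       /* 3. miss in both: go to memory      */
    if (insert_probabilistic(0.05))
        longterm_insert(addr, b);                 /* sampled straight into the cache    */
    else
        filter_insert(addr, b);                   /* otherwise buffered in the filter   */
    return b;
}
```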
  • FIG. 1 is a diagram illustrating the basic correlation between a Main memory and a Cache memory, according to prior art
  • FIG. 2 is a diagram illustrating the basic operation of a Direct Mapped cache and a Fully Associative cache, according to prior art;
  • FIG. 3A and FIG. 3B are block diagrams illustrating the functional building blocks of exemplary cache architectures, in accordance with some embodiments of the present invention.
  • FIG. 4A and FIG. 4B are flow charts illustrating the data flow in an exemplary caching process, in accordance with some embodiments of the present invention.
  • FIG. 5 is a diagram presenting the distributions of the residency lengths and those of the references serviced by each residency length for 4 select SPEC2000 benchmarks using a 16K direct-mapped cache, in accordance with some embodiments of the present invention. The figure shows the distributions of both data and instruction streams;
  • FIG. 6 is a diagram of an exemplary L1 cache design that uses Bernoulli trials to distinguish frequently used blocks from transient ones, in accordance with some embodiments of the present invention;
  • FIG. 7 is a diagram of the structure of a fully associative cache, according to prior art
  • FIG. 8 is a diagram of an exemplary structure of a fully associative cache augmented by a sub-buffer (dubbed SLB in the figure) used to store recent activity information, in accordance with some embodiments of the present invention
  • FIG. 9 is a diagram presenting an exemplary comparison of improvements in SPEC2000 instruction and data miss-rate distributions, using various sampling probabilities, for a 16K-DM cache, in accordance with some embodiments of the present invention.
  • the boxes represent the 25%-75% percentile range, and the whiskers indicate the min/max values. Inside the box are the average (circle) and median (horizontal line);
  • FIG. 10 is a diagram presenting an exemplary comparison of the data references' mass distributions in the filtered cache structure and the regular cache structure for select SPEC benchmarks using the ref input, for both data (top) and instruction (bottom), in accordance with some embodiments of the present invention.
  • FIG. 11 is an exemplary diagram presenting the percent of references serviced by the cache vs. the percent of blocks transferred from the filter into the cache for varying sampling probabilities (averages over all benchmarks), in accordance with some embodiments of the present invention.
  • the horizontal lines in the top figure indicate the asymptotic maximum of references serviced by the cache.
  • Cache size was 16K, with a 2K filter;
  • FIG. 12 is an exemplary diagram presenting the distributions of filter access depth for all SPEC2000 benchmarks, and the average distribution, in accordance with some embodiments of the present invention. Note that the vast majority of accesses are focused around the MRU position (the specific behavior of each benchmark is irrelevant in this context, and the benchmarks are thus not individually marked);
  • FIG. 13 is an exemplary diagram presenting the IPC improvement achieved by a random sampling cache over a similar size 4-way set-associative cache for the SPEC benchmarks, in accordance with some embodiments of the present invention;
  • FIG. 14 is an exemplary diagram presenting the average IPC improvement for 16K and 32K direct-mapped filtered caches over common cache configurations, in accordance with some embodiments of the present invention.
  • FIG. 15 is an exemplary diagram presenting the Relative power consumption of the random sampling cache, compared to common cache designs (lower is better), for a 70nm process, in accordance with some embodiments of the present invention.
  • Embodiments of the present invention may include apparatuses for performing the operations herein. Such apparatus may be specially constructed for the desired purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer.
  • Such a computer program may be stored in a computer readable storage medium, such as, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, readonly memories (ROMs), random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read only memories (EEPROMs), magnetic or optical cards, non-volatile solid state memories (FLASH), or any other type of media suitable for storing electronic instructions, and capable of being coupled to a computer system bus.
  • a computer readable storage medium such as, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, readonly memories (ROMs), random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read only memories (EEPROMs), magnetic or optical cards, non-volatile solid state memories (FLASH), or any other type of
  • the present invention is a method, architecture, circuit and system for providing caching of data.
  • a first cache which first cache may be part of a cache architecture including a second cache and cache control logic.
  • the first cache and cache architecture may be integrally associated with a processor (e.g. Central Processing Unit - CPU) and in such cases may be referred to as processor cache.
  • features of the first cache and of the overall caching architecture described in the present application may be functionally associated with any data caching application or any data caching client (e.g. CPU, disk cache, etc.) known today or to be devised in the future.
  • the cache arrangements 300 may include a first cache 320, a second cache 330, a cache controller or caching logic 310, and an interface to external memory 312.
  • the first cache 320 may be a fully associative cache, and may include a recent activity sub-buffer.
  • the cache controller may be adapted to operate based on a probabilistic insertion policy (Fig. 3A) or a periodic (based on predefined sampling pattern) insertion policy (Fig. 3B), and may thus include or be functionally associated with a random number generator 314 (Fig. 3A) or with a circuit functionally equivalent to a random number generator, or with a counter or lookup table (Fig. 3B).
  • the first cache 320 may be referred to as a bypass cache or bypass filter, and may buffer data that is retrieved from an external memory source such as a computing platform's main random access memory (“RAM”) or from a non-volatile memory (“NVM”) device functionally associated with the computing platform.
  • the first cache 320 may include a set of memory cells adapted to store both retrieved data and mapping information (e.g. addresses of the received data in the main memory) relating to the retrieved data.
  • the first cache 320 may be operated according to any suitable caching technique or algorithm, known today or to be devised in the future, including:
  • 2-way set associative: for high-speed CPU caches where even PLRU is too slow.
  • the address of a new item is used to calculate one of two possible locations in the cache where it is allowed to go.
  • the LRU of the two is discarded. This requires one bit per pair of cache lines, to indicate which of the two was the least recently used.
  • LRU (Least Recently Used): discards the least recently used items first.
  • MRU (Most Recently Used): discards, in contrast to LRU, the most recently used items first.
  • Pseudo-LRU (PLRU): for caches with large associativity (generally >4 ways), the implementation cost of LRU becomes prohibitive. If a probabilistic scheme that almost always discards one of the least recently used items is sufficient, the PLRU algorithm can be used, which only needs one bit per cache item to work.
  • Direct-mapped cache: for the highest-speed CPU caches where even 2-way set associative caches are too slow. The address of the new item is used to calculate the one location in the cache where it is allowed to go. Whatever was there before is discarded.
  • LFU (Least Frequently Used): counts how often an item is needed. Those that are used least often are discarded first.
  • ARC (Adaptive Replacement Cache): constantly balances between LRU and LFU, to improve combined results.
  • the second cache 330 which may also be referred to as long-term cache, may also be operated according to any caching technique or algorithm known today or to be devised in the future, with direct mapping being a preferred algorithm.
  • the second cache 330 may receive and store, either from the first cache, directly from main memory or from any other data storage device, data requested by a caching client 340 such as a processor.
  • the second cache 330 may include a set of memory cells adapted to store both received data and mapping information (e.g. addresses of the received data in the main memory) relating to the received data.
  • the data retrieved into the first cache 320 may be retrieved from an external memory (RAM or NVM), which external memory may either be part of the main memory or main storage of a computing platform, or in the cases where the computing platform has a multilevel cache architecture, the external memory may be an external cache.
  • Cache control logic 310 (e.g. a dedicated cache controller or cache control portion of a processor controller) may coordinate the movement of data between: (1) the first and second caches, (2) the caches and the caching client (e.g. processor), and (3) between external data sources or memory (e.g. main memory or NVM) and either of the caches.
  • the cache control logic 310 may use a probabilistic method/process or a periodic sampling method/process to determine when data requested by a functionally associated caching client 340 should be stored in the second cache 330.
  • the method/process by which the caching logic 310 determines when to store data in a given cache may be referred to as an insertion policy.
  • the caching logic 310 may cause the requested data to be stored in the second cache 330, for example, by first determining whether to store the data in the second cache 330 and then using a direct mapping algorithm which maps specific locations in the main memory with locations in the second cache 330.
  • the step of determining performed by the caching logic, or by any circuit functionally associated with the caching logic 314, may be based on a probabilistic model, a simple example of which is: "one in every 10 blocks of requested data will be requested again and thus should be cached.”
  • the caching controller 310 can use a random number generator 314, or the like, or a periodic sampling process to determine (i.e. guess) whether the requested block will be requested again.
  • if the cache control logic 310 so determines (guesses), the control logic may cause the requested data block to be stored in the second cache 330 - also referred to as the long-term direct mapped cache 330.
  • the cache control logic may use a predetermined sampling pattern when determining which requested block of data is to be cached in the second cache.
  • the control logic may use a counter and may store to long term cache each Nth (e.g. 5th, 7th, 10th, etc.) block of requested data.
  • the control logic may use a lookup table which may indicate which of one or more requested blocks of data in a set of requested data blocks is to be placed in the second cache.
  • the lookup table may include entries indicating insertion into the second cache of the 2nd, 14th, 17th and 25th requested data block from each group/set of 30 data block requests by a caching client.
  • the first cache may be operated using an LRU algorithm and may include a buffer portion or sub-buffer (hereinafter referred to as a "sub-buffer") adapted to store information relating to recent activity in the first cache, for example - recent memory writes to the first cache and recent memory reads from the first cache.
  • the buffer portion or sub-buffer may be updated each time the first cache is operated.
  • the sub-buffer may be checked first for the given data, or for any information relating to the given data which may reduce the time, power, or any other overheads associated with searching for the data in the entire first cache.
  • the sub-buffer which may also be referred to as a recent activity sub-buffer may include information relating to recent data read/write activity in the first cache, and may identify the recently operated upon data (written or read) and its location (e.g. pointer) in the first cache. If the given data, or information about its location in the first cache, is found in the sub-buffer, further examination of the first cache may be avoided. If, however, information about the given data is not found in the sub-buffer, further examination or scanning of the first cache may be required for the given data.
  • An exemplary embodiment of a fully-associative cache containing a sub-buffer is shown in Fig. 8.
  • the caching logic 310 may first scan the first and/or the second caches so as to determine whether a copy of any of the one or more requested data blocks may be present in the first or second caches. According to some embodiments of the present invention, the caching logic 310 may first scan the second (long-term) cache 330. If the requested data is found in the second cache 330, no further scanning may be required. If the data is not found in the second cache 330, the caching logic 310 may proceed to scan the first cache 320. Scanning of the second and first caches 320 may be performed by any scanning methodology known today or to be devised in the future.
  • the first and/or the second caches may be scanned using a recent activity sub-buffer, as described above. If the requested data is found in the first cache 320, the data may be provided to the caching client and the data may also be moved from the first cache 320 to the second (long-term) cache 330 based on an outcome of a probabilistic/sampling process such as the one described above. [0064] If the requested data is not found in the first cache, the caching logic may cause the requested data to be retrieved from the main memory, through an interface to external memory 312, and stored either in the first or second caches.
  • the caching logic may determine whether to store the data retrieved from main memory using a probabilistic process, such as the ones described above, or by using any other probabilistically based process known today or to be devised in the future, or by any sampling processes known today or to be devised in the future.
  • the caching logic may determine whether to store the data retrieved from main memory using a predefined sampling pattern, such as the ones described above, or by using any other predefined sampling pattern process known today or to be devised in the future.
  • the cache may be used by the central processing unit of a computer to reduce the average time to access memory.
  • the cache is generally a smaller, faster memory which stores copies of the data from the most frequently used main memory locations.
  • the processor may first check whether a copy of that data is in the cache. If so, the processor may immediately read from or write to the cache.
  • the memory reference workload of a CPU/cache combination may be characterized using a statistical phenomenon called mass-count disparity.
  • a random sampling L1 filtered cache may use a simple coin toss to preferentially insert frequently used blocks into a long term cache, thus reducing the number of conflict misses in the cache; and the rest of the references may be serviced from the filter/cache itself, which filter/cache may be a small fully-associative auxiliary structure.
  • Locality of reference is a phenomenon in computer workloads. Temporal locality in particular may occur because of two properties of reference streams: that some addresses are much more popular than others, and that accesses are batched rather than being random. Importantly, references to blocks that are seldom accessed are also grouped together; such blocks are referred to as transient.
  • a good way to visualize skewed popularity is by using mass-count disparity plots. These plots may superimpose two distributions. The first, which is called the count distribution, is a distribution on addresses, and specifies how many times each address is referenced. Thus Fc(x) will represent the probability that an address is referenced x times or less. The second, called the mass distribution, is a distribution on references; it specifies the popularity of the address to which the reference pertains. Thus Fm(x) will represent the probability that a reference is directed at an address that is referenced x times or less.
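  • Restating the two distributions in symbols (a paraphrase of the preceding definitions, not notation taken from the patent):

```latex
\begin{align*}
F_c(x) &= \Pr[\text{an address is referenced at most } x \text{ times}] && \text{(count distribution)}\\
F_m(x) &= \Pr[\text{a reference targets an address referenced at most } x \text{ times}] && \text{(mass distribution)}
\end{align*}
```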
  • FIG. 5 shows the distributions of residency lengths and those of the references serviced by each residency length for 4 select SPEC2000 benchmarks using a 16K direct-mapped cache. The figure shows the distributions of both data and instruction streams.
  • the term mass-count disparity refers to the fact that the distributions of residency lengths (count) and the number of references serviced by each residency length (mass) may be quite distinct, as shown in Fig. 5.
  • the divergence between the distributions can be quantified by the joint ratio, which is a generalization of, for example, the proverbial 20/80 principle: this is the unique point in the graphs where the sum of the two CDFs is 1. In the case of the vortex data stream graph, for example, the joint ratio is approximately 13/87 (double-arrow at middle of plot). This means that 13% of the residencies, and more specifically the longest ones, get a full 87% of the references, whereas the remaining 87% of the residencies get only 13% of the references.
  • a W1/2 metric may be used to assess the combined weight of the half of the residencies that receive few references. For vortex, 50% of the residencies together get only 3% of the references (left down-pointing arrow). Thus these are instances of blocks that are inserted into the cache but hardly used, and should actually not be allowed to pollute the cache. Rather, the cache should be used preferentially to store longer residencies, such as those that together account for 50% of the references.
  • the number of highly-referenced residencies servicing half the references is quantified by an N1/2 metric; for vortex it is less than 1% of all residencies (right up-pointing arrow).
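  • One way to formalize the three metrics above, using the Fc/Fm notation introduced earlier (the formulas are a restatement of the verbal definitions, with the vortex values quoted from the text, not expressions appearing in the patent):

```latex
\begin{align*}
\text{joint ratio:}\quad & x^{*} \ \text{such that}\ F_c(x^{*}) + F_m(x^{*}) = 1 && \text{(vortex data stream: } \approx 13/87\text{)}\\
W_{1/2} &= F_m\!\bigl(F_c^{-1}(0.5)\bigr) && \text{(vortex: } \approx 3\%\text{)}\\
N_{1/2} &= 1 - F_c\!\bigl(F_m^{-1}(0.5)\bigr) && \text{(vortex: } < 1\%\text{)}
\end{align*}
```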
  • Mass-count disparity implies that a small fraction of all L1 cache residencies service the majority of references. Servicing these residencies from a fast, low-power, direct-mapped cache, while using an auxiliary buffer for short, transient residencies, may yield both performance and power gains, as the small number of long residencies will minimize the number of conflict misses to which direct-mapped caches are so susceptible.
  • One approach to designing a residency predictor may be to use random sampling. By, for example, sampling references uniformly (a Bernoulli trial) with a relatively low probability P, short residencies will have a very low probability of being selected. But given that a single sample is enough to classify a residency as a long one, the probability that a residency is chosen within n references is 1 - (1 - P)^n.
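  • For instance, with P = 0.05 (one of the data-stream sampling probabilities examined below), this selection probability, written out below, is roughly 0.10 for a transient residency of n = 2 references but about 0.99 for a long residency of n = 100 references:

```latex
\Pr[\text{selected within } n \text{ references}] = 1 - (1 - P)^{n},
\qquad 1 - 0.95^{2} \approx 0.10,
\qquad 1 - 0.95^{100} \approx 0.99 .
```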
  • Fig. 9 shows the distributions of the miss-rate achieved by a filtered 16K, DM cache (both cache and filter misses) compared to that achieved by a regular 16K direct-mapped cache, for various Bernoulli success probabilities. Lower values are better, indicating decreased miss rate.
  • the data shown for each combination are a summary of the observed change in miss rate over all benchmarks simulated: the distribution's middle range (25%-75%), average, median and min/max values. An ideal combination would yield maximal overall miss-rate reduction with a dense distribution, i.e.
  • Figure 9 shows that the best average reduction in data miss-rate is ⁇ 25%, and may be achieved for P values of 0.05 to 0.2. Moreover, this average improvement is not the result of a single benchmark skewing the distribution: when comparing the center of these distributions— the 25%-75% box— we can see the entire distribution is moved downwards. The same can be said about the miss-rate reduction in the instruction stream, for which selection probabilities of 0.0001 to 0.01 all achieve an average improvement of ⁇ 60%. In this case as well the best averages may be achieved for probabilities that shift the entire distribution downwards.
  • Fig. 10 compares distributions of reference masses—the fraction of references serviced by each residency length of the filtered 16K cache and the original 16K direct-mapped cache. Results are shown for select SPEC benchmarks with Bernoulli success probabilities of 0.05 for data streams and 0.0005 for instruction streams. These probabilities were chosen based on the results described in Sect. 1.
  • Each plot shows three lines: the distributions for the cache and filter for the filtered design, and the distribution for a conventional direct-mapped cache—which is the combination of the first two (this is the same distribution as the one shown in Fig. 5).
  • the median value of each distribution is marked with a down pointing arrow.
  • the distributions show that the majority of references directed at the filter are serviced by residencies much shorter than those serving the majority of the references directed at the cache proper.
  • the average medians ratio for all data streams is ~320, and ~50,000 for the instruction streams — indicating a clear distinction between residency lengths in the cache and the filter.
  • the second metric is denoted as the false-* equilibrium, and is an estimate of false predictions: any given residency length threshold we choose in hindsight will show up on the plot as a vertical line, with a fraction of the cache's distribution to its left indicating the false-positives (short residencies promoted to the cache), and a fraction of the filter's distribution to its right indicating the false-negatives (long residencies remaining in the filter). Obviously, choosing another threshold will either increase the fraction of false-positives and decrease the fraction of false-negatives, or vice versa.
  • the false-* equilibrium is a unique threshold that if chosen, generates equal percentages of false-positives and false-negatives, thereby serving as an upper bound for overall percentage of false predictions.
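  • In symbols (again only a restatement of the verbal definition above, where F_cache and F_filter denote the residency-length CDFs of the cache and the filter), the false-* equilibrium is the threshold t* at which the two error fractions coincide:

```latex
F_{\mathrm{cache}}(t^{*}) \;=\; 1 - F_{\mathrm{filter}}(t^{*})
```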
  • FIG. 11 shows the percentage of references serviced by the cache, compared with the percentage of blocks promoted into the cache, for various probabilities. Considering the mass-count disparity, we expect that promoting frequently accessed blocks into the cache will result in a substantial increase in the number of references it will service, and that promoting not-so-frequently used blocks has a smaller impact on the number of references serviced by the cache. This is indeed evident in Fig. 11.
  • a fully-associative filter introduces longer access latencies and increased power consumption.
  • the SLB is a small direct-mapped cache structure mapping block tags directly to the filter's SRAM-based data store, thus avoiding the majority of the costly CAM lookups, while still maintaining fully-associative semantics.
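  • A minimal structural sketch of such an SLB follows (the entry count, field names and the C model itself are assumptions made for illustration, not the patent's design): a small direct-mapped table maps a block tag to a slot in the filter's data store, so that a hit in the table bypasses the CAM search entirely.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLB_ENTRIES 16u              /* small, direct-mapped (assumed size)       */

typedef struct {
    uint32_t tag;                    /* block tag of a recently touched block     */
    uint16_t filter_slot;            /* index into the filter's SRAM data store   */
    bool     valid;
} slb_entry_t;

static slb_entry_t slb[SLB_ENTRIES];

/* Returns the filter slot for 'tag' if the SLB knows it, else -1, in which
 * case the caller falls back to the full CAM lookup.                        */
int slb_lookup(uint32_t tag)
{
    slb_entry_t *e = &slb[tag % SLB_ENTRIES];
    return (e->valid && e->tag == tag) ? (int)e->filter_slot : -1;
}

/* Record a filter access so the next reference to the same block can
 * bypass the CAM.                                                           */
void slb_update(uint32_t tag, uint16_t filter_slot)
{
    slb_entry_t *e = &slb[tag % SLB_ENTRIES];
    e->tag = tag;
    e->filter_slot = filter_slot;
    e->valid = true;
}
```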
  • This section explores the SLB design space.
  • Fig. 12 shows the stack depth distributions of filter accesses, for the different SPEC benchmarks, as well as the average distribution over all benchmarks (the various benchmarks are not individually marked, as only the clustering of distributions matters in this context).
  • the reduced miss-rate achieved by the random sampling design combined with a low-latency, low-power, direct-mapped cache potentially offers both improved performance and reduced power consumption. Augmenting the fully-associative filter with an SLB can reduce the overheads incurred by the filter, further improving efficiency.
  • Fig. 13 shows the IPC improvement achieved by a random sampling cache over a similar size 4-way associative cache, for the SPEC benchmarks. The figure shows consistent improvements (up to ⁇ 35% for a 16K random sampling cache and ⁇ 28% for 32K one), with an average overall IPC improvement of over ⁇ 10% for both 16K and 32K configurations.
  • L1 latency is 2 cycles for set-associative and fully-associative caches
  • Fig. 14 compares the average performance achieved with 16K and 32K random sampling caches to that of common cache structures. It shows that a direct-mapped random sampling filtered cache achieves significantly better performance than a similar size set-associative cache. Moreover, a random sampling cache can even gain better overall performance than larger, more expensive caches. For example, the IPC of a 16K-DM random sampling cache is 5% higher than that of a 32K-4way cache; that of a 32K-DM random sampling cache is more than 7% higher than that of a 64K-4way cache. Likewise, using the extra 2K for a filter yields better performance than using it as a victim buffer, indicating that even such a relatively large victim buffer may be swamped by transient blocks.
  • the IPC improvement is similar when comparing the 16K-DM random sampling cache to both a regular 16K-DM cache and a 16K-
  • the average dynamic energy consumption is simply the aggregate energy (the sum, over all components, of the number of accesses x the access energy of that component) divided by the overall number of hits. Even simpler, the leakage power consumed by the random sampling cache is the sum of the leakage power consumed by all components.
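  • Written out (merely restating the accounting described above; the symbols are this sketch's, not the patent's):

```latex
E_{\mathrm{dyn,avg}} \;=\; \frac{\sum_{c\,\in\,\text{components}} N^{\mathrm{access}}_{c}\cdot E^{\mathrm{access}}_{c}}{N^{\mathrm{hits}}_{\mathrm{total}}},
\qquad
P_{\mathrm{leak}} \;=\; \sum_{c\,\in\,\text{components}} P^{\mathrm{leak}}_{c}.
```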
  • Fig. 15 shows both dynamic read energy and leakage power consumed by the random sampling cache, compared to common cache configurations (same as those in Fig. 14). Obviously, the power consumed by the random sampling cache is higher than that of a simple direct-mapped cache, because of the fully-associative filter: ~30% more dynamic energy and ~15% more leakage power for a 16K random sampling cache (and just over half that for a 32K cache). However, when comparing a random sampling cache to a more common 4-way associative cache of a similar size, the 16K random sampling cache design consumes almost 70%-80% less dynamic energy, with only 5% more leakage power. The 32K configuration yields 60%-
  • a method of distinguishing transient blocks from frequently used blocks thereby enabling servicing references to transient blocks from a small fully-associative auxiliary cache structure.
  • a probabilistic filtering mechanism may use random sampling to identify and select the frequently used blocks.
  • a 16K direct-mapped L1 cache augmented with a fully-associative 2K filter, arranged and operated according to some embodiments of the present invention, may achieve on average over 10% more instructions per cycle than a regular 16K, 4-way set-associative cache, and even ~5% more IPC than a 32K, 4-way cache, while consuming 70%-80% less dynamic power than either of them.
  • similar and/or better results may be achieved for large caches.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention relates to methods, architectures, circuits and systems for providing caching. According to some embodiments of the present invention, there may be provided a first cache portion adapted to be operated according to a first caching algorithm, a second cache portion adapted to be operated according to a second caching algorithm, and cache control logic adapted to determine whether to insert data requested by a caching client into either the first or the second cache portion based on either a probabilistic insertion policy or based on a predefined sampling pattern.

Description

Method Architecture Circuit & System for Providing
Caching
FIELD OF INVENTION
[001] The present invention generally relates to the fields of data processing. More specifically, the present invention relates to methods, architectures, circuits and systems for providing caching.
BACKGROUND
[002] Use of the term "cache" in the computer context originated in 1967, during preparation of an article for publication in the IBM Systems Journal. The paper concerned an exciting memory improvement in Model 85, a latecomer in the IBM System/360 product line. The Journal editor, Lyle R. Johnson, pleaded for a more descriptive term than "high-speed buffer"; when none was forthcoming, he suggested "Cache" (from the French "cacher", meaning to hide). A cache is a block of memory for temporary storage of data likely to be used again. The CPU and hard drive frequently use a cache, as do web browsers and web servers.
[003] A cache is made up of a pool of entries. Each entry has a datum (a nugget of data) which is a copy of the datum in some backing store. Each entry also has a tag, which specifies the identity of the datum in the backing store of which the entry is a copy.
[004] When the cache client (a CPU, web browser, operating system) wishes to access a datum presumably in the backing store, it first checks the cache. If an entry can be found with a tag matching that of the desired datum, the datum in the entry is used instead. This situation is known as a cache hit. So, for example, a web browser program might check its local cache on disk to see if it has a local copy of the contents of a web page at a particular URL. In this example, the URL is the tag, and the content of the web page is the datum. The percentage of accesses that result in cache hits is known as the hit rate or hit ratio of the cache.
[005] The alternative situation, when the cache is consulted and found not to contain a datum with the desired tag, is known as a cache miss. The previously uncached datum fetched from the backing store during miss handling is usually copied into the cache, ready for the next access. [006] During a cache miss, the cache controller usually ejects some other entry in order to make room for the previously uncached datum. The heuristic used to select the entry to eject is known as the replacement policy. One popular replacement policy, least recently used (LRU), replaces the least recently used entry. More efficient caches compute use frequency against the size of the stored contents, as well as the latencies and throughputs for both the cache and the backing store. While this works well for larger amounts of data, long latencies, and slow throughputs, such as experienced with a hard drive and the Internet, it's not efficient to use this for cached main memory (RAM).
[007] When a datum is written to the cache, it must at some point be written to the backing store as well. The timing of this write is controlled by what is known as the write policy. In a write-through cache, every write to the cache causes a synchronous write to the backing store. [008] Alternatively, in a write-back (or write-behind) cache, writes are not immediately mirrored to the store. Instead, the cache tracks which of its locations have been written over (these locations are marked dirty). The data in these locations is written back to the backing store when those data are evicted from the cache. For this reason, a miss in a write-back cache (which requires a block to be replaced by another) may require two memory accesses to service: one to retrieve the needed datum, and one to write replaced data from the cache to the store.
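As a rough illustration of the two write policies just described (the line layout and helper names below are assumptions of this sketch and are not taken from the patent), a write-through design forwards every write to the backing store immediately, whereas a write-back design only marks the line dirty and flushes it on eviction, which is why a write-back miss may cost two memory accesses:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u                      /* assumed line size */

typedef struct {
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
    bool     valid;
    bool     dirty;                         /* used only by the write-back policy */
} cache_line_t;

extern void backing_store_write(uint32_t addr, const uint8_t *data);  /* assumed helper */

/* Write-through: every cache write is mirrored synchronously to the store. */
void write_through(cache_line_t *line, uint32_t addr, const uint8_t *src)
{
    memcpy(line->data, src, LINE_BYTES);
    backing_store_write(addr, line->data);
}

/* Write-back: only mark the line dirty; the store is updated lazily. */
void write_back(cache_line_t *line, const uint8_t *src)
{
    memcpy(line->data, src, LINE_BYTES);
    line->dirty = true;
}

/* On eviction a dirty line must be flushed before it is replaced. */
void evict(cache_line_t *line, uint32_t old_addr)
{
    if (line->valid && line->dirty)
        backing_store_write(old_addr, line->data);
    line->valid = false;
    line->dirty = false;
}
```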
[009] Data write-back may be triggered by other policies as well. The client may make many changes to a datum in the cache, and then explicitly notify the cache to write back the datum.
[0010] No-write allocation is a cache policy where only processor reads are cached, thus avoiding the need for write-back or write-through when the old value of the datum was absent from the cache prior to the write.
[0011] The data in the backing store may be changed by entities other than the cache, in which case the copy in the cache may become out-of-date or stale. Alternatively, when the client updates the data in the cache, copies of that data in other caches will become stale. Communication protocols between the cache controllers which keep the data consistent are known as coherency protocols.
[0012] The diagram on Fig. 1 shows two memories. Each location in each memory has a datum (a cache line), which in different designs ranges in size from 8 to 512 bytes. The size of the cache line is usually larger than the size of the usual access requested by a CPU instruction, which ranges from 1 to 16 bytes. Each location in each memory also has an index, which is a unique number used to refer to that location. The index for a location in main memory is called an address. Each location in the cache has a tag which contains the index of the datum in main memory which has been cached. In a
CPU's data cache these entries are called cache lines or cache blocks.
[0013] Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation look aside buffer used to speed up virtual-to-physical address translation for both executable instructions and data.
[0014] Looking at Fig. 2, there are shown two possible caching methodologies relating to the cache's replacement policy. The replacement policy decides where in the cache a copy of a particular entry of main memory will go. If the replacement policy is free to choose any entry in the cache to hold the copy, the cache is called fully associative. At the other extreme, if each entry in main memory can go in just one place in the cache, the cache is direct mapped. Many caches implement a compromise, and are described as set associative. For example, the level-1 data cache in an AMD Athlon is 2-way set associative, which means that any particular location in main memory can be cached in either of 2 locations in the level-1 data cache. [0015] Associativity is a trade-off. As evident from examination of Fig. 2, if there are ten places the replacement policy can put a new cache entry, then when the cache is checked for a hit, all ten places must be searched. Checking more places takes more power, area, and potentially time. On the other hand, caches with more associativity suffer fewer misses, so that the CPU spends less time servicing those misses. The rule of thumb is that doubling the associativity, from direct mapped to 2-way, or from 2-way to 4- way, has about the same effect on hit rate as doubling the cache size. Associativity increases beyond 4-way have much less effect on the hit rate, and are generally done for other reasons.
[0016] Listed below are various cache types, along with their hit time and miss rate characteristics:
1. Direct mapped cache ~ the best (fastest) hit times, and so the best tradeoff for "large" caches;
2. 2-way set associative cache;
3. 2-way skewed associative cache ~ "the best tradeoff for caches whose sizes are in the range 4K-8K bytes";
4. 4-way set associative cache; and
5. fully associative cache ~ the best (lowest) miss rates, and so the best tradeoff when the miss penalty is very high
[0017] One of the advantages of a direct mapped cache is that it allows simple and fast speculation. Once the address has been computed, the one cache index which might have a copy of that datum is known. That cache entry can be read, and the processor can continue to work with that data before it finishes checking that the tag actually matches the requested address.
[0018] The idea of having the processor use the cached data before the tag match completes can be applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just one of the possible cache entries mapping to the requested address. This datum can then be used in parallel with checking the full tag. The hint technique works best when used in the context of address translation, as explained below.
[0019] Other schemes have been suggested, such as the skewed cache, where the index for way 0 is direct, as above, but the index for way 1 is formed with a hash function. A good hash function has the property that addresses which conflict with the direct mapping tend not to conflict when mapped with the hash function, and so it is less likely that a program will suffer from an unexpectedly large number of conflict misses due to a pathological access pattern. The downside is extra latency from computing the hash function. Additionally, when it comes time to load a new line and evict an old line, it may be difficult to determine which existing line was least recently used, because the new line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is usually done on a per-set basis. [0020] The increasing gap between processor and memory speeds witnessed in recent years has exacerbated the CPU's dependency on the memory system performance — and especially that of Ll caches with which the CPU interfaces directly. One result of this ongoing trend is the increase in the capacity of Ll and L2 caches. This improvement, however, also increased the power consumed by the caches — estimated at more than 10% of the overall power consumed by a general purpose CPU, and up to 40% of the power consumed by CPUs designed for embedded systems. Hence today, as processor power consumption is also becoming a major concern, the power-performance tradeoff is ever more important. [0021] Memory usage is known to be highly skewed, with most references directed at a relatively small subset of the address space. By identifying these references and servicing them using power-efficient, direct-mapped Ll caches, we can potentially increase CPU performance while at the same time reducing the power consumption.
[0022] Direct-mapped caches are faster and consume less energy than set-associative caches typically used in L1 caches. However, they are more susceptible to conflict misses than set-associative caches, thus suffering higher miss-rates and achieving lower performance. This deficiency led to abandoning direct-mapped L1 caches in favor of set-associative ones in practically all but embedded processors.
[0023] There is a need in the fields of computing and data caching for improved methods, circuits, devices and system for memory caching.
SUMMARY OF INVENTION
[0024] The present invention is a method, architecture, circuit and system for providing caching of data. According to some embodiments of the present invention, there may be provided a first cache, which first cache may be part of a cache architecture including a second cache and cache control logic. The first cache and cache architecture may be integrally associated with a processor (e.g. Central Processing Unit - CPU) and in such cases may be referred to as processor cache. However, according to further embodiments of the present invention, features of the first cache and of the overall caching architecture described in the present application may be functionally associated with any data caching application or any data caching client (e.g. CPU, disk cache, etc.) known today or to be devised in the future. [0025] The first cache may be referred to as a bypass cache or bypass filter, and may buffer data that is retrieved from an external memory source such as a computing platform's main random access memory ("RAM") or from a nonvolatile memory ("NVM") device functionally associated with the computing platform. The first cache may include a set of memory cells adapted to store both retrieved data and mapping information (e.g. addresses of the received data in the main memory) relating to the retrieved data. The first cache may be operated according to any suitable caching technique or algorithm, known today or to be devised in the future, including:
Fully associative cache
2-way set associative: for high-speed CPU caches where even PLRU is too slow. The address of a new item is used to calculate one of two possible locations in the cache where it is allowed to go. The LRU of the two is discarded. This requires one bit per pair of cache lines, to indicate which of the two was the least recently used.
Least Recently Used (LRU): discards the least recently used items first.
Most Recently Used (MRU): discards, in contrast to LRU, the most recently used items first.
Pseudo-LRU (PLRU): For caches with large associativity (generally >4 ways), the implementation cost of LRU becomes prohibitive. If a probabilistic scheme that almost always discards one of the least recently used items is sufficient, the PLRU algorithm can be used which only needs one bit per cache item to work.
Direct-mapped cache: for the highest-speed CPU caches where even 2-way set associative caches are too slow. The address of the new item is used to calculate the one location in the cache where it is allowed to go. Whatever was there before is discarded.
Least Frequently Used (LFU): LFU counts how often an item is needed. Those that are used least often are discarded first. Adaptive Replacement Cache (ARC): constantly balances between LRU and LFU, to improve combined results.
[0026] The second cache, which may also be referred to as long-term cache, may also be operated according to any caching technique or algorithm known today or to be devised in the future, with direct mapping being a preferred mode of operation. The second cache may receive and store, either from the first cache or directly from main memory, data requested by a caching client such as a processor. According to some embodiments of the present invention, the second cache may include a set of memory cells adapted to store both received data and mapping information (e.g. addresses of the received data in the main memory) relating to the received data. [0027] It should be understood by one skilled in the art that the data retrieved into the first cache, may be retrieved from an external memory (RAM or NVM), which external memory may either be part of the main memory or main storage of a computing platform, or in the cases where the computing platform has a multilevel cache architecture, the external memory may be an external cache.
[0028] Cache control logic (e.g. a dedicated cache controller or cache control portion of a processor controller) may coordinate the movement of data between: (1) the first and second caches, (2) the caches and the caching client (e.g. processor), and (3) between external data sources or memory (e.g. main memory or NVM) and either of the caches. According to some embodiments of the present invention, the cache control logic may use a probabilistic method/process or a periodic sampling method/process to determine when data requested by a functionally associated caching client should be stored in the second cache. The method/process by which the caching logic determines when to store data in a given cache may be referred to as an insertion policy. Each time the caching client requests a block of data, the caching logic may cause the requested data to be stored in the second cache, for example, by first determining whether to store the data in the second cache and then using a direct mapping algorithm which maps specific locations in the main memory with locations in the second cache. The step of determining performed by the caching logic, or by any circuit functionally associated with the caching logic, may be based on a probabilistic model, a simple example of which is: "one in every 10 blocks of requested data will be requested again and thus should be cached." Thus, each time a block of data is requested by the caching client, as part of executing its probabilistic insertion policy, the caching controller can use a random number generator, or the like, to determine (i.e. guess) whether the requested block will be requested again. And, if the cache-control logic so determines (guesses), the control logic may cause the requested data block to be stored in the second cache - also referred to as the long-term cache. [0029] According to further embodiments of the present invention, the cache control logic may use a predetermined sampling pattern when determining which requested block of data is to be cached in the second cache. For example, the control logic may use a counter and may store to long term cache each Nth (e.g. 5th, 7th, 10th, etc..) block of requested data. According to further embodiments of the present invention, the control logic may use a lookup table which may indicate which of one or more requested blocks of data in a set of requested data blocks is to be placed in the second cache. For example, the lookup table may include entries indicating insertion into the second cache of the 2nd, 14th, 17th and 25th requested data block from each group/set of 30 data block requests by a caching client. [0030] According to some embodiments of the present invention, the first cache may be operated using a LRU algorithm and may include a buffer portion or sub-buffer (herein after referred to a "sub-buffer") adapted to store information relating to recent activity in the first cache, for example - recent memory writes to the first cache and recent memory reads from the first cache. The buffer portion or sub-buffer may be updated each time the first cache is operated. As part of checking the first cache for given data, the sub- buffer may be checked first for the given data , or for any information relating to the given data which may reduce the time, power, or any other overheads associated in searching for the data in the entire first cache. 
The sub-buffer, which may also be referred to as a recent activity sub-buffer, may include information relating to recent data read/write activity in the first cache, and may identify the recently operated upon data (written or read) and its location (e.g. pointer) in the first cache. If the given data, or information about its location in the first cache, is found in the sub-buffer, further examination of the first cache may be avoided. If, however, information about the given data is not found in the sub-buffer, further examination or scanning of the first cache may be required for the given data.
[0031] In response to a caching client request for one or more data blocks, the caching logic may first scan the first and/or the second caches so as to determine whether a copy of any of the one or more requested data blocks may be present in the first or second caches. According to some embodiments of the present invention, the caching logic may first scan the second (long-term) cache. If the requested data is found in the second cache, no further scanning may be required. If the data is not found in the second cache, the caching logic may proceed to scan the first cache. Scanning of the second and first caches may be performed by any scanning methodology known today or to be devised in the future. According to some embodiments of the present invention, the first and/or the second caches may be scanned using a recent activity sub-buffer, as described above. If the requested data is found in the first cache, the data may be provided to the caching client and the data may also be moved from the first cache to the second (long-term) cache based on an outcome of a probabilistic process such as the one described above.
[0032] If the requested data is not found in the first cache, the caching logic may cause the requested data to be retrieved from the main memory, through an interface to external memory, and stored either in the first or second caches. The caching logic may determine whether to store the data retrieved from main memory using a probabilistic process, such as the ones described above, or by using any other probabilistically based process known today or to be devised in the future, or by any sampling processes known today or to be devised in the future. The caching logic may determine whether to store the data retrieved from main memory using a predefined sampling pattern, such as the ones described above, or by using any other predefined sampling pattern process known today or to be devised in the future.
BRIEF DESCRIPTION OF THE DRAWINGS
[0033] The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
[0034] FIG. 1 is a diagram illustrating the basic correlation between a Main memory and a Cache memory, according to prior art;
[0035] FIG. 2 is a diagram illustrating the basic operation of a Direct Mapped
Cache Fill and a 2-Way Associative Cache Fill, according to prior art;
[0036] FIG. 3A and FIG. 3B are block diagrams illustrating the functional building blocks of exemplary cache architectures, in accordance with some embodiments of the present invention;
[0037] FIG. 4A and FIG. 4B are flow charts illustrating the data flow in an exemplary caching process, in accordance with some embodiments of the present invention;
[0038] FIG. 5 is a diagram presenting the distributions of the residency lengths and those of the references serviced by each residency length for 4 select SPEC2000 benchmarks using a 16K direct-mapped cache, in accordance with some embodiments of the present invention. The figure shows the distributions of both data and instruction streams;
[0039] FIG. 6 is a diagram of an exemplary Ll cache design that uses Bernoulli trials to distinguish frequently used blocks from transient ones, in accordance with some embodiments of the present invention; [0040] FIG. 7 is a diagram of the structure of a fully associative cache, according to prior art;
[0041] FIG. 8 is a diagram of an exemplary structure of a fully associative cache augmented by a sub-buffer (dubbed SLB in the figure) used to store recent activity information, in accordance with some embodiments of the present invention;
[0042] FIG. 9 is a diagram presenting an exemplary comparison of improvements in SPEC2000 instruction and data miss-rate distributions, using various sampling probabilities, for a 16K-DM cache, in accordance with some embodiments of the present invention. The boxes represent the 25%-75% percentile range, and the whiskers indicate the min/max values. Inside the box are the average (circle) and median (horizontal line); [0043] FIG. 10 is a diagram presenting an exemplary comparison of the data references' mass distributions in the filtered cache structure and the regular cache structure for select SPEC benchmarks using the ref input, for both data (top) and instruction (bottom), in accordance with some embodiments of the present invention. The horizontal double arrows show the median-to-median range, and the vertical double arrows show the false-* equilibrium point; [0044] FIG. 11 is an exemplary diagram presenting the percent of references serviced by the cache vs. the percent of blocks transferred from the filter into the cache for varying sampling probabilities (averages over all benchmarks), in accordance with some embodiments of the present invention. The horizontal lines near the top figure indicate the asymptotic maximum of references serviced by the cache. Cache size was 16K, with a 2K filter; [0045] FIG. 12 is an exemplary diagram presenting the distributions of filter access depth for all SPEC2000 benchmarks, and the average distribution, in accordance with some embodiments of the present invention. Note that the vast majority of accesses are focused around the MRU position (the specific behavior for each benchmark is irrelevant in this context, and are thus not individually marked);
[0046] FIG. 13 is an exemplary diagram presenting the IPC improvement for
DM random sampling caches (2K filter) over similar size 4-way caches, in accordance with some embodiments of the present invention;
[0047] FIG. 14 is an exemplary diagram presenting the average IPC improvement for 16K and 32K direct-mapped filtered caches over common cache configurations, in accordance with some embodiments of the present invention; and
[0048] FIG. 15 is an exemplary diagram presenting the Relative power consumption of the random sampling cache, compared to common cache designs (lower is better), for a 70nm process, in accordance with some embodiments of the present invention.
DETAILED DESCRIPTION
[0049] In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
[0050] Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as "processing", "computing", "calculating", "determining", or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices. [0051] Embodiments of the present invention may include apparatuses for performing the operations herein. Such apparatus may be specially constructed for the desired purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, readonly memories (ROMs), random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read only memories (EEPROMs), magnetic or optical cards, non-volatile solid state memories (FLASH), or any other type of media suitable for storing electronic instructions, and capable of being coupled to a computer system bus.
[0052] The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general- purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the desired method. The desired structure for a variety of these systems will appear from the description below. In addition, embodiments of the present invention are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the inventions as described herein.
GENERAL EMBODIMENTS
[0053] The present invention is a method, architecture, circuit and system for providing caching of data. According to some embodiments of the present invention, there may be provided a first cache, which first cache may be part of a cache architecture including a second cache and cache control logic. The first cache and cache architecture may be integrally associated with a processor (e.g. Central Processing Unit - CPU) and in such cases may be referred to as processor cache. However, according to further embodiments of the present invention, features of the first cache and of the overall caching architecture described in the present application may be functionally associated with any data caching application or any data caching client (e.g. CPU, disk cache, etc.) known today or to be devised in the future. [0054] Turning now to Figs. 3A and 3B, there are shown block diagrams of exemplary cache arrangements (i.e. architectures) 300 according to some embodiments of the present invention. The cache arrangements 300 may include a first cache 320, a second cache 330, a cache controller or caching logic 310, and an interface to external memory 312. The first cache 320 may be a fully associative cache, and may include a recent activity sub-buffer. The cache controller may be adapted to operate based on a probabilistic insertion policy (Fig. 3A) or a periodic (based on predefined sampling pattern) insertion policy (Fig. 3B), and may thus include or be functionally associated with a random number generator 314 (Fig. 3A) or with a circuit functionally equivalent to a random number generator, or with a counter or lookup table (Fig. 3B).
[0055] The operation of the cache arrangement, and each of the components within the cache arrangement shown in Fig. 3, may be described in conjunction with the flow chart of Fig. 4.
[0056] The first cache 320 may be referred to as a bypass cache or bypass filter, and may buffer data that is retrieved from an external memory source such as a computing platform's main random access memory ("RAM") or from a non-volatile memory ("NVM") device functionally associated with the computing platform. The first cache 320 may include a set of memory cells adapted to store both retrieved data and mapping information (e.g. addresses of the received data in the main memory) relating to the retrieved data. The first cache 320 may be operated according to any suitable caching technique or algorithm, known today or to be devised in the future, including:
Fully associative cache
2-way set associative: for high-speed CPU caches where even PLRU is too slow. The address of a new item is used to calculate one of two possible locations in the cache where it is allowed to go. The LRU of the two is discarded. This requires one bit per pair of cache lines, to indicate which of the two was the least recently used.
Least Recently Used (LRU): discards the least recently used items first.
Most Recently Used (MRU): discards, in contrast to LRU, the most recently used items first.
Pseudo-LRU (PLRU): For caches with large associativity (generally >4 ways), the implementation cost of LRU becomes prohibitive. If a probabilistic scheme that almost always discards one of the least recently used items is sufficient, the PLRU algorithm can be used which only needs one bit per cache item to work.
Direct-mapped cache: for the highest-speed CPU caches where even 2-way set associative caches are too slow. The address of the new item is used to calculate the one location in the cache where it is allowed to go. Whatever was there before is discarded.
Least Frequently Used (LFU): LFU counts how often an item is needed. Those that are used least often are discarded first.
Adaptive Replacement Cache (ARC): constantly balances between LRU and LFU, to improve combined results.
[0057] The second cache 330, which may also be referred to as long-term cache, may also be operated according to any caching technique or algorithm known today or to be devised in the future, with direct mapping being a preferred algorithm. The second cache 330 may receive and store, either from the first cache, directly from main memory or from any other data storage device, data requested by a caching client 340 such as a processor. According to some embodiments of the present invention, the second cache 330 may include a set of memory cells adapted to store both received data and mapping information (e.g. addresses of the received data in the main memory) relating to the received data.
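By way of illustration only, the following C sketch shows how a direct-mapped lookup of the kind preferred for the second (long-term) cache 330 is typically indexed: each main-memory block maps to exactly one set, and a stored tag disambiguates the blocks that share that set. The structure names, the 64B line size and the 256-set capacity are assumptions made for this example and are not parameters of any particular embodiment.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64u          /* assumed block size in bytes            */
#define NUM_SETS  256u         /* assumed number of direct-mapped sets   */

typedef struct {
    bool     valid;
    uint32_t tag;              /* identifies which memory block is held  */
    uint8_t  data[LINE_SIZE];  /* cached copy of the block               */
} cache_line_t;

typedef struct {
    cache_line_t set[NUM_SETS]; /* one line per set: direct mapping      */
} long_term_cache_t;

/* Each main-memory block maps to exactly one set; the stored tag
 * distinguishes the many blocks that share that set. */
static bool lt_lookup(long_term_cache_t *c, uint32_t addr, uint8_t *out)
{
    uint32_t block = addr / LINE_SIZE;
    uint32_t index = block % NUM_SETS;
    uint32_t tag   = block / NUM_SETS;

    cache_line_t *line = &c->set[index];
    if (line->valid && line->tag == tag) {
        memcpy(out, line->data, LINE_SIZE);   /* hit */
        return true;
    }
    return false;                             /* miss */
}

/* Insertion simply overwrites whatever previously occupied the set. */
static void lt_insert(long_term_cache_t *c, uint32_t addr, const uint8_t *blk)
{
    uint32_t block = addr / LINE_SIZE;
    cache_line_t *line = &c->set[block % NUM_SETS];
    line->valid = true;
    line->tag   = block / NUM_SETS;
    memcpy(line->data, blk, LINE_SIZE);
}
```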
[0058] It should be understood by one skilled in the art that the data retrieved into the first cache 320, may be retrieved from an external memory (RAM or NVM), which external memory may either be part of the main memory or main storage of a computing platform, or in the cases where the computing platform has a multilevel cache architecture, the external memory may be an external cache. [0059] Cache control logic 310 (e.g. a dedicated cache controller or cache control portion of a processor controller) may coordinate the movement of data between: (1) the first and second caches, (2) the caches and the caching client 340 (e.g. processor), and (3) between external data sources or memory (e.g. main memory or NVM) and either of the caches.
[0060] According to some embodiments of the present invention, the cache control logic 310 may use a probabilistic method/process or a periodic sampling method/process to determine when data requested by a functionally associated caching client 340 should be stored in the second cache 330. The method/process by which the caching logic 310 determines when to store data on a given cache may be referred to as an insertion policy. Each time the caching client 340 requests a block of data, the caching logic 310 may cause the requested data to be stored in the second cache 330, for example, by first determining whether to store the data in the second cache 330 and then using a direct mapping algorithm which maps specific locations in the main memory with locations in the second cache 330. The step of determining performed by the caching logic, or by any circuit functionally associated with the caching logic 314, may be based on a probabilistic model, a simple example of which is: "one in every 10 blocks of requested data will be requested again and thus should be cached." Thus, each time a block of data is requested by the caching client 340, as part of executing its probabilistic insertion policy, the caching controller 310 can use a random number generator 314, or the like, or a periodic sampling process to determine (i.e. guess) whether the requested block will be requested again. And, if the cache control 310 logic so determines (guesses), the control logic may cause the requested data block to be stored in the second cache 330 - also referred to as the long-term direct mapped cache 330.
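A minimal sketch of such a probabilistic insertion decision is given below, assuming a software pseudo-random source in place of random number generator 314; the probability value of 0.1 simply mirrors the "one in every 10 blocks" example above and is not a recommended setting.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Probability that a requested block is promoted into the long-term
 * cache; 0.1 corresponds to "one in every 10 blocks" and is used here
 * only for illustration. */
#define INSERT_PROBABILITY 0.1

/* One independent Bernoulli trial per request: returns true when the
 * control logic "guesses" that the requested block will be requested
 * again and should therefore be stored in the second (long-term) cache. */
static bool should_insert_probabilistic(void)
{
    double u = (double)rand() / ((double)RAND_MAX + 1.0); /* uniform in [0,1) */
    return u < INSERT_PROBABILITY;
}
```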
[0061] According to further embodiments of the present invention, the cache control logic may use a predetermined sampling pattern when determining which requested block of data is to be cached in the second cache. For example, the control logic may use a counter and may store to long term cache each Nth (e.g. 5th, 7th, 10th, etc..) block of requested data. According to further embodiments of the present invention, the control logic may use a lookup table which may indicate which of one or more requested blocks of data in a set of requested data blocks is to be placed in the second cache. For example, the lookup table may include entries indicating insertion into the second cache of the 2nd, 14th, 17th and 25th requested data block from each group/set of 30 data block requests by a caching client. [0062] According to some embodiments of the present invention, the first cache may be operated using a LRU algorithm and may include a buffer portion or sub-buffer (herein after referred to a "sub-buffer") adapted to store information relating to recent activity in the first cache, for example - recent memory writes to the first cache and recent memory reads from the first cache. The buffer portion or sub-buffer may be updated each time the first cache is operated. As part of checking the first cache for given data, the sub- buffer may be checked first for the given data, or for any information relating to the given data which may reduce the time, power, or any other overheads associated in searching for the data in the entire first cache. The sub-buffer, which may also be referred to as a recent activity sub-buffer may include information relating to recent data read/write activity in the first cache, and may identify the recently operated upon data (written or read) and its location (e.g. pointer) in the first cache. If the given data, or information about its location in the first cache, is found in the sub-buffer, further examination of the first cache may be avoided. If, however, information about the given data is not found in the sub-buffer, further examination or scanning of the first cache may be required for the given data. An exemplary embodiment of a fully-associative cache containing a sub-buffer is shown in Fig. 8. [0063] In response to a caching client 340 request for one or more data blocks, the caching logic 310 may first scan the first and/or the second caches so as to determine whether a copy of any of the one or more requested data blocks may be present in the first or second caches. According to some embodiments of the present invention, the caching logic 310 may first scan the second (long-term) cache 330. If the requested data is found in the second cache 330, no further scanning may be required. If the data is not found in the second cache 330, the caching logic 310 may proceed to scan the first cache 320. Scanning of the second and first caches 320 may be performed by any scanning methodology known today or to be devised in the future. According to some embodiments of the present invention, the first and/or the second caches may be scanned using a recent activity sub-buffer, as described above. If the requested data is found in the first cache 320, the data may be provided to the caching client and the data may also be moved from the first cache 320 to the second (long-term) cache 330 based on an outcome of a probabilistic/sampling process such as the one described above. 
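Returning to the counter-based and lookup-table-based patterns of paragraph [0061], the two deterministic variants might be realized, purely for illustration, as follows. The value N = 10 and the 30-request table with positions 2, 14, 17 and 25 follow the examples given above; everything else is an arbitrary assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Counter-based pattern: insert every Nth requested block into the
 * long-term cache. N = 10 is one of the example values named above. */
static bool should_insert_every_nth(void)
{
    static uint32_t counter = 0;
    counter++;
    if (counter == 10) {
        counter = 0;
        return true;
    }
    return false;
}

/* Lookup-table pattern over a window of 30 requests: the 2nd, 14th,
 * 17th and 25th requests in each window are selected (0-based indices
 * 1, 13, 16 and 24), matching the example in the text. */
static bool should_insert_pattern(void)
{
    static const bool pattern[30] = {
        [1] = true, [13] = true, [16] = true, [24] = true
    };
    static uint32_t pos = 0;
    bool insert = pattern[pos];
    pos = (pos + 1) % 30;
    return insert;
}
```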
[0064] If the requested data is not found in the first cache, the caching logic may cause the requested data to be retrieved from the main memory, through an interface to external memory 312, and stored either in the first or second caches. The caching logic may determine whether to store the data retrieved from main memory using a probabilistic process, such as the ones described above, or by using any other probabilistically based process known today or to be devised in the future, or by any sampling processes known today or to be devised in the future. The caching logic may determine whether to store the data retrieved from main memory using a predefined sampling pattern, such as the ones described above, or by using any other predefined sampling pattern process known today or to be devised in the future.
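Tying the pieces together, an illustrative sketch of the overall lookup and insertion flow of paragraphs [0063] and [0064] might look as follows. The helper functions are placeholders standing in for the two caches, the interface to external memory 312 and whichever insertion policy is chosen; their names and signatures are assumptions, not part of the disclosure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder helpers; definitions would be supplied elsewhere. */
bool long_term_lookup(uint32_t addr, uint8_t *out);       /* second cache      */
void long_term_insert(uint32_t addr, const uint8_t *blk);
bool filter_lookup(uint32_t addr, uint8_t *out);          /* first/bypass cache */
void filter_insert(uint32_t addr, const uint8_t *blk);
void external_fetch(uint32_t addr, uint8_t *blk);         /* main memory / next level */
bool insertion_policy_selects(void);                      /* probabilistic or patterned */

/* Service one request from the caching client. */
void service_request(uint32_t addr, uint8_t *out)
{
    /* 1. The long-term cache is scanned first; a hit ends the search. */
    if (long_term_lookup(addr, out))
        return;

    /* 2. Otherwise the bypass filter is scanned; on a hit the block may
     *    also be promoted to the long-term cache if the policy selects it. */
    if (filter_lookup(addr, out)) {
        if (insertion_policy_selects())
            long_term_insert(addr, out);
        return;
    }

    /* 3. Miss in both caches: fetch from external memory and place the
     *    block either in the long-term cache or in the filter. */
    external_fetch(addr, out);
    if (insertion_policy_selects())
        long_term_insert(addr, out);
    else
        filter_insert(addr, out);
}
```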
SPECIFIC EXEMPLARY EMBODIMENT & ASSOCIATED DATA
[0065] According to embodiments of the present invention where the cache is a CPU cache, the cache may be used by the central processing unit of a computer to reduce the average time to access memory. The cache is generally a smaller, faster memory which stores copies of the data from the most frequently used main memory locations. When the processor wishes to read from or write to a location in main memory, it may first check whether a copy of that data is in the cache. If so, the processor may immediately read from or write to the cache. [0066] According to some embodiments of the present invention, the memory reference workload of a CPU/cache combination may be characterized using a statistical phenomenon called mass-count disparity. There may be provided a random sampling Ll filtered cache that may use a simple coin toss to preferentially insert frequently used blocks into a long term cache, thus reducing the number of conflict misses in the cache; and the rest of the references may be serviced from the filter/cache itself, which filter/cache may be a small fully-associative auxiliary structure.
THEORETICAL BASIS: MASS-COUNT DISPARITY PHENOMENON
[0067] Locality of reference is a phenomenon in computer workloads. Temporal locality in particular may occur because of two properties of reference streams: that some addresses are much more popular than others, and that accesses are batched rather than being random. Importantly, references to blocks that are seldom accessed are also grouped together; such blocks are referred to as transient.
[0068] A good way to visualize skewed popularity is by using mass-count disparity plots. These plots may superimpose two distributions. The first, which is called the count distribution, is a distribution on addresses, and specifies how many times each address is referenced. Thus Fc(x) will represent the probability that an address is referenced x times or less. The second, called the mass distribution, is a distribution on references; it specifies the popularity of the address to which the reference pertains. Thus Fm(x) will represent the probability that a reference is directed at an address that is referenced x times or less.
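Restating the two prose definitions more formally (the notation r(a) for the number of references made to address a, N for the number of distinct addresses and R for the total number of references is ours, not the source's):

```latex
% Count distribution: fraction of addresses referenced at most x times.
% Mass distribution: fraction of references directed at such addresses.
F_c(x) = \frac{\left|\{\, a : r(a) \le x \,\}\right|}{N},
\qquad
F_m(x) = \frac{\sum_{a \,:\, r(a) \le x} r(a)}{R}.
```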
[0069] The above mentioned definitions consider all the references to each address, throughout the duration of the run. But since the relative popularity of different addresses may change in different phases of the computation, the instantaneous popularity may be more important for caching studies. A possible solution is to use a certain sliding window size, and only consider references made within this window. This in turn suffers from a dependence on the window size. One solution is not to count all the references to each address, but to count only the number of references made between a single insertion of a block into the cache, and its corresponding eviction — denoted as the cache residency length. Thus, if a certain block is referenced 100 times when it is brought into the cache for the first time, is then evicted, and finally is referenced again for 200 times when brought into the cache for the second time, we will consider this as two distinct cache residencies spanning 100 and 200 references, respectively, rather than a single block with 300 references. [0070] Fig. 5, shows the distributions of residency lengths and those of the references serviced by each residency length for 4 select SPEC2000 benchmarks using a 16K direct-mapped cache. The figure shows the distributions of both data and instruction streams.
[0071] The term mass-count disparity refers to the fact that the distributions of residency lengths (count) and the number of references serviced by each residency length (mass) may be quite distinct, as shown in Fig. 5. The divergence between the distributions can be quantified by the joint ratio, which is a generalization of, for example, the proverbial 20/80 principle: this is the unique point in the graphs where the sum of the two CDFs is 1. In the case of the vortex data stream graph for example, the joint ratio is approximately 13/87 (double-arrow at middle of plot). This means that 13% of the residencies, and more specifically the longest ones, get a full 87% of the references, whereas the remaining 87% of the residencies get only 13% of the references. Thus, a typical residency is only referenced a rather small number of times (up to about 10), whereas a typical reference is directed at a long cache residency (one that is accessed from 100 to 10,000 times). [0072] A W1/2 metric may be used to assess the combined weight of the half of the residencies that receive few references. For vortex, 50% of the residencies together get only 3% of the references (left down-pointing arrow). Thus these are instances of blocks that are inserted into the cache but hardly used, and should actually not be allowed to pollute the cache. Rather, the cache should be used preferentially to store longer residencies, such as those that together account for 50% of the references. The number of highly-referenced residencies servicing half the references is quantified by an N1/2 metric; for vortex it is less than 1% of all residencies (right up-pointing arrow).
[0073] The existence of significant mass-count disparity has consequences regarding sampling. Specifically, if one picks a block (or caching instance) at random, there is a good chance that it is seldom referenced. That is why random replacement is a possible eviction policy. But if one picks a reference at random, there is a good chance that this reference refers to a block that is referenced very many times. Thus, according to some embodiments of the present invention, random sampling of references may be expected to identify those blocks that are most deserving to be inserted into the cache. Such an insertion policy for L1 caches is described in greater detail below.
[0074] Mass-count disparity implies that a small fraction of all L1 cache residencies service the majority of references. Servicing these residencies from a fast, low-power, direct-mapped cache, while using an auxiliary buffer for short, transient residencies, may yield both performance and power gains, as the small number of long residencies will minimize the number of conflict misses to which direct-mapped caches are so susceptible. [0075] One approach to designing a residency predictor may be to use random sampling. By, for example, sampling references uniformly (a Bernoulli trial) with a relatively low probability P, short residencies will have a very low probability of being selected. But given that a single sample is enough to classify a residency as a long one, the probability that a residency is chosen within n references is 1 - (1-P)^n.
[0076] This converges exponentially to 1 as n increases. Importantly, implementing such a predictor does not require saving any state information for the blocks, since every random selection is independent of its predecessors. The only hardware required is a random number generator — a linear-feedback shift register, for example. This random sampling mechanism serves as the base for our cache design.
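A software analogue of such a generator is sketched below, purely for illustration: a 16-bit Fibonacci LFSR whose output is compared against an integer threshold approximating the success probability P. The tap positions, the seeding and the threshold arithmetic are assumptions of this sketch rather than details of the design.

```c
#include <stdbool.h>
#include <stdint.h>

/* 16-bit Fibonacci LFSR with taps at bits 16, 14, 13 and 11, a
 * maximal-length configuration cycling through 65535 nonzero states.
 * The state must be seeded with a nonzero value. */
static uint16_t lfsr_next(uint16_t *state)
{
    uint16_t s   = *state;
    uint16_t bit = (uint16_t)(((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u);
    *state = (uint16_t)((s >> 1) | (bit << 15));
    return *state;
}

/* Approximate a Bernoulli trial with success probability P by comparing
 * the LFSR output against an integer threshold of roughly P * 65536. */
static bool bernoulli_trial(uint16_t *state, uint16_t threshold)
{
    return lfsr_next(state) < threshold;
}
```

For P = 0.05, for instance, the threshold would be chosen as roughly 0.05 * 65536 ≈ 3277.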
EXEMPLARY EXPERIMENT A
[0077] The description of the following experiment illustrates exemplary embodiments of various aspects of the present invention. The present invention is not limited to any specifically recited feature of Experiment A. [0078] Based on the principles described in the previous section we introduce an L1 cache design that uses Bernoulli trials to distinguish frequently used blocks from transient ones. The proposed design, based on the dual cache paradigm, is depicted in Fig. 6. It consists of a direct-mapped cache preceded by a small, fully-associative cache/filter. When a memory access occurs, the data is first searched in the cache proper, and only if that misses is the filter searched. If the filter misses as well, the request is sent to the next level cache. In our experiments, we have used 16K and 32K (common L1 sizes) for the direct-mapped cache, and a 2K fully associative filter (all structures use 64B lines). Each memory reference that is serviced by the filter or by the next level cache initiates a Bernoulli trial with a predetermined success probability P to decide whether it should be promoted into the cache proper. Note that this arrangement enables a block fetched from the next level cache to skip the filter altogether and jump directly into the cache. This decision is made by the memory reference sampling unit (MRSU) which performs the Bernoulli trials, and writes the block to the cache if selected. In case the block is not selected, and was not already present in the filter, the MRSU inserts it into the filter. The MRSU can in fact perform the sampling itself even before the data is fetched, enabling it to perform any necessary eviction (either from the cache proper or the filter) beforehand, thus overlapping the two operations.
Experiment A: Methodology
[0079] To evaluate the concepts presented in this paper we have used a modified version of the SimpleScalar toolset. The modifications include replacing SimpleScalar's cache module, as well as fixing the code of its out-of-order simulator to accommodate a non-random-access L1 cache model, where a hit latency is not constant but rather depends on whether the target block was found in the filter or the cache proper. We have used the SPEC2000 benchmark suite compiled for the Alpha AXP architecture. All benchmarks were executed with the ref input set and were fast forwarded 15 billion instructions to skip any initialization code (except for vpr whose full run is shorter), and were then executed for another 2 billion instructions. Power estimates were compiled using CACTI 4.1.
EXPERIMENT A: RESULTS
The Effects of Random Sampling
[0080] References to frequently-used blocks are numerous, but involve only a relatively small number of distinct blocks. This reduces the number of conflict misses, enabling the use of a low-latency, low-energy, direct-mapped cache structure. On the other hand, transient residencies compose the majority of residencies, but naturally have a shorter cache lifetime. Therefore, they can be served by a smaller, fully-associative (and expensive) structure. However, aggressive filtering might be counterproductive: if too many blocks are serviced from the filter and not promoted to the cache proper, the filter can become a bottleneck and degrade performance. This section is therefore dedicated to evaluating the effectiveness of probabilistic filtering, while exploring the statistical design space. The selected parameters are then used to evaluate performance and power consumption (below).
1 Impact on Miss-Rate
[0081] First, we address the effects of filtering on the overall miss-rate in order to determine the Bernoulli probabilities that yield the best cache performance. Fig. 9 shows the distributions of the miss-rate achieved by a filtered 16K DM cache (both cache and filter misses) compared to that achieved by a regular 16K direct-mapped cache, for various Bernoulli success probabilities. Lower values are better, indicating decreased miss rate. The data shown for each combination are a summary of the observed change in miss rate over all benchmarks simulated: the distribution's middle range (25%-75%), average, median and min/max values. An ideal combination would yield maximal overall miss-rate reduction with a dense distribution, i.e. small differences between the 25%-75% percentiles and min-max values, as a denser distribution indicates more consistent results over all benchmarks. [0082] Figure 9 shows that the best average reduction in data miss-rate is ~25%, and may be achieved for P values of 0.05 to 0.2. Moreover, this average improvement is not the result of a single benchmark skewing the distribution: when comparing the center of these distributions— the 25%-75% box— we can see the entire distribution is moved downwards. The same can be said about the miss-rate reduction in the instruction stream, for which selection probabilities of 0.0001 to 0.01 all achieve an average improvement of ~60%. In this case as well the best averages may be achieved for probabilities that shift the entire distribution downwards. [0083] The fact that a similar improvement is achieved over a range of probabilities, for both data and instructions, indicates that using a static selection probability is a reasonable choice, especially as it eliminates the need to add a dynamic tuning mechanism. We therefore chose sampling probabilities of 0.05 and 0.0005 for the data and instruction streams, respectively, for the 16K cache configuration. In a similar manner, probabilities of 0.1 and 0.0005 were selected for the data and instruction streams, respectively, for the 32K configuration.
[0084] Interestingly, the data and instruction streams require different Bernoulli success probabilities — with an order of magnitude difference! The reason for this is the fact that the instruction memory blocks are usually accessed over an order of magnitude more times compared to data blocks. In the benchmarks shown in Fig. 5, 50% of the data memory blocks are accessed 1-2 times while in the cache, whereas the same percentile of instruction blocks are accessed 10-15 times. This difference is mainly attributed to the fact that instruction memory blocks are mostly read sequentially as blocks of instructions.
Impact on Reference Distribution
[0085] As noted above, random sampling is aimed at splitting the references stream into two components— one consisting of long cache residencies, and another consisting of short transient ones. In this section we conduct a qualitative analysis of the effectiveness of random sampling in splitting the distribution of memory references. Fig. 10 compares distributions of reference masses— the fraction of references serviced by each residency length of the filtered 16K cache and the original 16K direct-mapped cache. Results are shown for select SPEC benchmarks with Bernoulli success probabilities of 0.05 for data streams and 0.0005 for instruction streams. These probabilities were chosen based on the results described in Sect. 1.
[0086] Each plot shows three lines: the distributions for the cache and filter for the filtered design, and the distribution for a conventional direct-mapped cache— which is the combination of the first two (this is the same distribution as the one shown in Fig. 5). The median value of each distribution is marked with a down pointing arrow. Invariably, the distributions show that the majority of references directed at the filter, are serviced by residencies much shorter than those serving the majority of the references directed at the cache proper.
[0087] To estimate the difference between the two resulting distributions we used two intuitive metrics: median ratio (marked with a horizontal double arrow) and false-* equilibrium (marked with a vertical double arrow). [0088] The first metric is the ratio between the median values of the cache and the filter: ratio = median_cache / median_filter. This metric is used to quantify the distinction between the two distributions, thereby evaluating the effectiveness of random selection to distinguish shorter residencies— which should stay in the filter — from longer ones that should be promoted into the cache proper. [0089] In the benchmarks shown, the median ratios range from 100 to
10,000. In fact, the average median ratio for all data streams is ~320, and ~50,000 for the instruction streams — indicating a clear distinction between residency lengths in the cache and the filter.
[0090] The second metric is denoted as the false-* equilibrium, and is an estimate of false predictions: Any given residency length threshold we choose in hindsight will show up on the plot as a vertical line, with a fraction of the cache's distribution to its left indicating the false-positives (short residencies promoted to the cache), and a fraction of the filter's distribution to its right indicating the false-negatives (long residencies remaining in the filter). Obviously, choosing another threshold will either: increase the fraction of false-positives and decrease the fraction of false-negatives or vice versa. The false-* equilibrium is a unique threshold that if chosen, generates equal percentages of false-positives and false-negatives, thereby serving as an upper bound for overall percentage of false predictions. [0091] For example, if we examine vortex's data stream we see that the false-* equilibrium point stands at a residency length of ~20 and generates ~6% false predictions (~2% for the instruction stream). The false prediction rate for facerec's data stream was found to be ~15%, with a negligible number of false predictions for the instruction stream. The overall average percentage of false predictions for the data streams was found to be ~13%, with ~2% for the instruction streams — a fairly good upper bound considering it is based on true random sampling.
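One way to locate this equilibrium numerically, assuming the two cumulative distributions have been sampled at a common set of residency-length thresholds, is sketched below; the array representation and the linear scan are illustrative choices of ours, as the text defines the metric only in terms of the plotted CDFs.

```c
#include <math.h>
#include <stddef.h>

/* cache_cdf[i]  : fraction of cache-proper residencies of length <= t[i]
 *                 (the would-be false positives for threshold t[i]).
 * filter_cdf[i] : fraction of filter residencies of length <= t[i];
 *                 1 - filter_cdf[i] are the would-be false negatives.
 * Returns the index where the two error fractions are (nearly) equal,
 * i.e. the false-* equilibrium point. */
static size_t false_star_equilibrium(const double *cache_cdf,
                                     const double *filter_cdf,
                                     size_t n)
{
    size_t best = 0;
    double best_gap = INFINITY;

    for (size_t i = 0; i < n; i++) {
        double false_pos = cache_cdf[i];
        double false_neg = 1.0 - filter_cdf[i];
        double gap = fabs(false_pos - false_neg);
        if (gap < best_gap) {
            best_gap = gap;
            best = i;
        }
    }
    return best;   /* equilibrium error rate is cache_cdf[best] */
}
```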
[0092] Another aspect of the reference distributions is the number of references accounted for by each distribution, compared with the number of residencies served by the cache and the filter. Fig. 11 shows the percentage of references serviced by the cache, compared with the percentage of blocks promoted into the cache, for various probabilities. Considering the mass-count disparity we expect that promoting frequently accessed blocks into the cache will result in a substantial increase in the number of references it will service, and that promoting not-so-frequently used blocks has a smaller impact on the number of references serviced by the cache. This is indeed evident in Fig.
11: when increasing the success probabilities we see a distinctive increase in the number of references serviced by the cache, until some level — indicated by the horizontal line — where this increase slows dramatically and promoting more blocks into the cache hardly increases the cache's hit-rate. In our case this saturation occurs at P = 0.2 for the data and P = 0.05 for the instructions. Beyond these probabilities the promoted blocks are mostly transient blocks and we start experiencing diminishing returns.
[0093] In summary, we see that random sampling is very effective in splitting the distribution of references into two distinct components— one composed mainly of frequently used blocks, and the other composed mainly of transient blocks.
The Set Look-aside Buffer
[0094] A fully-associative filter introduces longer access latencies and increased power consumption. We therefore suggest a set look-aside buffer (SLB) to cache recent lookup results. The SLB is a small direct-mapped cache structure mapping block tags directly to the filter's SRAM-based data store, thus avoiding the majority of the costly CAM lookups, while still maintaining fully-associative semantics. This section explores the SLB design space. [0095] Fig. 12 shows the stack depth distributions of filter accesses, for the different SPEC benchmarks, as well as the average distribution over all benchmarks (the various benchmarks are not individually marked as only the clustering of distributions matters in this context). Clearly, the vast majority of accesses target recently used blocks — in fact, on average ~94% of accesses are to stack depths of 8 or less, out of a total of 32 lines in the filter. In our experiments, we have found that using an 8 entry SLB achieves an average of ~78% hit-rate for the data stream (~83% median) and over 97% for the instruction stream (~97% median) for a 2K filter. Doubling the SLB size to 16 entries only improved the average data hit-rate to ~84% (~89% median) and ~99% for the instruction stream, but increased the dynamic power consumption by ~10% and the leakage by ~50% (with similar results for the 32K configuration).
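The SLB idea can be sketched as follows: a small direct-mapped table maps a block tag to the index of the matching line in the filter's data store, so that most accesses skip the CAM search altogether. The 8-entry size follows the evaluation above, while the modulo hashing of the tag and the field widths are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLB_ENTRIES 8u     /* size chosen in the evaluation above */

typedef struct {
    bool     valid;
    uint32_t block_tag;    /* tag of a recently accessed filter block     */
    uint8_t  filter_index; /* where that block sits in the filter's SRAM  */
} slb_entry_t;

static slb_entry_t slb[SLB_ENTRIES];

/* Direct-mapped probe: if the tag matches, return the cached filter
 * index and avoid the fully-associative CAM lookup entirely. */
static bool slb_probe(uint32_t block_tag, uint8_t *filter_index)
{
    slb_entry_t *e = &slb[block_tag % SLB_ENTRIES];
    if (e->valid && e->block_tag == block_tag) {
        *filter_index = e->filter_index;
        return true;
    }
    return false;
}

/* After a CAM lookup (an SLB miss), remember the result for next time. */
static void slb_fill(uint32_t block_tag, uint8_t filter_index)
{
    slb_entry_t *e = &slb[block_tag % SLB_ENTRIES];
    e->valid        = true;
    e->block_tag    = block_tag;
    e->filter_index = filter_index;
}
```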
[0096] We have therefore used an 8 entry SLB in our power and performance evaluation, eliminating almost 80% of the costly filter CAM lookups.
Impact on Power and Performance
[0097] The reduced miss-rate achieved by the random sampling design, combined with a low-latency, low-power, direct-mapped cache potentially offers both improved performance and reduced power consumption. Augmenting the fully-associative filter with an SLB can reduce the overheads incurred by the filter, further improving efficiency.
[0098] Using the SimpleScalar toolset [1] for out-of-order simulations we have compared the performance achieved by direct-mapped filtered caches against various set-associative caches. Our micro-architecture consisted of a 4-wide superscalar design, whose parameters are listed in Table 1. The hit latency incurred by the direct-mapped Ll cache was set to 1 cycle for a cache hit and 3 cycles for a filter hit. If the request block is found in the SLB then no CAM lookup is necessary— enabling direct SRAM access and a total 2 cycles filter latency. The hit latency incurred by set-associative caches was set to 2 cycles. For fully-associative caches we used an unrealistically fast 2 cycle latency — same as set-associative — placing both on a similar baseline thus focusing on the reduced miss-rates achieved by fully-associative caches. [0099] Fig. 13 shows the IPC improvement achieved by a random sampling cache over a similar size 4-way associative cache, for the SPEC benchmarks. The figure shows consistent improvements (up to ~35% for a 16K random sampling cache and ~28% for 32K one), with an average overall IPC improvement of over ~10% for both 16K and 32K configurations.
* L1 latency is 2 cycles for set-associative and fully-associative caches
Bernoulli probabilities
Size Data Instruction
16K P = 0.05 P = 0.0005
32K P = 0.1 P = 0.0005
Table 1. Micro-architecture and cache configurations used in the out-of-order simulations
[00100] Fig. 14 compares the average performance achieved with 16K and 32K random sampling caches to that of common cache structures. It shows that a direct-mapped random sampling filtered cache achieves significantly better performance than a similar size set-associative cache. Moreover, a random sampling cache can even gain better overall performance than larger, more expensive caches. For example, the IPC of a 16K-DM random sampling cache is 5% higher than that of a 32K-4way cache; a 32K-DM random sampling cache is more than 7% higher than a 64K-4way cache. Likewise, using the extra 2K for a filter yields better performance than using it as a victim buffer, indicating that even such a relatively large victim buffer may be swamped by transient blocks.
[00101] Interestingly, the IPC improvement is similar when comparing the 16K-DM random sampling cache to both a regular 16K-DM cache and a 16K-
4way set-associative cache, indicating similar performance achieved by the latter two.
[00102] The reason for this similarity is that while the direct-mapped cache suffers from a higher miss-rate compared to the 4-way set-associative cache, it compensates with its lower access latency. This is even more evident when considering the larger 32K and 64K caches, where the direct-mapped configuration takes the lead. When doubling the cache size from the 32K to 64K the number of cache sets doubles, thus reducing the number of conflicts and allowing the direct-mapped cache's lower latency to prevail. [00103] Next, we compare the power consumption of the random sampling cache with that of the other configurations. Using independent random sampling eliminates the need to maintain any previous reuse information, reducing the power consumption calculation to averaging the energies consumed by the combination of a direct-mapped cache, a fully-associative filter, and a small, direct-mapped SLB. All power consumption estimates are based on the CACTI 4.1 power model.
[00104] The average dynamic energy consumption is simply the aggregate energy (the sum over all components of the number of accesses times the access energy) divided by the overall number of hits. Even simpler, the leakage power consumed by the random sampling cache is the sum of the leakage power consumed by all components.
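Written out, with the cache proper, the filter and the SLB as the components, and with symbols that are ours rather than the source's, the two quantities are:

```latex
E_{\mathrm{dyn}} =
  \frac{\sum_{i \in \{\mathrm{cache},\,\mathrm{filter},\,\mathrm{SLB}\}}
        N^{\mathrm{access}}_{i}\, E^{\mathrm{access}}_{i}}
       {N_{\mathrm{hits}}},
\qquad
P_{\mathrm{leak}} =
  \sum_{i \in \{\mathrm{cache},\,\mathrm{filter},\,\mathrm{SLB}\}} P^{\mathrm{leak}}_{i}.
```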
[00105] Fig. 15 shows both dynamic read energy and leakage power consumed by the random sampling cache, compared to common cache configurations (same as those in Fig. 14). Obviously, the power consumed by the random sampling cache is higher than that of a simple direct-mapped cache, because of the fully-associative filter: ~30% more dynamic energy and ~15% more leakage power for a 16K random sampling cache (and just over half that for a 32K cache). However, when comparing a random sampling cache to a more common 4-way associative cache of a similar size, the 16K random sampling cache design consumes almost 70%-80% less dynamic energy, with only 5% more leakage power. The 32K configuration yields 60%-
70% reduction in dynamic energy, with no increase in leakage. [00106] However, the main contribution of a random sampling cache is apparent when compared to a set-associative cache double its size: both the 16K and 32K random sampling caches consume 70%-80% less dynamic energy, and 40%-50% less leakage than 32K and 64K 4-way set-associative caches, respectively — while yielding better performance as shown in Fig. 14. [00107] In summary, this section shows that a random sampling direct- mapped cache offers performance superior to that of a double sized set- associative cache, while consuming considerably less power— both dynamic and static.
SUMMARY OF RESULTS
[00108] According to some embodiments of the present invention, there is provided a method of distinguishing transient blocks from frequently used blocks, thereby enabling servicing references to transient blocks from a small fully-associative auxiliary cache structure. By inserting only frequently used blocks into the main cache structure the number of conflict misses may be reduced, thus allowing for the use of direct mapped caches for the main cache. A probabilistic filtering mechanism may use random sampling to identify and select the frequently used blocks. By using a small direct-mapped lookup table to cache the most recently accessed blocks in the auxiliary cache, fully-associative lookups may be reduced. According to some embodiments of the present invention a 16K direct-mapped Ll cache, augmented with a fully- associative 2K filter, arranged and operated according to some embodiments of the present invention may achieve on average over 10% more instructions per cycle than a regular 16K, 4-way set-associative cache, and even ~5% more IPC than a 32K, 4-way cache, while consuming 70%-80% less dynamic power than either of them. As should be clear to one of skill in the art, similar and/or better results may be achieved for large caches.
[00109] While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

Claims

CLAIMS
What is claimed:
1. A data cache for providing caching to a caching client, said data cache comprising: a first cache portion adapted to be operated according to a first caching algorithm; a second cache portion adapted to be operated according to a second caching algorithm; and cache control logic adapted to insert data requested by the caching client into said second caching portion based on either a probabilistic insertion policy or on a predefined sampling pattern.
2. The data cache according to claim 1, wherein said first and second caches are each comprised of at least one set of data cells, and wherein at least some of the data cells are adapted to store data retrieved from an external data source along with information indicating a location in the external data source from which the cached data was retrieved.
3. The data cache according to claim 2, wherein upon receiving a data block request from the caching client, said control logic is adapted to first scan said second cache portion for the requested data.
4. The data cache according to claim 3, wherein said control logic is adapted to scan said first cache portion for the requested data if it is not found in the second cache.
5. The data cache according to claim 1, wherein said first cache includes a recent activity sub-buffer.
6. The data cache according to claim 5, wherein said control logic is adapted to check the recent activity sub-buffer of said first cache.
7. The data cache according to claim 1, wherein said cache control logic is adapted to insert data requested by the caching client into said second caching portion based on a probabilistic insertion policy.
8. The data cache according to claim 1, wherein said cache control logic is adapted to insert data requested by the caching client into said second caching portion based on a predefined sampling pattern.
9. A data cache for providing caching to a caching client, said data cache comprising:
At least one set of data cells, at least some of which data cells are adapted to store cached data retrieved from an external data source; and a sub-buffer adapted to store information regarding at least some of the prior operations performed on the data cache and at least one pointer to data operated upon in the cache.
10. The data cache according to claim 9, wherein at least some of said data cells are also adapted to store information relating to a location in external data source from which the cached data was retrieved.
11. A method for providing caching to a caching client, said method comprising: determining whether to insert data requested by the caching client into either a first or a second caching portions based on a probabilistic insertion policy or based on a predefined sampling pattern.
12. The method according to claim 11, further comprising upon receiving a data block request from the caching client, first scanning the second cache portion for the requested data.
13. The method according to claim 12, further comprising scanning the first cache portion for the requested data if it is not found in the second cache.
14. The method according to claim 11, wherein determining is based on a probabilistic insertion policy.
15. The method according to claim 11, wherein determining is based on a predefined sampling pattern.
16. A cache controller comprising: logic adapted to insert data requested by a caching client into a long- term cache based on a probabilistic insertion policy or on a predefined sampling policy.
17. The cache controller according to claim 16, wherein said control logic is further adapted to first check the long-term cache upon receiving a request for a data block from the caching client.
18. The cache controller according to claim 17, wherein said caching logic is further adapted to check a bypass cache in event request data is not found in the long-term cache.
19. The cache controller according to claim 18, wherein said control logic is further adapted to retrieve the requested data from external memory if the requested data is not found in either the bypass or long-term caches.
20. The cache controller according to claim 16, wherein said control logic is further adapted to execute a probabilistic insertion policy each time requested data is found outside the long-term cache.
21. The cache controller according to claim 16, wherein said control logic is further adapted to execute an insertion policy based on a predefined sampling pattern each time requested data is found outside the long- term cache.
22. A method for inserting requested data into a long-term cache, said method comprising: executing either a probabilistic insertion policy or an insertion policy based on a predefined sampling pattern each time the requested data is found outside the long-term cache.
23. The method according to claim 22, wherein a probabilistic insertion policy is executed.
24. The method according to claim 22, wherein a predefined sampling pattern based insertion policy is executed.
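To make the lookup order and insertion policies recited in claims 11-24 easier to follow, the sketch below shows one possible behavioural model of a controller that manages a small bypass portion and a long-term portion, checks them in the claimed order, and promotes data into the long-term portion either probabilistically or on a fixed sampling pattern. This is only an illustrative reading of the claims, not the claimed implementation; the class and parameter names (BypassLongTermCache, insert_probability, sample_period, and the dict-like backing_store) are assumptions introduced for this example, and Python is used purely for readability.

```python
import random
from collections import OrderedDict

class BypassLongTermCache:
    """Illustrative model only: a small bypass portion plus a long-term portion.

    Lookups check the long-term portion first, then the bypass portion, and
    finally the external backing store. Data found outside the long-term
    portion is promoted into it either probabilistically or on a predefined
    sampling pattern (every Nth miss).
    """

    def __init__(self, backing_store, bypass_size=8, long_term_size=64,
                 insert_probability=0.1, sample_period=None):
        self.backing_store = backing_store      # dict-like external data source
        self.bypass = OrderedDict()             # recently fetched blocks
        self.long_term = OrderedDict()          # blocks selected for retention
        self.bypass_size = bypass_size
        self.long_term_size = long_term_size
        self.insert_probability = insert_probability
        self.sample_period = sample_period      # e.g. 16 -> insert every 16th miss
        self.miss_count = 0

    def _should_insert(self):
        # Use the sampling-pattern policy when configured; otherwise fall
        # back to the probabilistic policy.
        if self.sample_period is not None:
            return self.miss_count % self.sample_period == 0
        return random.random() < self.insert_probability

    def read(self, address):
        # 1. Check the long-term portion first.
        if address in self.long_term:
            self.long_term.move_to_end(address)          # refresh LRU position
            return self.long_term[address]

        # 2. Not in the long-term portion: check the bypass portion.
        if address in self.bypass:
            datum = self.bypass[address]
        else:
            # 3. Not cached at all: fetch from the external data source.
            datum = self.backing_store[address]
            self.bypass[address] = datum
            if len(self.bypass) > self.bypass_size:
                self.bypass.popitem(last=False)          # FIFO eviction

        # Requested data was found outside the long-term portion:
        # apply the insertion policy.
        self.miss_count += 1
        if self._should_insert():
            self.long_term[address] = datum
            if len(self.long_term) > self.long_term_size:
                self.long_term.popitem(last=False)       # evict least recently used
        return datum
```

Under this reading, a long streaming scan mostly stays in the small bypass portion, while repeatedly requested blocks get more and more chances to be promoted, so the long-term portion tends to fill with frequently reused data. For instance, constructing the model with sample_period=16 would correspond to the sampling-pattern variant of claims 8, 15 and 21, while leaving sample_period unset exercises the probabilistic variant of claims 7, 14 and 20.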
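The recent-activity sub-buffer of claims 5, 6, 9 and 10 can likewise be pictured as a small bounded log kept alongside the data cells, recording recent operations together with pointers to the entries they touched. The sketch below is again only an assumed illustration; the names ActivityRecord and RecentActivityBuffer are invented for this example and are not the structure disclosed in the application.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ActivityRecord:
    operation: str      # e.g. "read", "write", "insert", "evict"
    address: int        # identifier of the data operated upon
    entry_index: int    # pointer to that data's location among the data cells

class RecentActivityBuffer:
    """Bounded log of recent cache operations plus pointers to the affected entries."""

    def __init__(self, capacity=32):
        self.records = deque(maxlen=capacity)   # oldest records fall off automatically

    def log(self, operation, address, entry_index):
        self.records.append(ActivityRecord(operation, address, entry_index))

    def recently_touched(self, address):
        # Lets control logic consult recent activity before scanning the cache proper.
        return any(r.address == address for r in self.records)
```

Control logic of the kind recited in claim 6 could consult recently_touched(address) before scanning the first cache portion, using the logged pointers to jump directly to entries that were operated upon a short time ago.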
PCT/IL2008/000750 2007-06-04 2008-06-03 Method architecture circuit & system for providing caching WO2008149348A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US94177907P 2007-06-04 2007-06-04
US60/941,779 2007-06-04

Publications (2)

Publication Number Publication Date
WO2008149348A2 2008-12-11
WO2008149348A3 2010-02-25

Family

ID=40094274

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2008/000750 WO2008149348A2 (en) 2007-06-04 2008-06-03 Method architecture circuit & system for providing caching

Country Status (1)

Country Link
WO (1) WO2008149348A2 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6349365B1 (en) * 1999-10-08 2002-02-19 Advanced Micro Devices, Inc. User-prioritized cache replacement
US20060236020A1 (en) * 2003-03-11 2006-10-19 Taylor Michael D Cache memory architecture and associated microprocessor and system controller designs
US20060015686A1 (en) * 2004-07-14 2006-01-19 Silicon Optix Inc. Cache memory management system and method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8897766B2 (en) 2013-02-19 2014-11-25 International Business Machines Corporation System of edge byte caching for cellular networks
US20180189192A1 (en) * 2016-12-29 2018-07-05 Intel Corporation Multi level system memory having different caching structures and memory controller that supports concurrent look-up into the different caching structures
US10915453B2 (en) * 2016-12-29 2021-02-09 Intel Corporation Multi level system memory having different caching structures and memory controller that supports concurrent look-up into the different caching structures
US11416405B1 (en) 2020-02-07 2022-08-16 Marvell Asia Pte Ltd System and method for mapping memory addresses to locations in set-associative caches
US11620225B1 (en) 2020-02-07 2023-04-04 Marvell Asia Pte Ltd System and method for mapping memory addresses to locations in set-associative caches
US11836053B2 (en) 2021-09-27 2023-12-05 Hewlett Packard Enterprise Development Lp Resource allocation for synthetic backups

Also Published As

Publication number Publication date
WO2008149348A3 (en) 2010-02-25

Similar Documents

Publication Publication Date Title
US10409725B2 (en) Management of shared pipeline resource usage based on level information
Seshadri et al. The dirty-block index
US6983356B2 (en) High performance memory device-state aware chipset prefetcher
Chaudhuri Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches
US7454573B2 (en) Cost-conscious pre-emptive cache line displacement and relocation mechanisms
US6823428B2 (en) Preventing cache floods from sequential streams
US8041897B2 (en) Cache management within a data processing apparatus
US6212602B1 (en) Cache tag caching
JP6009589B2 (en) Apparatus and method for reducing castout in a multi-level cache hierarchy
US20080022049A1 (en) Dynamically re-classifying data in a shared cache
US20140281248A1 (en) Read-write partitioning of cache memory
Gupta et al. Adaptive cache bypassing for inclusive last level caches
Basu et al. Scavenger: A new last level cache architecture with global block priority
US20180300258A1 (en) Access rank aware cache replacement policy
US20100011165A1 (en) Cache management systems and methods
Gaur et al. Base-victim compression: An opportunistic cache compression architecture
Das et al. SLIP: reducing wire energy in the memory hierarchy
WO2001088716A1 (en) Method for controlling cache system comprising direct-mapped cache and fully-associative buffer
WO2008149348A2 (en) Method architecture circuit & system for providing caching
US11036639B2 (en) Cache apparatus and method that facilitates a reduction in energy consumption through use of first and second data arrays
Lee et al. A new cache architecture based on temporal and spatial locality
Backes et al. The impact of cache inclusion policies on cache management techniques
Hameed et al. Rethinking on-chip DRAM cache for simultaneous performance and energy optimization
Olson et al. Revisiting stack caches for energy efficiency
US6601155B2 (en) Hot way caches: an energy saving technique for high performance caches

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 08763508; Country of ref document: EP; Kind code of ref document: A2
NENP Non-entry into the national phase in:
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 08763508; Country of ref document: EP; Kind code of ref document: A2