WO2013180714A1 - Local error detection and global error correction - Google Patents

Local error detection and global error correction

Info

Publication number
WO2013180714A1
Authority
WO
WIPO (PCT)
Prior art keywords
gec
error
cache line
data
information
Prior art date
Application number
PCT/US2012/040108
Other languages
French (fr)
Inventor
Aniruddha Nagendran UDIPI
Naveen Muralimanohar
Norman Paul Jouppi
Alan Lynn Davis
Rajeev Balasubramonian
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P. filed Critical Hewlett-Packard Development Company, L.P.
Priority to US14/396,327 priority Critical patent/US9600359B2/en
Priority to PCT/US2012/040108 priority patent/WO2013180714A1/en
Priority to KR1020147030518A priority patent/KR101684045B1/en
Priority to CN201280072858.8A priority patent/CN104246898B/en
Priority to EP12877868.5A priority patent/EP2856471A4/en
Priority to TW102117744A priority patent/TWI501251B/en
Publication of WO2013180714A1 publication Critical patent/WO2013180714A1/en

Links

Classifications

    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/38Response verification devices
    • G11C29/42Response verification devices using error correcting codes [ECC] or parity check
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1012Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using codes or arrangements adapted for a specific type of error
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1044Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1064Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices in cache or content addressable memories
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44Indication or identification of errors, e.g. for repair
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44Indication or identification of errors, e.g. for repair
    • G11C29/4401Indication or identification of errors, e.g. for repair for self repair
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G06F11/108Parity data distribution in semiconductor storages, e.g. in SSD
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C2029/0409Online test
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C2029/0411Online error correction

Definitions

  • dual in-line memory modules with narrow chips (i.e., x4 I/O DRAM chips) consume more energy than those with wider I/O chips (i.e., x8, x16, or x32 chips).
  • FIG. 1 is a block diagram of a system including a memory controller according to an example.
  • FIG. 2 is a block diagram of a memory module according to an example.
  • FIG. 3 is a block diagram of a memory module rank according to an example.
  • FIG. 4 is a block diagram of a cache line including a surplus bit according to an example.
  • FIG. 5 is a flow chart based on checking data fidelity according to an example.
  • FIG. 6 is a flow chart based on performing error detection and/or correction according to an example.
  • Examples described herein can use a two-tier protection scheme that separates out error detection and error correction functionality. Codes, such as those based on checksum and parity, can be used effectively to provide strong fault-tolerance with little or no overhead.
  • Storage such as system firmware, may be used to direct a memory controller to store some correction codes in DRAM data memory. The memory controller may be modified to handle data mapping, error detection, and correction. Novel application of error detection/correction codes, and novel physical data mapping to memory, can allow a commodity memory module (e.g., ECC DIMM x4, x8, x16, x32 etc.) to provide chipkill functionality without increasing the fetch width and/or the storage overhead.
  • locality and DRAM row buffer hit rates may be further improved by placing the data and the ECC codes in the same row buffer.
  • an effective fault-tolerance mechanism is provided, enabling strong reliability guarantees, activating as few chips as possible to conserve energy and improve performance, reducing circuit complexity, and working with wide I/O DRAM chips such as x8, x16, or x32.
  • FIG. 1 is a block diagram of a system 100 including a memory controller 102 according to an example.
  • System 100, in response to a memory read operation 140, is to apply local error detection 120 and/or global error correction 130 to detect and/or correct an error 104 of a cache line segment 119 of a rank 112 of memory.
  • system 100 is to compute local error detection (LED) 120 information per cache line segment 119 of data.
  • the cache line segment 119 is associated with a rank 112 of memory.
  • the LED 120 is to be computed based on an error detection code.
  • the system 100 is to generate a global error correction (GEC) for the cache line segment, based on a global parity.
  • the system 100 is to check data fidelity in response to memory read operation 140, based on the LED 120 information, to identify a presence of an error 104 and the location of the error 104 among cache line segments 119 of the rank 112.
  • the system 100 is to correct the cache line segment 119 having the error 104 based on the GEC, in response to identifying the error 104.
  • system 100 is to perform local error detection (LED) 120 in response to a memory read operation 140, based on a checksum computed over a cache line segment 119, to detect a location of an error 104 at a chip-granularity among N data chips in a rank 112.
  • the system 100 is to perform global error correction (GEC) 130 over the cache line segment 119 on the N data chips in the rank 112 in response to detecting the error 104.
  • the system 100 is to perform the GEC 130 using a global parity to generate GEC information, and reconstruct data segments 119 having the error 104, based on error-free segments and the GEC information.
  • system 100 may use simple checksums and parity operations to build a two-layer fault tolerance mechanism, at a level of granularity down to a segment 119.
  • the first layer of protection is local error detection (LED) 120, a check (e.g., an immediate check that follows a read operation 140) to verify data fidelity.
  • the LED 120 can provide chip-level error detection (for chipkill, i.e., the ability to withstand the failure of an entire DRAM chip), by distributing LED information 120 across a plurality of chips in a memory module.
  • the LED information 120 may be associated, not only with each cache line as a whole, but with every cache line "segment," i.e., the fraction of the line present in a single chip in the rank.
  • a relatively short checksum (1's complement, Fletcher's sums, or other) may be used as the error detection code, and may be computed over the segment and appended to the data.
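As an illustrative sketch of this first tier (Python; the 57-bit segment and 7-bit checksum widths are taken from the example x8, burst-of-8 layout described later, and the function names are hypothetical), a 1's complement additive checksum might look like:

```python
def led_checksum(segment: int, data_bits: int = 57, word: int = 7) -> int:
    """1's complement additive checksum over `word`-bit chunks of a segment.

    Widths (57 data bits, 7 check bits per chip) follow the example layout;
    the specific checksum is one of the options named in the text.
    """
    mask = (1 << word) - 1
    s = 0
    for i in range(0, data_bits, word):
        s += (segment >> i) & mask
    # Fold carries back in (the end-around carry of 1's complement addition).
    while s >> word:
        s = (s & mask) + (s >> word)
    return s

def led_detects(segment: int, stored_checksum: int) -> bool:
    """An error is flagged when the recomputed checksum mismatches the stored one."""
    return led_checksum(segment) != stored_checksum
```

Because the checksum is computed only over bits of one segment, it can be verified immediately on every read without touching any other chip in the rank.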
  • the error detection code may be based on other types of error detection and/or error protection codes, such as cyclic redundancy check (CRC), Bose, Ray-Chaudhuri, and Hocquenghem (BCH) codes, and so on.
  • the second layer of protection, the Global Error Correction (GEC) 130, may then be applied.
  • the GEC 130 may be based on a parity, such as an XOR-based global parity across the data segments 119 on the N data chips in the rank 112.
  • the GEC 130 also may be based on other error detection and/or error protection codes, such as CRC, BCH, and others.
  • the GEC results may be stored in either the same row as the data segments, or in a separate row that is to contain GEC information for several data rows. Data may be reconstructed based on reading out the fault-free segments and the GEC segment, and location information (e.g., an identification of the failed chip based on the LED 120).
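A minimal sketch of this second tier (Python; the nine-segment, XOR-parity scheme follows the example rank, while function names are illustrative):

```python
def gec_parity(segments: list[int]) -> int:
    """Column-wise XOR parity across the cache line segments of a rank."""
    parity = 0
    for seg in segments:
        parity ^= seg
    return parity

def reconstruct_segment(segments: list[int], failed: int, parity: int) -> int:
    """Rebuild the segment flagged by the LED: XOR the stored parity with
    every error-free segment; the failed chip's (possibly corrupt) data is
    simply skipped."""
    value = parity
    for i, seg in enumerate(segments):
        if i != failed:
            value ^= seg
    return value
```

Because XOR parity is its own inverse, reconstruction is the same operation as generation with the faulty segment left out, which is why recovery needs the error location from the LED but no field arithmetic.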
  • the LED 120 and GEC 130 may be computed over the data words in a single cache line. Thus, when a dirty line is to be written back to memory from the processor, there is no need to perform a "read-before-write," and both codes can be computed directly, thereby avoiding impacts to write performance. Furthermore, LED 120 and/or GEC 130 may be stored in regular data memory, in view of a commodity memory system that may provide limited redundant storage for Error-Correcting Code (ECC) purposes. An additional read/write operation may be used to access this information along with the processor-requested read/write. Storing LED information in the provided storage space within each row may enable it to be read and written in tandem with the data line. GEC information can be stored in data memory in a separate cache line since it can be accessed in the very rare case of an erroneous data read. Appropriate data mapping can locate this in the same row buffer as the data to increase locality and hit rates.
  • the memory controller 102 may provide data mapping, LED 120/GEC 130 computation and verification, perform additional reads if required, etc.
  • system 100 may provide full functionality transparently, without a need to notify and/or modify an Operating System (OS) or other computing system components. Setting apart some data memory to store LED 120/GEC 130 may be handled through minor modifications associated with system firmware, e.g., reducing a reported amount of available memory storage to accommodate the stored LED 120/GEC 130 transparently from the OS and application perspective.
  • FIG. 2 is a block diagram of a memory module 210 according to an example.
  • the memory module 210 may interface with memory controller 202.
  • SDRAM Synchronous Dynamic Random Access Memory
  • DIMM dual in-line memory module
  • a rank may be divided into, e.g., 4-16 banks.
  • the portion of each rank 212/bank 214 in a chip 216 is a segment 219.
  • the chips 216 in the rank 212 are activated and each segment 219 contributes a portion of the requested cache line.
  • a cache line is striped across multiple chips 216.
  • the cache line transfer can be realized based on a burst of 8 data transfers.
  • a chip may be an xN part, e.g., x4, x8, x16, x32, etc.
  • Each segment of a bank 214 may be partitioned into N arrays 218 (four are shown).
  • Each array 218 can contribute a single bit to the N-bit transfer on the data I/O pins for that chip 216.
  • employing wider I/O DRAM parts such as x8, x16, or x32 may decrease the number of DRAM chips 216 needed to achieve a given data bus width, creating extra space on the DIMM for more chips 216, thereby increasing the number of independent banks 214 available.
  • Each chip 216 may be used to store data 211, information from LED 220, and information from GEC 230. Accordingly, each chip 216 may contain a segment 219 of data 211, LED 220, and GEC 230 information. This is in contrast to how a conventional 9-chip ECC memory module is used, where 8 chips are used for data and the 9th chip is used for ECC information. Accordingly, the exemplary uses described herein provide robust chipkill protection, because each chip can include the data 211, LED 220, and GEC 230 for purposes of identifying and correcting errors.
  • the example of FIG. 2 illustrates functionality with just a single rank of nine x8 chips, improving access granularity, energy consumption, and performance. Further, the example can support chipkill protection at very high ratios, such as the ability to handle 1 dead chip in 9, significantly boosting reliability guarantees (in contrast to conventional support of, e.g., 1 dead chip in 36).
  • Examples described herein can allow several-fold reduction in the number of chips activated per memory access. This helps reduce dynamic energy consumption by eliminating overfetch at least to that extent, and helps reduce static energy consumption by allowing unused chips to be put in low-power modes. In addition to the energy advantages, reducing access granularity increases rank-level and bank-level parallelism. This enables substantial performance gains. Examples described herein impose no restrictions on DRAM parts, DIMM layout, DDR protocol, burst length, etc., and may be adapted to x8, x16, or x32 DRAMs, allowing servers to exploit advantages of those memory configurations.
  • Examples may be achieved with non-intrusive modifications to system designs, because an example memory controller, and to a smaller extent a memory firmware, may be modified to provide support for the examples. Examples may utilize additive checksums and parity calculations to provide strong fault tolerance without a need for Galois field arithmetic over 16-bit or 32-bit symbols or other increased complexity, latency, and energy consumption.
  • examples herein provide benefits without a need for a specially designed DRAM chip microarchitecture, e.g., a DRAM having a special area provisioned to store ECC information and utilizing a localized data mapping architecture that would impose significant penalties on write performance if chipkill-level reliability is enabled. Further, there is no need for using conventional symbol-based ECC codes that have constraints with respect to DIMM and rank organization. Examples are implementation friendly, without a need for modifications to several components of a computing system, because examples may be transparent to a computing system's operating system, memory management unit, caches, etc.
  • FIG. 3 is a block diagram of a memory module rank 312 according to an example.
  • the rank 312 may include N chips, e.g., nine x8 DRAM chips 316 (chip 0 ... chip 8), and a burst length of 8. In alternate examples, other numbers/combinations of N chips may be used, at various levels of xN and burst lengths.
  • the data 311, LED 320, and GEC 330 can be distributed throughout the chips 316 of the rank 312.
  • LED 320 can perform an immediate check following every read operation to verify data fidelity. Additionally, LED 320 can identify a location of the failure, at a chip-granularity within rank 312. To ensure such chip-level detection (usable for chipkill), the LED 320 can be maintained at the chip level - associated with more specificity than an entire cache line as a whole (as in symbol-based ECC codes), at every cache line "segment," the fraction of the line present in a single chip 316 in the rank 312. Cache line A is divided into segments A0 through A8, with the associated local error detection codes LA0 through LA8.
  • a cache line may be associated with 64 bytes of data, or 512 data bits, associated with a data operation, such as a memory request. Because 512 data bits (one cache line) in total are needed, each chip is to provide 57 bits towards the cache line.
  • An x8 chip with a burst length of 8 supplies 64 bits per access, which are interpreted as 57 bits of data (A0 in FIG. 3, for example), and 7 bits of LED information 320 associated with those 57 bits (LA0).
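The bit budget implied by the last two bullets can be checked in a few lines (a sketch; the 57/7 split per chip is the example from the text):

```python
chips, burst, io_width = 9, 8, 8       # nine x8 DRAM chips, burst of 8
bits_per_chip = burst * io_width       # 64 bits supplied per chip per access
data_bits, led_bits = 57, 7            # per-chip split described above
assert data_bits + led_bits == bits_per_chip

cache_line_bits = 64 * 8               # one 64-byte cache line
# Nine 57-bit segments yield 513 bits: 512 for the line plus one surplus
# bit, which the GEC layout later borrows.
assert chips * data_bits == cache_line_bits + 1
```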
  • a physical data mapping policy may be used to ensure that LED bits 320 and the data segments 311 they protect are located on the same chip 316.
  • There are no performance penalties on either reads or writes due to the LED code 320. Every cache line access also reads/writes its corresponding LED information. Since the LED 320 is "self-contained," i.e., it is constructed from bits belonging to exactly one cache line, no read-before-write is needed - all bits used to build the code are already at the memory controller before a write.
  • the error detection code for the LED 320 can depend on an expected failure mode. For example, a simple 1's complement addition checksum may be used for a range of expected failure modes, including the most common/frequent modes of memory failure.
  • the GEC 330, also referred to as a Layer 2 Global Error Correction code, is to aid in the recovery of lost data once the LED 320 (Layer 1 code) detects an error and indicates a location of the error.
  • the Layer 2 GEC 330 may comprise three tiers.
  • the GEC 330 code may be a 57-bit entity, and may be provided as a column-wise XOR parity of nine cache line segments, each a 57-bit field from the data region.
  • its GEC 330 may be a parity, such as a parity PA that is an XOR of data segments A0, A1, ..., A8.
  • Data reconstruction from the GEC 330 code may be a non-resource intensive operation (e.g., an XOR of the error-free segments and the GEC 330 code), as the erroneous chip 316 can be flagged by the LED 320.
  • the GEC code may be stored in data memory itself, in contrast to using a dedicated ECC chip.
  • the available memory may be made to appear smaller than it physically is (e.g., by 12.5% overhead for storing LED 320 and/or GEC 330) from the perspective of the operating system, via firmware modifications or other techniques.
  • the memory controller also may be aware of the changes to accommodate the LED 320 and/or GEC 330, and may map data accordingly (such as mapping to make the LED 320 and/or GEC 330 transparent to the OS, applications, etc.).
  • the GEC 330 code may be placed in the same rank as its corresponding cache line.
  • a specially-reserved region (lightly shaded GEC 330 in FIG. 3) in each of the nine chips 316 in the rank 312 may be set aside for this purpose.
  • the specially-reserved region may be a subset of cache lines in every DRAM page (row), although it is shown as a distinct set of rows in FIG. 3 for clarity. This co-location may ensure that any reads or writes to the GEC 330 information will be guaranteed to produce a row-buffer hit when made in conjunction with the read or write to the actual data cache line, thus reducing any potential impacts to performance.
  • FIG. 4 is a block diagram of a cache line 413 including a surplus bit 436 according to an example.
  • the GEC 430 information may be laid out in a reserved region across N chips (e.g., Chip 0...8), for example for cache line A, also illustrated in FIG. 3.
  • the cache line 413 also may include parity 432, tiered parity 434, and surplus bit 436.
  • the 57-bit GEC 430 may be distributed among all N (i.e., nine) chips 416.
  • the first seven bits of the PA field (PA0-6) may be stored in the first chip 416 (Chip 0), the next seven bits (PA7-13) may be stored in the second chip (Chip 1), and so on.
  • Bits PA49-55 may be stored on the eighth chip (Chip 7).
  • the last bit, PA56, may be stored on the ninth chip (Chip 8), in the surplus bit 436.
  • the surplus bit 436 may be borrowed from the Data+LED region of the Nth chip (Chip 8), as set forth above regarding using only 512 bits of the available 513 bits (57 bits x 9 chips) to store the cache line.
  • the failure of a chip 416 also results in the loss of the corresponding bits in the GEC 430 information stored in that chip.
  • the GEC 430 code PA itself, therefore, is protected by an additional parity 432, also referred to as the third tier PPA.
  • PPA in the illustrated example is a 7-bit field, and is the XOR of the N-1 other 7-bit fields, PA0-6, PA7-13, ..., PA49-55.
  • the parity 432 (PPA field) is shown stored on the Nth (ninth) chip (Chip 8). If an entire chip 416 fails, the GEC 430 is first recovered using the parity 432 combined with uncorrupted GEC segments from the other chips. The chips 416 that are uncorrupted may be determined based on the LED (which can include an indication of an error's location, i.e., locate the failed chip). The full GEC 430 is then used to reconstruct the original data.
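The relationship between the tiers (PA protects the data; PPA protects PA) can be sketched as follows (Python; the slice widths follow the example layout, the surplus-bit handling is simplified, and the function names are illustrative):

```python
def split_pa(pa: int) -> tuple[list[int], int]:
    """Split the 57-bit GEC parity PA into eight 7-bit slices (Chips 0-7)
    plus the surplus bit PA56 (Chip 8), per the layout above."""
    slices = [(pa >> (7 * i)) & 0x7F for i in range(8)]
    return slices, (pa >> 56) & 1

def third_tier_parity(slices: list[int]) -> int:
    """PPA: XOR of the eight 7-bit GEC slices, stored on the ninth chip."""
    ppa = 0
    for s in slices:
        ppa ^= s
    return ppa

def recover_gec_slice(slices: list[int], ppa: int, failed: int) -> int:
    """Recover the GEC slice lost with a failed chip (0-7) from PPA and the
    surviving slices, before the full PA is used to rebuild the data."""
    value = ppa
    for i, s in enumerate(slices):
        if i != failed:
            value ^= s
    return value
```

The same XOR-cancellation argument as in the data tier applies: losing any one chip loses one 7-bit slice of PA, and PPA plus the seven surviving slices restore it.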
  • a code may be built using various permutations of bits from the different chips to form each of the T4 bits 434. This can include multiple bits from the same chip 416, and bits from different columns across chips 416 to maximize the probability of detection.
  • chips 0-7 (without loss of generality, e.g., N-1 chips) can contain 57 bits of data plus 7 bits of LED in the data region, and 7 bits of GEC 430 parity plus 1 bit of T4 information (tiered parity 434) in the GEC region.
  • Chip-8 (the Nth chip) can contain 56 bits of data plus 7 bits of LED plus one surplus bit 436 in the data region, and 8 bits of parity (including the surplus bit borrowed from the data region) plus one bit of T4 information in the GEC region.
  • If one of the first eight chips fails (Chip 1, for example), 57 bits of data (A1) are lost, in addition to GEC parity information PA7-13.
  • the lost information can be recovered by reading A0 - A8; the LED associated with A1 (LA1) indicates a chip error.
  • GEC segments PA0-6, PA14-20, PA21-27, ..., PA49-55, PA56, and PPA are read to recover the lost GEC bits PA7-13, thereby reconstructing GEC parity PA.
  • data value A1 can then be reconstructed, thus recovering the entire original cache line.
  • GEC 430 (which includes PA, PPA, and T4) may be updated when data is modified.
  • each cache line write may be transformed into two writes - one to the data location (for a full 576 bits of data + LED + surplus bit) and another to its corresponding GEC location (72-bits).
  • Although only 72 bits of GEC+T4 code may be updated per write, other constraints (e.g., the DDR3 protocol) may be associated with completing a burst of 8 per access (e.g., an entire 72-byte "cache line" size of data).
  • updates may be combined, e.g., as many as 8 different GEC updates into a single write command, to reduce some of the performance impact.
  • This is low-overhead since writes are already buffered and streamed out intermittently from the memory controller, and additional logic can easily be implemented at this stage to coalesce as many GEC writes as possible.
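One plausible shape for that coalescing logic (a sketch: the 8-updates-per-GEC-line grouping follows the text, but the buffering interface and names are hypothetical):

```python
from collections import defaultdict

ENTRIES_PER_GEC_LINE = 8  # up to 8 GEC updates fit one 72-byte GEC write

def coalesce_gec_writes(
    pending: list[tuple[int, bytes]],
) -> list[list[tuple[int, bytes]]]:
    """Group buffered (gec_entry_index, payload) updates by the GEC cache
    line they fall in, so each group can issue as a single masked write."""
    groups: dict[int, list[tuple[int, bytes]]] = defaultdict(list)
    for idx, payload in pending:
        groups[idx // ENTRIES_PER_GEC_LINE].append((idx, payload))
    return list(groups.values())
```

Entries that land in the same GEC line become one write command; data masking (mentioned below) then suppresses the untouched bytes within the burst.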
  • Performance impact is further minimized because the data mapping ensures that the GEC write is a row-buffer hit once the data line is written.
  • there is not a need for a read-before-write of the data cache lines themselves because bits contributing to the GEC code are from a single cache line, already available at the controller. This further minimizes performance impact.
  • data masking can be employed to write the appropriate bits into memory. Note that the complete burst of 8 may be performed nonetheless - some pieces of data are just masked out while actually writing to DRAM.
  • The following bits may be used: 63 bits of LED information, at 7 bits per chip; 57 bits of GEC parity, spread across the nine chips; 7 bits of third-level parity, PPX; and 9 bits of T4 protection, 1 bit per chip. This adds up to a total of 136 bits out of 512 bits of the cache line, a storage overhead of 26.5%.
  • Of this, 12.5% may be provided by the 9th chip added on to standard ECC DIMMs (e.g., making the 9th chip available for general use, instead of reserving it for standard ECC-only operation), and the other 14% is stored in data memory in the GEC region.
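The overhead arithmetic in the two bullets above checks out as follows (a sketch; the quoted 26.5% is the rounded value of 26.5625%):

```python
led = 9 * 7        # 7 bits of LED per chip, nine chips
gec = 57           # column-wise XOR parity PA
ppa = 7            # third-level parity protecting PA
t4  = 9 * 1        # 1 bit of T4 per chip
total = led + gec + ppa + t4
assert total == 136                      # 136 ECC bits per 512-bit line

overhead = total / 512                   # 0.265625, quoted as 26.5%
ecc_chip_share = 64 / 512                # 12.5% absorbed by the ninth chip
in_data_memory = overhead - ecc_chip_share   # ~0.14, the ~14% GEC region
```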
  • In some configurations, the GEC overhead may increase as well; for example, where the global parity is a 103-bit entity computed over four 103-bit data segments, its storage overhead is 25%, with total overhead of approximately 50%.
  • If storage overhead is prioritized, it can be fixed at about 12.5%, with a tradeoff of an increase in access granularity.
  • the GEC overhead remains approximately 25% (1 in 4 chips), for an overall ECC storage overhead of 37.5%.
  • Substantial power savings may be realized, compared to traditional chipkill mechanisms, through a reduction of both dynamic and static power. It is possible to activate the absolute minimum number of chips required to service a request, e.g., just nine x8 chips, reading/writing exactly one 64-byte cache line in a standard 8-burst access. This is in contrast to conventional chipkill solutions that may cause forced prefetching and increase dynamic power consumption (e.g., by activating additional chips per read/write, accessing multiple cache lines per standard 8-burst access). Examples provided herein also may enable a reduction in activate power, because the size of the row buffer per chip may be constant, but fewer chips are being activated.
  • Activation power also may be reduced going from x4 chips to x8 chips, because fewer chips make up a rank.
  • the footprint of each activation also may be reduced, allowing unused ranks/banks to transition into low-power modes, such as shallow low-power modes that can be entered into and exited from quickly.
  • FIG. 5 is a flow chart 500 based on checking data fidelity according to an example.
  • local error detection (LED) information is computed per cache line segment of data associated with a rank of a memory, based on an error detection code.
  • global error correction (GEC) information is generated for the cache line segment, based on a global parity.
  • data fidelity is checked in response to a memory read operation, based on the LED information, to identify a presence of an error and the location of the error among cache line segments of the rank.
  • the cache line segment having the error is corrected based on the GEC, in response to identifying the error.
  • FIG. 6 is a flow chart 600 based on performing error detection and/or correction according to an example.
  • a local error detection (LED) is performed in response to a memory read operation, based on a checksum computed over a cache line segment, to detect a location of an error at a chip- granularity among N data chips in a rank.
  • a global error correction (GEC) is performed over the cache line segment on the N data chips in the rank in response to detecting the error, the GEC performed using a global parity to generate GEC information.
  • data segments having the error are reconstructed, based on error-free segments and the GEC information.
  • the GEC information is updated in response to a write operation.
  • a tiered parity is generated to protect the GEC information, wherein the tiered parity is stored on an Nth chip and is to be used to recover the GEC information based on GEC information segments from a plurality of chips.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Detection And Correction Of Errors (AREA)

Abstract

A system may use local error detection (LED) and global error correction (GEC) information to check data fidelity and correct an error. The LED may be calculated per cache line segment of data associated with a rank of a memory. Data fidelity may be checked in response to a memory read operation, based on the LED information, to identify a presence of an error and the location of the error among cache line segments of the rank. The cache line segment having the error may be corrected based on the GEC, in response to identifying the error.

Description

LOCAL ERROR DETECTION AND GLOBAL ERROR
CORRECTION
BACKGROUND
[0001] Memory system reliability is a serious and growing concern in modern servers and blades. Existing memory protection mechanisms require one or more of the following: activation of a large number of chips on every memory access, increased access granularity, and an increase in storage overhead. These lead to increased dynamic random access memory (DRAM) access times, reduced system performance, and substantially higher energy consumption. Current commercial chipkill-level reliability mechanisms may be based on conventional Error-Correcting Code (ECC) such as Reed-Solomon (RS) codes, symbol-based codes, etc. However, current ECC codes restrict memory system design to use of x4 DRAMs. Further, for a given capacity, dual in-line memory modules (DIMMs) with narrow I/O chips (i.e., x4 DRAM chips) consume more energy than those with wider I/O chips (i.e., x8, x16, or x32 chips).
[0002] This non-availability of efficient chipkill mechanisms is one reason for the lack of adoption of wide input/output (I/O) DRAMs despite the advantages they offer. Second, current ECC codes are computed over large data words to increase coding efficiency. This ECC code handling results in large access granularities, activating a large number of chips or even ranks for every memory operation, and increased energy consumption. Area, density, and cost constraints can lead to overfetch to some extent within a rank of chips, but imposing additional inefficiency in order to provide fault tolerance should be avoided. The handling may potentially reduce bank-level and rank-level parallelism, which diminishes the ability of DRAM to supply data to high bandwidth I/O such as photonic channels. Finally, conventional ECC codes employ complex Galois field arithmetic that is inefficient in terms of both latency and circuit area.
BRIEF DESCRIPTION OF THE DRAWINGS/FIGURES
[0003] FIG. 1 is a block diagram of a system including a memory controller according to an example.
[0004] FIG. 2 is a block diagram of a memory module according to an example.
[0005] FIG. 3 is a block diagram of a memory module rank according to an example.
[0006] FIG. 4 is a block diagram of a cache line including a surplus bit according to an example.
[0007] FIG. 5 is a flow chart based on checking data fidelity according to an example.
[0008] FIG. 6 is a flow chart based on performing error detection and/or correction according to an example.
DETAILED DESCRIPTION
[0009] Examples described herein can use a two-tier protection scheme that separates out error detection and error correction functionality. Codes, such as those based on checksum and parity, can be used effectively to provide strong fault-tolerance with little or no overhead. Storage, such as system firmware, may be used to direct a memory controller to store some correction codes in DRAM data memory. The memory controller may be modified to handle data mapping, error detection, and correction. Novel application of error detection/correction codes, and novel physical data mapping to memory, can allow a commodity memory module (e.g., ECC DIMM x4, x8, x16, x32 etc.) to provide chipkill functionality without increasing the fetch width and/or the storage overhead. Further, locality and DRAM row buffer hit rates may be further improved by placing the data and the ECC codes in the same row buffer. Thus, an effective fault-tolerance mechanism is provided, enabling strong reliability guarantees, activating as few chips as possible to conserve energy and improve performance, reducing circuit complexity, and working with wide I/O DRAM chips such as x8, x16, or x32.
[0010] FIG. 1 is a block diagram of a system 100 including a memory controller 102 according to an example. System 100, in response to a memory read operation 140, is to apply local error detection 120 and/or global error correction 130 to detect and/or correct an error 104 of a cache line segment 119 of a rank 112 of memory.
[0011] In an example, system 100 is to compute local error detection (LED) 120 information per cache line segment 119 of data. The cache line segment 119 is associated with a rank 112 of memory. The LED 120 is to be computed based on an error detection code. The system 100 is to generate a global error correction (GEC) for the cache line segment, based on a global parity. The system 100 is to check data fidelity in response to memory read operation 140, based on the LED 120 information, to identify a presence of an error 104 and the location of the error 104 among cache line segments 119 of the rank 112. The system 100 is to correct the cache line segment 119 having the error 104 based on the GEC, in response to identifying the error 104.
[0012] In an alternate example, system 100 is to perform local error detection (LED) 120 in response to a memory read operation 140, based on a checksum computed over a cache line segment 119, to detect a location of an error 104 at a chip-granularity among N data chips in a rank 112. The system 100 is to perform global error correction (GEC) 130 over the cache line segment 119 on the N data chips in the rank 112 in response to detecting the error 104. The system 100 is to perform the GEC 130 using a global parity to generate GEC information, and reconstruct data segments 119 having the error 104, based on error-free segments and the GEC information.
[0013] Thus, system 100 may use simple checksums and parity operations to build a two-layer fault tolerance mechanism, at a level of granularity down to a segment 119. The first layer of protection is local error detection (LED) 120, a check (e.g., an immediate check that follows a read operation 140) to verify data fidelity. The LED 120 can provide chip-level error detection (for chipkill, i.e., the ability to withstand the failure of an entire DRAM chip), by distributing LED information 120 across a plurality of chips in a memory module. Thus, the LED information 120 may be associated, not only with each cache line as a whole, but with every cache line "segment," i.e., the fraction of the line present in a single chip in the rank.
[0014] A relatively short checksum (1's complement, Fletcher's sums, or other) may be used as the error detection code, and may be computed over the segment and appended to the data. The error detection code may be based on other types of error detection and/or error protection codes, such as cyclic redundancy check (CRC), Bose, Ray-Chaudhuri, and Hocquenghem (BCH) codes, and so on. This error detection code may be stored in the same memory row, or in a different row to contain such LED information for several cache lines. The layer-1 protection (LED 120) may not only detect the presence of an error, but also pinpoint a location of the error, i.e., locate the chip or other location information associated with the error 104.
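The per-segment checksum check can be sketched as follows; this is an illustrative model only, treating each segment as a byte string folded into an 8-bit 1's-complement sum (the disclosure's actual LED is a 7-bit code computed over a 57-bit segment):

```python
def ones_complement_checksum(segment: bytes, width: int = 8) -> int:
    """Fold a data segment into a `width`-bit 1's-complement sum."""
    mask = (1 << width) - 1
    total = 0
    for byte in segment:
        total += byte
        total = (total & mask) + (total >> width)  # wrap carries back in
    return total & mask

def led_check(segment: bytes, stored_led: int) -> bool:
    """LED fidelity check: recompute the checksum over the segment and
    compare it against the LED bits stored alongside that segment."""
    return ones_complement_checksum(segment) == stored_led
```

Because each chip carries the LED bits for its own segment, running this check per segment pinpoints the failing chip, not merely the presence of an error in the cache line.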
[0015] If the LED 120 detects an error, the second layer of protection may be applied, the Global Error Correction (GEC) 130. The GEC 130 may be based on a parity, such as an XOR-based global parity across the data segments 119 on the N data chips in the rank 112. The GEC 130 also may be based on other error detection and/or error protection codes, such as CRC, BCH, and others. The GEC results may be stored in either the same row as the data segments, or in a separate row that is to contain GEC information for several data rows. Data may be reconstructed based on reading out the fault-free segments and the GEC segment, and location information (e.g., an identification of the failed chip based on the LED 120).
[0016] The LED 120 and GEC 130 may be computed over the data words in a single cache line. Thus, when a dirty line is to be written back to memory from the processor, there is no need to perform a "read-before-write," and both codes can be computed directly, thereby avoiding impacts to write performance. Furthermore, LED 120 and/or GEC 130 may be stored in regular data memory, in view of a commodity memory system that may provide limited redundant storage for Error-Correcting Code (ECC) purposes. An additional read/write operation may be used to access this information along with the processor-requested read/write. Storing LED information in the provided storage space within each row may enable it to be read and written in tandem with the data line. GEC information can be stored in data memory in a separate cache line since it can be accessed in the very rare case of an erroneous data read. Appropriate data mapping can locate this in the same row buffer as the data to increase locality and hit rates.
[0017] The memory controller 102 may provide data mapping, LED 120/GEC 130 computation and verification, perform additional reads if required, etc. Thus, system 100 may provide full functionality transparently, without a need to notify and/or modify an Operating System (OS) or other computing system components. Setting apart some data memory to store LED 120/GEC 130 may be handled through minor modifications associated with system firmware, e.g., reducing a reported amount of available memory storage to accommodate the stored LED 120/GEC 130 transparently from the OS and application perspective.
[0018] FIG. 2 is a block diagram of a memory module 210 according to an example. The memory module 210 may interface with memory controller 202. The memory module 210 may be a Joint Electron Devices Engineering Council (JEDEC)-style double data rate (DDRx, where x = 1, 2, 3, ...) memory module, such as a Synchronous Dynamic Random Access Memory (SDRAM) configured as a dual in-line memory module (DIMM). Each DIMM may include at least one rank 212, and a rank 212 may include a plurality of DRAM chips 216. Two ranks 212 are shown, each rank 212 including nine chips 216. A rank 212 may be divided into multiple banks 214, each bank distributed across the chips 216 in a rank 212. Although one bank 214 is shown spanning the chips in the rank, a rank may be divided into, e.g., 4-16 banks. The portion of each rank 212/bank 214 in a chip 216 is a segment 219. When the memory controller 202 issues a request for a cache line, the chips 216 in the rank 212 are activated and each segment 219 contributes a portion of the requested cache line. Thus, a cache line is striped across multiple chips 216.
[0019] In an example having a data bus width of 64 bits, and a cache line of 64 bytes, the cache line transfer can be realized based on a burst of 8 data transfers. A chip may be an xN part, e.g., x4, x8, x16, x32, etc. Each segment of a bank 214 may be partitioned into N arrays 218 (four are shown). Each array 218 can contribute a single bit to the N-bit transfer on the data I/O pins for that chip 216. Thus, for a given DIMM capacity, employing wider I/O DRAM parts such as x8, x16, or x32 may decrease the number of DRAM chips 216 needed to achieve a given data bus width, creating extra space on the DIMM for more chips 216, thereby increasing the number of independent banks 214 available.
[0020] Each chip 216 may be used to store data 211, information from LED 220, and information from GEC 230. Accordingly, each chip 216 may contain a segment 219 of data 211, LED 220, and GEC 230 information. This is in contrast to how a conventional 9-chip ECC memory module is used, where 8 chips are used for data and the 9th chip is used for ECC information. Accordingly, the exemplary uses described herein provide robust chipkill protection, because each chip can include the data 211, LED 220, and GEC 230 for purposes of identifying and correcting errors. The example of FIG. 2 illustrates functionality with just a single rank of nine x8 chips, improving access granularity, energy consumption, and performance. Further, the example can support chipkill protection at very high ratios, such as the ability to handle 1 dead chip in 9, significantly boosting reliability guarantee (in contrast to conventional support of, e.g., 1 dead chip in 36).
[0021] Examples described herein can allow several-fold reduction in the number of chips activated per memory access. This helps reduce dynamic energy consumption by eliminating overfetch at least to that extent, and helps reduce static energy consumption by allowing unused chips to be put in low-power modes. In addition to the energy advantages, reducing access granularity increases rank-level and bank-level parallelism. This enables substantial performance gains. Examples described herein impose no restrictions on DRAM parts, DIMM layout, DDR protocol, burst length, etc., and may be adapted to x8, x16 or x32 DRAMs, allowing servers to exploit advantages of those memory configurations. Examples may be achieved with non-intrusive modifications to system designs, because an example memory controller, and to a smaller extent a memory firmware, may be modified to provide support for the examples. Examples may utilize additive checksums and parity calculations to provide strong fault tolerance without a need for Galois field arithmetic over 16-bit or 32-bit symbols or other increased complexity, latency, and energy consumption.
[0022] Thus, examples herein provide benefits without a need for a specially designed DRAM chip microarchitecture, e.g., a DRAM having a special area provisioned to store ECC information and utilizing a localized data mapping architecture that would impose significant penalties on write performance if chipkill-level reliability is enabled. Further, there is no need for using conventional symbol-based ECC codes that have constraints with respect to DIMM and rank organization. Examples are implementation friendly, without a need for modifications to several components of a computing system, because examples may be transparent to a computing system's operating system, memory management unit, caches, etc.
[0023] FIG. 3 is a block diagram of a memory module rank 312 according to an example. The rank 312 may include N chips, e.g., nine x8 DRAM chips 316 (chip 0 ... chip 8), and a burst length of 8. In alternate examples, other numbers/combinations of N chips may be used, at various levels of xN and burst lengths. The data 311, LED 320, and GEC 330 can be distributed throughout the chips 316 of the rank 312.
[0024] LED 320 can perform an immediate check following every read operation to verify data fidelity. Additionally, LED 320 can identify a location of the failure, at a chip-granularity within rank 312. To ensure such chip-level detection (usable for chipkill), the LED 320 can be maintained at the chip level - associated with more specificity than an entire cache line as a whole (as in symbol-based ECC codes), at every cache line "segment," the fraction of the line present in a single chip 316 in the rank 312. Cache line A is divided into segments A0 through A8, with the associated local error detection codes LA0 through LA8.
[0025] A cache line may be associated with 64 bytes of data, or 512 data bits, associated with a data operation, such as a memory request. Because 512 data bits (one cache line) in total are needed, each chip is to provide 57 bits towards the cache line. An x8 chip with a burst length of 8 supplies 64 bits per access, which are interpreted as 57 bits of data (A0 in FIG. 3, for example), and 7 bits of LED information 320 associated with those 57 bits (LA0). A physical data mapping policy may be used to ensure that LED bits 320 and the data segments 311 they protect are located on the same chip 316. One bit of memory appears to remain unused for every 576 bits, since 57 bits of data multiplied by 9 chips is 513 bits, and only 512 bits are needed to store the cache line. However, this "surplus bit" is used as part of the second layer of protection (e.g., GEC), details of which are described in reference to FIG. 4.
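The per-chip bit budget above can be checked with simple arithmetic; this worked example just recomputes the figures stated in the paragraph:

```python
CHIPS = 9                      # nine x8 DRAM chips in the rank
BITS_PER_CHIP = 8 * 8          # x8 part, burst of 8 -> 64 bits per access
LED_BITS = 7                   # per-segment LED checksum bits

DATA_BITS = BITS_PER_CHIP - LED_BITS   # 57 data bits per chip
TOTAL_DATA = CHIPS * DATA_BITS         # 513 bits supplied across the rank
SURPLUS = TOTAL_DATA - 512             # 1 "surplus" bit beyond the cache line
```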
[0026] There are no performance penalties on either reads or writes due to the LED code 320. Every cache line access also reads/writes its corresponding LED information. Since the LED 320 is "self-contained," i.e., it is constructed from bits belonging to exactly one cache line, no read-before-write is needed - all bits used to build the code are already at the memory controller before a write. The choice of error detection code for the LED 320 can depend on an expected failure mode. For example, a simple 1's complement addition checksum may be used for a range of expected failure modes, including the most common/frequent modes of memory failure.
[0027] The GEC 330, also referred to as a Layer 2 Global Error Correction code, is to aid in the recovery of lost data once the LED 320 (Layer 1 code) detects an error and indicates a location of the error. The Layer 2 GEC 330 may be comprised of three tiers. The GEC 330 code may be a 57-bit entity, and may be provided as a column-wise XOR parity of nine cache line segments, each a 57-bit field from the data region. For cache line A, for example, its GEC 330 may be a parity, such as a parity PA that is a XOR of data segments A0, A1, ..., A8. Data reconstruction from the GEC 330 code may be a non-resource intensive operation (e.g., an XOR of the error-free segments and the GEC 330 code), as the erroneous chip 316 can be flagged by the LED 320. Because there is not a need for an additional dedicated ECC chip (what is normally used as an ECC chip on a memory module rank 312 is instead used to store data + LED 320), the GEC code may be stored in data memory itself. The available memory may be made to appear smaller than it physically is (e.g., by 12.5% overhead for storing LED 320 and/or GEC 330) from the perspective of the operating system, via firmware modifications or other techniques. The memory controller also may be aware of the changes to accommodate the LED 320 and/or GEC 330, and may map data accordingly (such as mapping to make the LED 320 and/or GEC 330 transparent to the OS, applications, etc.).
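A minimal sketch of the parity generation and reconstruction described above, modeling each 57-bit segment as a Python integer; the failed-chip index is assumed to come from the LED check:

```python
from functools import reduce

def gec_parity(segments):
    """Column-wise XOR parity PA across all cache line segments."""
    return reduce(lambda a, b: a ^ b, segments)

def reconstruct_segment(segments, parity, failed_chip):
    """Rebuild the segment on the chip flagged by the LED by XOR-ing
    the error-free segments with the stored GEC parity."""
    survivors = (s for i, s in enumerate(segments) if i != failed_chip)
    return reduce(lambda a, b: a ^ b, survivors, parity)
```

Because XOR is its own inverse, the corrupted segment's column values fall out of the parity once every surviving column is folded back in, which is why reconstruction is a non-resource-intensive operation.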
[0028] In order to provide strong fault-tolerance of one dead chip 316 in nine for chipkill, and to minimize the number of chips 316 touched on each access, the GEC 330 code may be placed in the same rank as its corresponding cache line. A specially-reserved region (lightly shaded GEC 330 in FIG. 3) in each of the nine chips 316 in the rank 312 may be set aside for this purpose. The specially-reserved region may be a subset of cache lines in every DRAM page (row), although it is shown as a distinct set of rows in FIG. 3 for clarity. This co-location may ensure that any reads or writes to the GEC 330 information will be guaranteed to produce a row-buffer hit when made in conjunction with the read or write to the actual data cache line, thus reducing any potential impacts to performance.
[0029] FIG. 4 is a block diagram of a cache line 413 including a surplus bit 436 according to an example. The GEC 430 information may be laid out in a reserved region across N chips (e.g., Chip 0...8), shown for example cache line A, also illustrated in FIG. 3. The cache line 413 also may include parity 432, tiered parity 434, and surplus bit 436.
[0030] Similar to the data bits as shown in FIG. 3, the 57-bit GEC 430 may be distributed among all N (i.e., nine) chips 416. The first seven bits of the PA field (PA0-6) may be stored in the first chip 416 (Chip 0), the next seven bits (PA7-13) may be stored in the second chip (Chip 1), and so on. Bits PA49-55 may be stored on the eighth chip (Chip 7). The last bit, PA56, may be stored on the ninth chip (Chip 8), in the surplus bit 436. The surplus bit 436 may be borrowed from the Data+LED region of the Nth chip (Chip 8), as set forth above regarding using only 512 bits of the available 513 bits (57 bits x 9 chips) to store the cache line.
[0031] The failure of a chip 416 also results in the loss of the corresponding bits in the GEC 430 information stored in that chip. The GEC 430 code PA itself, therefore, is protected by an additional parity 432, also referred to as the third tier PPA. PPA in the illustrated example is a 7-bit field, and is the XOR of the N-1 other 7-bit fields, PA0-6, PA7-13, ..., PA49-55. The parity 432 (PPA field) is shown stored on the Nth (ninth) chip (Chip 8). If an entire chip 416 fails, the GEC 430 is first recovered using the parity 432 combined with uncorrupted GEC segments from the other chips. The chips 416 that are uncorrupted may be determined based on the LED, which can include an indication of the error's location, i.e., the failed chip. The full GEC 430 is then used to reconstruct the original data.
[0032] In addition to a fully failed chip error, there may be an error in a second chip. Examples described herein enable detection, if not correction, of such a failure under the various fault models. If the second error is also a full-chip failure, it will be detected by the LED along with the initial data read, and flagged as a double-chip failure. However, if the second error occurs just in the GEC 430 row of interest, it can be detected during the GEC phase.
[0033] In an example failure scenario, assume that the second chip has completely failed - A1 and PA7-13 would be lost. If, in addition, there is an error in the GEC region of the first chip, there is a possibility that one or more of the bits PA0-6 are corrupt. The reconstruction of lost bits PA7-13 from PPA 432 and PA0-6, PA14-20, PA21-27, ..., PA56 may itself be incorrect. To handle this problem, tiered parity 434 is used, e.g., the remaining 9 bits of the nine chips 416 (marked T4, for Tier-4, in FIG. 4) are used to build an error detection code across GEC bits PA0 through PA55, and PPA. Note that neither exact error location information nor correction capabilities are required at this stage, because the reliability target is only to detect a second error, and not necessarily correct it. A code, therefore, may be built using various permutations of bits from the different chips to form each of the T4 bits 434. This can include multiple bits from the same chip 416, and bits from different columns across chips 416 to maximize the probability of detection.
[0034] In another example, consider a single cache line A. Recall that chips 0-7 (without loss of generality, e.g., N-1 chips) can contain 57 bits of data plus 7 bits of LED in the data region, and 7 bits of GEC 430 parity plus 1 bit of T4 information (tiered parity 434) in the GEC region. Chip-8 (the Nth chip) can contain 56 bits of data plus 7 bits of LED plus one surplus bit 436 in the data region, and 8 bits of parity (including the surplus bit borrowed from the data region) plus one bit of T4 information in the GEC region.
[0035] If one of the first eight chips, e.g., the second chip, fails, 57 bits of data (A1) are lost, in addition to GEC parity information PA7-13. The lost information can be recovered by reading A0 - A8, where the LED associated with A1 (LA1) indicates a chip error. Read GEC segments PA0-6, PA14-20, PA21-27, ..., PA49-55, PA56, and PPA to recover the lost GEC bits PA7-13, thereby reconstructing GEC parity PA. Combined with values A0 and A2 - A8, data value A1 can be reconstructed, thus recovering the entire original cache line. If, however, the ninth chip were to fail, only 56 bits of data would be lost (A8), in addition to PPA, and the surplus bit PA56. The lost 56 bits can be recovered from the 56 columns of parity stored in the first eight chips (PA0-55), thus recovering the entire original cache line. The loss of surplus bit PA56 is immaterial. Across these cases, the fidelity of the GEC parity bits themselves is guaranteed by tiered parity 434 T4.
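This recovery sequence can be sketched end to end; the segment values and the 7-bits-per-chip striping of PA are illustrative assumptions, and the failed-chip index is assumed to come from the LED:

```python
from functools import reduce

def xor_all(values, start=0):
    return reduce(lambda a, b: a ^ b, values, start)

MASK57 = (1 << 57) - 1
A = [(0x0123456789 * (i + 1)) & MASK57 for i in range(9)]  # toy 57-bit segments
PA = xor_all(A)                                            # global GEC parity

# PA is striped 7 bits per chip on chips 0-7; bit 56 is the surplus
# bit on chip 8, which also stores the tier-3 parity PPA.
P = [(PA >> (7 * i)) & 0x7F for i in range(8)]
PA56 = PA >> 56
PPA = xor_all(P)

failed = 1   # chip flagged by the LED; A[1] and P[1] are lost

# Step 1: recover the lost GEC field from PPA and the surviving fields.
P1 = xor_all(P[i] for i in range(8) if i != failed) ^ PPA

# Step 2: reassemble the full 57-bit parity PA.
fields = P[:failed] + [P1] + P[failed + 1:]
PA_rebuilt = sum(f << (7 * i) for i, f in enumerate(fields)) | (PA56 << 56)

# Step 3: reconstruct the lost data segment from the good segments.
A1 = xor_all(A[i] for i in range(9) if i != failed) ^ PA_rebuilt
```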
[0036] Read operations need not access GEC 430 information unless an error is detected, which is a rare event. GEC 430 therefore has no significant impact on reads. As for write operations, the GEC 430 may be updated (which includes PX, PPX, and T4) when data is modified. In a baseline implementation, each cache line write may be transformed into two writes - one to the data location (for a full 576 bits of data + LED + surplus bit) and another to its corresponding GEC location (72 bits). Although 72 bits of GEC+T4 code may be updated per write, other constraints (e.g., the DDR3 protocol) may be associated with completing a burst of 8 per access (e.g., an entire 72-byte "cache line" size of data). Thus, updates may be combined, e.g., as many as 8 different GEC updates into a single write command, to reduce some of the performance impact. This is low-overhead since writes are already buffered and streamed out intermittently from the memory controller, and additional logic can easily be implemented at this stage to coalesce as many GEC writes as possible. Performance impact is further minimized because the data mapping ensures that the GEC write is a row-buffer hit once the data line is written. Additionally, note that there is not a need for a read-before-write of the data cache lines themselves, because bits contributing to the GEC code are from a single cache line, already available at the controller. This further minimizes performance impact. If complete coalescing is not implemented (based on the addresses being written to), data masking can be employed to write the appropriate bits into memory. Note that the complete burst of 8 may be performed nonetheless - some pieces of data are just masked out while actually writing to DRAM.
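The coalescing step can be sketched as follows; the grouping key (8 GEC words per GEC "cache line") matches the paragraph above, while the queue format and function name are assumptions for illustration:

```python
from collections import defaultdict

def coalesce_gec_writes(pending_updates, words_per_line=8):
    """Group buffered GEC word updates that land in the same GEC cache
    line so they can be issued as a single write burst; words in the
    line that are not being updated would be masked out at the DRAM."""
    groups = defaultdict(list)
    for word_addr, value in pending_updates:
        groups[word_addr // words_per_line].append((word_addr, value))
    return list(groups.values())
```

In the best case all 8 GEC words of a line are updated together, giving the oracular delta of 0.125 extra writes per data write mentioned below; with no coalescing each data write carries one full GEC write (delta = 1).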
[0037] With all these considerations, every write is transformed into 1 + δ writes, for some fraction δ < 1 dependent on the access characteristics of the application. Note that δ = 1 in a non-coalesced baseline implementation, and δ = 0.125 in an oracular design because eight GEC words fit in a single "cache line," and could potentially be coalesced into a single write.
[0038] In an example implementation for nine chips (N=9), for each 64-byte (512-bit) cache line in a rank with nine x8 chips, the following bits may be used: 63 bits of LED information, at 7 bits per chip; 57 bits of GEC parity, spread across the nine chips; 7 bits of third-level parity, PPX; and 9 bits of T4 protection, 1 bit per chip. This adds up to a total of 136 bits out of 512 bits of the cache line, a storage overhead of 26.5%. Out of this 26.5%, 12.5% may be provided by the 9th chip added on to standard ECC DIMMs (e.g., making the 9th chip available for general use, instead of reserving it for standard ECC-only operation), and the other 14% is stored in data memory in the GEC region.
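These totals can be checked directly; note the disclosure rounds 136/512 (26.56%) to 26.5%:

```python
led = 9 * 7          # LED bits, 7 per chip
gec = 57             # global parity PA, spread across the nine chips
ppa = 7              # third-level parity PPX
t4  = 9 * 1          # one T4 bit per chip

total_ecc_bits = led + gec + ppa + t4    # 136 bits
overhead = total_ecc_bits / 512          # ~0.266 per 512-bit cache line
```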
[0039] The examples described herein may be applied to wider-I/O DRAM parts, which are associated with greater power efficiency and greater rank-level parallelism. A specific example will be provided for x16 DRAMs, and similar techniques may be used for extending the concepts to x32 DRAMs and beyond.
[0040] Consider a rank of nine x16 DRAMs. The 128 bits supplied by an x16 DRAM in a burst of 8 may be interpreted as 114 data bits and 14 checksum LED bits, having a storage overhead similar to using x8 DRAMs. GEC operation may remain unchanged. While there may be an increase in access granularity and overfetch, storage overhead may remain constant at approximately 25% (LED + GEC).
[0041] If access granularity is fixed at exactly one cache line (i.e., a 64-bit bus), the minimum rank size with x16 chips is 5 chips (4 data plus 1 ECC). Each chip provides 128 bits per burst of 8, interpreted as 103 data bits (since 103 bits x 5 chips = 515 bits, covering the 512-bit cache line). This leaves 25 bits per chip to store the LED code, which provides very strong error protection, but may be wasteful of storage area (the overhead would be 24%). GEC overhead may increase as well, because the global parity is a 103-bit entity computed over four 103-bit data segments, a storage overhead of 25%, with total overhead of approximately 50%.
[0042] If storage overhead is prioritized, it can be fixed at about 12.5%, with a tradeoff of an increase in access granularity. With x16 chips and a 5-chip rank, for example, 9 reads can be issued consecutively, reading out a total of 80 bits per cycle * burst of 8 cycles * 9 accesses = 5,760 bits. This results in a very large access granularity of 10 cache lines (5120 bits) plus their LED codes, with a storage overhead of 12.5%. The GEC overhead remains approximately 25% (1 in 4 chips), for an overall ECC storage overhead of 37.5%.
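The access-granularity arithmetic for this configuration can be verified as a worked example:

```python
chips, chip_width = 5, 16          # five x16 chips -> an 80-bit bus
burst, accesses = 8, 9             # 9 consecutive reads, burst of 8 each

total_bits = chips * chip_width * burst * accesses   # 5,760 bits read out
data_bits = 10 * 512                                 # ten 512-bit cache lines
led_overhead = (total_bits - data_bits) / data_bits  # 0.125, i.e. 12.5%
```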
[0043] If neither access granularity nor storage overhead is to be compromised, but there is freedom to implement a custom DIMM, the use of heterogeneous DRAMs within a single DIMM rank may be used. In this case, minimum access granularity can be maintained while still retaining a 12.5% storage overhead. With x16 parts, for instance, a minimum-sized rank would be four x16 DRAMs plus one x8 DRAM (note that the DRAMs are still commodity, just not the DIMM), providing a DIMM width of 72 bits. With a burst length of 8, each x16 DRAM supplies 128 bits and the x8 DRAM supplies 64 bits. These should be interpreted as (114 data + 14 LED) and (56 data + 8 LED) respectively. There would be no change to GEC overhead or operation.
[0044] Thus, there are several options to be varied, including the storage overhead, the importance of access granularity (typically a function of access locality in the workload), and the willingness to build heterogeneous DIMMs - as wide I/O parts such as x16 or x32 become mainstream due to their reduced power consumption. Examples described herein are flexible enough to be effective in designs with varying combinations and variations of these options.
[0045] Substantial power savings may be realized, compared to traditional chipkill mechanisms, through a reduction of both dynamic and static power. It is possible to activate the absolute minimum number of chips required to service a request, e.g., just nine x8 chips, for example, reading/writing exactly one 64-byte cache line in a standard 8-burst access. This is in contrast to conventional chipkill solutions that may cause forced prefetching and increase dynamic power consumption (e.g., by activating additional chips per read/write, accessing multiple cache lines per standard 8-burst access). Examples provided herein also may enable a reduction in activate power, because the size of the row buffer per chip may be constant, but fewer chips are being activated. Activation power also may be reduced going from x4 chips to x8 chips, because fewer chips make up a rank. The footprint of each activation also may be reduced, allowing unused rank/banks to transition into low-power modes, such as shallow low-power modes that can be entered into and exited from quickly.
[0046] In addition to the large energy advantage, reducing access granularity also has a positive effect on performance. For a given total number of chips in the system, there is increased rank-level and bank-level parallelism. This can reduce bank conflicts and overall average memory access latency. A fraction of this gain may be lost due to the extra writes to GEC lines required along with the regular writes. Despite this overhead, examples may still come out ahead, even without coalescing.
[0047] FIG. 5 is a flow chart 500 based on checking data fidelity according to an example. In block 510, local error detection (LED) information is computed per cache line segment of data associated with a rank of a memory, based on an error detection code. In block 520, a global error correction (GEC) is generated for the cache line segment based on a global parity. In block 530, data fidelity is checked in response to a memory read operation, based on the LED information, to identify a presence of an error and the location of the error among cache line segments of the rank. In block 540, the cache line segment having the error is corrected based on the GEC, in response to identifying the error.
[0048] FIG. 6 is a flow chart 600 based on performing error detection and/or correction according to an example. In block 610, a local error detection (LED) is performed in response to a memory read operation, based on a checksum computed over a cache line segment, to detect a location of an error at a chip- granularity among N data chips in a rank. In block 620, a global error correction (GEC) is performed over the cache line segment on the N data chips in the rank in response to detecting the error, the GEC performed using a global parity to generate GEC information. In block 630, data segments having the error are reconstructed, based on error-free segments and the GEC information. In block 640, the GEC information is updated in response to a write operation. In block 650, a tiered parity is generated to protect the GEC information, wherein the tiered parity is stored on an Nth chip and is to be used to recover the GEC information based on GEC information segments from a plurality of chips.

Claims

WHAT IS CLAIMED IS:
1. A method, comprising:
computing local error detection (LED) information per cache line segment of data associated with a rank of a memory, based on an error detection code;
generating a global error correction (GEC) for the cache line segment based on an error correction code;
checking data fidelity in response to a memory read operation, based on the LED information, to identify a presence of an error and the location of the error among cache line segments of the rank; and
correcting the cache line segment having the error based on the GEC, in response to identifying the error.
2. The method of claim 1, further comprising coalescing a plurality of GEC updates, associated with adjacent cache lines, to be sent together.
3. The method of claim 1, wherein the error detection code is to identify the presence of an error and the location of the error within a cache line segment of the rank.
4. The method of claim 1, further comprising storing the LED information and the GEC computed for the cache line segment at the cache line segment associated with the data.
5. The method of claim 1, further comprising storing the LED information and the GEC computed for the cache line segment in a memory row different from a memory row associated with the data.
6. A method, comprising:
performing a local error detection (LED) in response to a memory read operation, based on an error detection code computed over a cache line segment, to detect a location of an error at a chip-granularity among N data chips in a rank;
performing a global error correction (GEC) over the cache line segment on the N data chips in the rank in response to detecting the error, the GEC based on an error correction code to generate GEC information; and
reconstructing data segments having the error, based on error-free segments and the GEC information.
7. The method of claim 6, wherein a plurality of GEC updates to adjacent cache lines are coalesced and sent together.
8. The method of claim 6, wherein the error correction code is based on N cache line segments.
9. The method of claim 6, further comprising updating the GEC information in response to a write operation.
10. The method of claim 6, further comprising storing the GEC information in a row buffer of the corresponding cache line, in a reserved region in each of the N chips.
11. The method of claim 6, further comprising storing data and corresponding LED information on each chip of the rank, based on a physical data mapping policy; and providing the data and LED information in response to a cache line access request.
12. The method of claim 6, further comprising generating a tiered error correction code to protect the GEC information, wherein the tiered error correction code is stored on an Nth chip and is to be used to recover the GEC information based on GEC information segments from a plurality of chips.
13. The method of claim 12, further comprising identifying an uncorrectable double-chip failure, based on detecting, during a GEC phase, an error in the GEC row of interest based on the tiered error correction code.
14. A memory controller to:
verify data fidelity, in response to a read operation, based on local error detection (LED) information for a cache line segment of data associated with a rank of a memory;
identify a presence and a location of an error among cache line segments of the rank according to the LED information;
generate a global error correction (GEC) for the cache line segment based on an error correction code; and
correct the cache line segment having the error based on the GEC, in response to identifying the error.
15. The memory controller of claim 14, wherein the LED and GEC information is mapped according to firmware information associated with the memory controller.
PCT/US2012/040108 2012-05-31 2012-05-31 Local error detection and global error correction WO2013180714A1 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
US14/396,327 US9600359B2 (en) 2012-05-31 2012-05-31 Local error detection and global error correction
PCT/US2012/040108 WO2013180714A1 (en) 2012-05-31 2012-05-31 Local error detection and global error correction
KR1020147030518A KR101684045B1 (en) 2012-05-31 2012-05-31 Local error detection and global error correction
CN201280072858.8A CN104246898B (en) 2012-05-31 2012-05-31 local error detection and global error correction
EP12877868.5A EP2856471A4 (en) 2012-05-31 2012-05-31 Local error detection and global error correction
TW102117744A TWI501251B (en) 2012-05-31 2013-05-20 Local error detection and global error correction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2012/040108 WO2013180714A1 (en) 2012-05-31 2012-05-31 Local error detection and global error correction

Publications (1)

Publication Number Publication Date
WO2013180714A1 true WO2013180714A1 (en) 2013-12-05

Family

ID=49673762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2012/040108 WO2013180714A1 (en) 2012-05-31 2012-05-31 Local error detection and global error correction

Country Status (6)

Country Link
US (1) US9600359B2 (en)
EP (1) EP2856471A4 (en)
KR (1) KR101684045B1 (en)
CN (1) CN104246898B (en)
TW (1) TWI501251B (en)
WO (1) WO2013180714A1 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2915045B1 (en) * 2012-11-02 2019-01-02 Hewlett-Packard Enterprise Development LP Selective error correcting code and memory access granularity switching
JP6140093B2 (en) * 2014-03-18 2017-05-31 株式会社東芝 Cache memory, error correction circuit, and processor system
US9600189B2 (en) 2014-06-11 2017-03-21 International Business Machines Corporation Bank-level fault management in a memory system
KR102131337B1 (en) * 2014-10-20 2020-07-07 한국전자통신연구원 Cache memory with fault tolerance
US10467092B2 (en) * 2016-03-30 2019-11-05 Qualcomm Incorporated Providing space-efficient storage for dynamic random access memory (DRAM) cache tags
CN109074851B (en) 2016-05-02 2023-09-22 英特尔公司 Internal Error Checksum Correction (ECC) utilizing additional system bits
US10268541B2 (en) 2016-08-15 2019-04-23 Samsung Electronics Co., Ltd. DRAM assist error correction mechanism for DDR SDRAM interface
US10769540B2 (en) * 2017-04-27 2020-09-08 Hewlett Packard Enterprise Development Lp Rare event prediction
KR101934204B1 (en) * 2017-07-28 2018-12-31 한양대학교 산학협력단 Erasure Coding Method and Apparatus for Data Storage
US10372535B2 (en) * 2017-08-29 2019-08-06 Winbond Electronics Corp. Encoding method and a memory storage apparatus using the same
US10606692B2 (en) 2017-12-20 2020-03-31 International Business Machines Corporation Error correction potency improvement via added burst beats in a dram access cycle
US11010234B2 (en) 2019-02-01 2021-05-18 Winbond Electronics Corp. Memory device and error detection method thereof
CN113424262B (en) * 2019-03-21 2024-01-02 华为技术有限公司 Storage verification method and device
KR20200117129A (en) * 2019-04-03 2020-10-14 삼성전자주식회사 Semiconductor memory device, and memory system having the same
US20210306006A1 (en) * 2019-09-23 2021-09-30 SK Hynix Inc. Processing-in-memory (pim) devices
JP7018089B2 (en) * 2020-04-02 2022-02-09 ウィンボンド エレクトロニクス コーポレーション Semiconductor storage device and readout method
US11301325B2 (en) * 2020-05-29 2022-04-12 Intel Corporation Memory in integrity performance enhancement systems and methods
US11640336B2 (en) * 2020-07-24 2023-05-02 Seagate Technology Llc Fast cache with intelligent copyback
US20220207190A1 (en) * 2020-12-26 2022-06-30 Intel Corporation Low overhead memory integrity with error correction capabilities
JP2022137811A (en) * 2021-03-09 2022-09-22 キオクシア株式会社 Information processing system, storage device, and host
JP7253594B2 (en) * 2021-08-27 2023-04-06 ウィンボンド エレクトロニクス コーポレーション semiconductor storage device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4875212A (en) * 1985-10-08 1989-10-17 Texas Instruments Incorporated Memory device with integrated error detection and correction
US20010014039A1 (en) * 1999-04-05 2001-08-16 Michael L. Longwell Memory tile for use in a tiled memory
US20050015649A1 (en) * 2003-06-27 2005-01-20 International Business Machines Corp. Method and system for correcting errors in a memory device
US20050172207A1 (en) * 2004-01-30 2005-08-04 Radke William H. Error detection and correction scheme for a memory device
US20090006886A1 (en) * 2007-06-28 2009-01-01 International Business Machines Corporation System and method for error correction and detection in a memory system
EP2261806A1 (en) 2008-02-28 2010-12-15 Fujitsu Limited Storage device, storage controller, data transfer integrated circuit, and method of controlling storage

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6304992B1 (en) * 1998-09-24 2001-10-16 Sun Microsystems, Inc. Technique for correcting single-bit errors in caches with sub-block parity bits
US6249475B1 (en) 1999-04-05 2001-06-19 Madrone Solutions, Inc. Method for designing a tiled memory
US20020069318A1 (en) * 2000-12-01 2002-06-06 Chow Yan Chiew Real time application accelerator and method of operating the same
CA2447204C (en) 2002-11-29 2010-03-23 Memory Management Services Ltd. Error correction scheme for memory
US20040225944A1 (en) * 2003-05-09 2004-11-11 Brueggen Christopher M. Systems and methods for processing an error correction code word for storage in memory components
US7149945B2 (en) 2003-05-09 2006-12-12 Hewlett-Packard Development Company, L.P. Systems and methods for providing error correction code testing functionality
JP2005293728A (en) 2004-03-31 2005-10-20 Toshiba Corp Semiconductor memory device
US7308638B2 (en) * 2004-06-29 2007-12-11 Hewlett-Packard Development Company, L.P. System and method for controlling application of an error correction code (ECC) algorithm in a memory subsystem
US7437651B2 (en) * 2004-06-29 2008-10-14 Hewlett-Packard Development Company, L.P. System and method for controlling application of an error correction code (ECC) algorithm in a memory subsystem
US20060143551A1 (en) 2004-12-29 2006-06-29 Intel Corporation Localizing error detection and recovery
US20060236035A1 (en) * 2005-02-18 2006-10-19 Jeff Barlow Systems and methods for CPU repair
US8055982B2 (en) 2007-02-21 2011-11-08 Sigmatel, Inc. Error correction system and method
US8041989B2 (en) 2007-06-28 2011-10-18 International Business Machines Corporation System and method for providing a high fault tolerant memory system
US7747903B2 (en) 2007-07-09 2010-06-29 Micron Technology, Inc. Error correction for memory
US8176391B2 (en) * 2008-01-31 2012-05-08 International Business Machines Corporation System to improve miscorrection rates in error control code through buffering and associated methods
KR20100012605A (en) 2008-07-29 2010-02-08 삼성전자주식회사 Non-volatile memory device and method for program using ecc
US8321758B2 (en) * 2008-08-05 2012-11-27 Advanced Micro Devices, Inc. Data error correction device and methods thereof
US8086783B2 (en) 2009-02-23 2011-12-27 International Business Machines Corporation High availability memory system
US7856528B1 (en) * 2009-08-11 2010-12-21 Texas Memory Systems, Inc. Method and apparatus for protecting data using variable size page stripes in a FLASH-based storage system
JP4940322B2 (en) * 2010-03-16 2012-05-30 株式会社東芝 Semiconductor memory video storage / playback apparatus and data writing / reading method
US8762813B2 (en) 2010-05-17 2014-06-24 Skymedi Corporation Configurable coding system and method of multiple ECCS
US8775868B2 (en) * 2010-09-28 2014-07-08 Pure Storage, Inc. Adaptive RAID for an SSD environment
US8640006B2 (en) * 2011-06-29 2014-01-28 International Business Machines Corporation Preemptive memory repair based on multi-symbol, multi-scrub cycle analysis


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2856471A4

Also Published As

Publication number Publication date
US9600359B2 (en) 2017-03-21
EP2856471A4 (en) 2015-11-18
CN104246898B (en) 2017-03-22
KR20140140632A (en) 2014-12-09
EP2856471A1 (en) 2015-04-08
US20150082122A1 (en) 2015-03-19
TW201407629A (en) 2014-02-16
KR101684045B1 (en) 2016-12-07
TWI501251B (en) 2015-09-21
CN104246898A (en) 2014-12-24

Similar Documents

Publication Publication Date Title
US9600359B2 (en) Local error detection and global error correction
CN109426583B (en) Running RAID parity computation
US8086783B2 (en) High availability memory system
Udipi et al. LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems
US9754684B2 (en) Completely utilizing hamming distance for SECDED based ECC DIMMs
US9898365B2 (en) Global error correction
US9183078B1 (en) Providing error checking and correcting (ECC) capability for memory
US8874979B2 (en) Three dimensional(3D) memory device sparing
US20140063983A1 (en) Error Detection And Correction In A Memory System
US10193576B2 (en) Memory system and memory device
US20130339820A1 (en) Three dimensional (3d) memory device sparing
Mittal et al. A survey of techniques for improving error-resilience of DRAM
US20210359704A1 (en) Memory-mapped two-dimensional error correction code for multi-bit error tolerance in dram
US20040225944A1 (en) Systems and methods for processing an error correction code word for storage in memory components
Kwon et al. Understanding ddr4 in pursuit of in-dram ecc
US20160139988A1 (en) Memory unit
US9106260B2 (en) Parity data management for a memory architecture
Jian et al. High performance, energy efficient chipkill correct memory with multidimensional parity
US20160147598A1 (en) Operating a memory unit
US11994946B2 (en) Memory bank protection
Sim et al. A configurable and strong RAS solution for die-stacked DRAM caches
Wu et al. Redundant Array of Independent Memory Devices
CN116954982A (en) Data writing method and processing system
Wu et al. ECC TECHNIQUES FOR ENABLING DRAM CACHES WITH OFF-CHIP TAG ARRAYS.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12877868

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2012877868

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 14396327

Country of ref document: US

ENP Entry into the national phase

Ref document number: 20147030518

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE