Memory scrubbing

From Wikipedia, the free encyclopedia

Memory scrubbing is the process of detecting and correcting bit errors in memory by using an error-correcting code (ECC).

Background: Soft errors

Because of the high integration density of modern computer memory, individual memory cells are vulnerable to cosmic rays and alpha particle emission. The errors caused by these phenomena are called soft errors (see the article on soft errors for details). They can affect DRAM-based as well as SRAM-based memory. Because modern computers, especially servers, are equipped with large amounts of memory, the probability that a soft error occurs somewhere in the system rises accordingly and may become a problem.

ECC and scrubbing

Memory can be equipped with ECC for error detection and correction. A typical ECC scheme corrects one wrong bit and detects two wrong bits per memory word (usually 64 bits). Scanning systematically through memory, which is called memory scrubbing, uses this capability to search for single-bit errors caused by soft errors. The memory controller reads data from memory, checks it against its ECC check bits and, if a bit has flipped, determines which one and writes the corrected data back to memory.
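The read, check, correct and write-back cycle can be illustrated with a small software model. The following is only a sketch under stated assumptions: it uses an extended Hamming code over 8 data bits instead of the 64-bit words of real hardware, memory is modelled as a Python list, and all names (`encode`, `check_and_correct`, `scrub`) are hypothetical and do not correspond to any real memory controller interface.

```python
# Toy single-error-correcting, double-error-detecting (SECDED) code.
# Assumption: 8 data bits per word; real ECC memory typically protects 64 bits.
PARITY_POSITIONS = (1, 2, 4, 8)   # Hamming parity bits sit at powers of two
CODE_LEN = 12                     # positions 1..12; position 0 = overall parity


def encode(data: int) -> int:
    """Return a 13-bit codeword for an 8-bit data value."""
    word, bit = 0, 0
    for pos in range(1, CODE_LEN + 1):
        if pos not in PARITY_POSITIONS:
            word |= ((data >> bit) & 1) << pos
            bit += 1
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, CODE_LEN + 1):
            if pos & p:
                parity ^= (word >> pos) & 1
        word |= parity << p
    # Overall parity bit makes the total number of set bits even.
    word |= bin(word).count("1") & 1
    return word


def extract(word: int) -> int:
    """Recover the 8 data bits from a codeword."""
    data, bit = 0, 0
    for pos in range(1, CODE_LEN + 1):
        if pos not in PARITY_POSITIONS:
            data |= ((word >> pos) & 1) << bit
            bit += 1
    return data


def check_and_correct(word: int) -> tuple[int, str]:
    """Return (possibly repaired word, status) for one stored codeword."""
    syndrome = 0
    for pos in range(1, CODE_LEN + 1):
        if (word >> pos) & 1:
            syndrome ^= pos
    odd_parity = bin(word).count("1") & 1
    if syndrome == 0 and not odd_parity:
        return word, "ok"
    if odd_parity and syndrome <= CODE_LEN:
        # Single-bit error: the syndrome is the position of the flipped bit
        # (0 means the overall parity bit itself flipped).
        return word ^ (1 << syndrome), "corrected"
    # Non-zero syndrome with even parity: two bits flipped, not correctable.
    return word, "uncorrectable"


def scrub(memory: list[int]) -> dict[str, int]:
    """One scrub pass: read every word, repair it in place if possible."""
    stats = {"ok": 0, "corrected": 0, "uncorrectable": 0}
    for addr, word in enumerate(memory):
        fixed, status = check_and_correct(word)
        if status == "corrected":
            memory[addr] = fixed      # write the corrected data back
        stats[status] += 1
    return stats
```

Calling `scrub` on a list of encoded words after flipping one bit in a word repairs that word in place, while a word with two flipped bits is only reported as uncorrectable.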

Operation

To avoid disturbing regular memory requests from the CPU and thus degrading performance, scrubbing is usually done only while the memory is idle. Because scrubbing consists of normal read and write operations, it may increase the power consumption of the memory compared to operation without scrubbing. Scrubbing is therefore performed periodically rather than continuously. On many server boards, the scrub period can be configured in the BIOS setup program.
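The idea of scrubbing in short bursts during idle periods, while still covering all of memory within a configured period, might be modelled as follows. This is a hedged sketch, not a description of any particular controller: the class `IdleScrubber`, the callback `scrub_word` and the helper `required_rate` are hypothetical names introduced for illustration only.

```python
from typing import Callable


class IdleScrubber:
    """Walks an address range in small bursts, one burst per idle period."""

    def __init__(self, num_words: int, burst_words: int) -> None:
        self.num_words = num_words
        self.burst_words = burst_words
        self.next_addr = 0            # where the next burst resumes
        self.passes_completed = 0     # full sweeps over the address range

    def on_idle(self, scrub_word: Callable[[int], None]) -> None:
        """Scrub one burst; scrub_word(addr) checks and repairs one word."""
        for _ in range(min(self.burst_words, self.num_words)):
            scrub_word(self.next_addr)
            self.next_addr += 1
            if self.next_addr == self.num_words:
                self.next_addr = 0
                self.passes_completed += 1


def required_rate(mem_bytes: int, period_seconds: float) -> float:
    """Bytes per second that must be scrubbed to cover memory once per period."""
    return mem_bytes / period_seconds
```

A longer scrub period lowers the required rate and therefore the extra memory traffic and power, at the cost of leaving each location unchecked for longer.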

A single soft error in a memory word can be corrected without any problems, but more than one error within the same word is generally not correctable. It is therefore important to check every memory location periodically, before a second error can accumulate. Normal memory reads issued by the CPU or DMA devices are checked for ECC errors anyway, but because of data locality they can be confined to a small range of addresses, leaving other memory locations untouched for a very long time. These locations can accumulate more than one soft error, whereas scrubbing ensures that the whole memory is checked within a guaranteed time.
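Why the time since the last check matters can be made concrete with a simple model. The sketch below assumes that soft errors hit each bit independently at a constant rate, so that the number of errors in a word follows a Poisson distribution. This is a modelling assumption rather than a measured property of any device, and the function name, the parameters and the 72-bit default (64 data bits plus 8 check bits) are illustrative.

```python
import math


def p_uncorrectable(rate_per_bit_hour: float,
                    hours_since_check: float,
                    word_bits: int = 72) -> float:
    """Probability that a word has collected two or more bit errors.

    Assumes independent errors at a constant per-bit rate (Poisson model).
    """
    # Expected number of errors in the word since it was last checked.
    m = rate_per_bit_hour * word_bits * hours_since_check
    # P(N >= 2) = 1 - exp(-m) * (1 + m), written with expm1 for small m.
    return -math.expm1(-m) - m * math.exp(-m)
```

For small error rates this probability grows roughly with the square of the time since the last check, so under this model halving the scrub period cuts the chance of an uncorrectable word to about a quarter.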

On some systems, not only main memory (DRAM-based) but also CPU caches (SRAM-based) can be scrubbed. On most such systems the scrubbing rates for the two can be set independently. Because a cache is much smaller than main memory, cache scrubbing does not need to happen as frequently.

Memory scrubbing increases reliability and can therefore be classified as a RAS (reliability, availability, serviceability) feature.

See also