Error-correcting code memory keeps single-bit errors at bay
Errant Bits
Data protection and checking occur in various places throughout a system. Some of this happens in hardware and some in software. The goal is to ensure that data is not corrupted (changed), whether it is coming from or going to the hardware or moving through the software stack. One key technology is ECC memory (error-correcting code memory) [1].
The standard ECC memory used in systems today can detect and correct what are called single-bit errors, and although it can detect double-bit errors, it cannot correct them. A simple flip of one bit in a byte can make a drastic difference in the value of the byte. For example, a byte (8 bits) with a value of 156 (10011100) that is read from a file on disk suddenly acquires a value of 220 if the second bit from the left is flipped from a 0 to a 1 (11011100) for some reason.
ECC memory can detect the problem and correct it without the user ever being aware. Note, however, that only one bit in the byte has been changed and then corrected. If two bits change – perhaps both the second and seventh from the left – the byte is now 11011110 (i.e., 222). Typical ECC memory can detect that this "double-bit" error occurred, but it cannot correct it. In fact, when a double-bit error happens, memory should cause what is called a machine check exception (MCE), which should cause the system to crash.
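Both flips in this example are easy to reproduce with shell arithmetic: flipping a bit is just an XOR with a mask that has only that bit set (here, 64 for the second bit from the left in an 8-bit byte and 2 for the seventh):

```shell
# Flip the second bit from the left (mask 64) in 156 (10011100):
echo $(( 156 ^ 64 ))       # single-bit flip: 11011100 = 220

# Flip the seventh bit from the left (mask 2) as well:
echo $(( 156 ^ 64 ^ 2 ))   # double-bit flip: 11011110 = 222
```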
After all, you are using ECC memory, so ensuring that the data is correct is important; if an uncorrectable memory error occurs, you would probably want the system to stop. Bit flips usually originate in some sort of electrical or magnetic interference inside the system.
This interference can cause a bit to flip at seemingly random times, depending on the circumstances. According to a Wikipedia article [1] and a paper on single-event upsets in RAM [2], most single-bit flips are the result of background radiation – primarily neutrons from cosmic rays.
The same Wikipedia article notes that the error rates reported from 2007 to 2009 varied all over the map, ranging from 10^-10 to 10^-17 errors/bit-hr (seven orders of magnitude difference). The lower number works out to roughly one error per gigabyte of memory per hour; the upper number, to roughly one error per gigabyte of memory per millennium.
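The conversion behind those per-gigabyte figures is easy to check with a little awk (1GB = 8 x 2^30 bits):

```shell
# Back-of-the-envelope conversion from errors/bit-hr to errors/GB-hr
# for both ends of the reported range.
awk 'BEGIN {
    bits = 8 * 2^30                                    # bits in 1GB
    printf "%.2f errors/GB-hr at 1e-10\n", 1e-10 * bits
    printf "%.2e errors/GB-hr at 1e-17\n", 1e-17 * bits
}'
```

At the low end, that is a little under one error per gigabyte per hour; at the high end, roughly one error per gigabyte every 1,300 years.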
The observations in a study by Google (see the "Correctable Errors" box) indicate that in real-world production, you see much higher error rates than what manufacturers are reporting. Moreover, the rate of correctable errors can be an important factor in watching for memory failure. Consequently, I think monitoring and capturing the correctable error information is very important.
Linux and Memory Errors
When I worked for Linux Networx years ago, the company was helping with a project called bluesmoke [4]. The idea was to have a kernel module that could catch and report hardware-related errors within the system. This goes beyond just memory errors and includes hardware errors in the cache, DMA engines, fabric switches, thermal throttling, the HyperTransport bus, and so on. The formal name of the project was EDAC [5], Error Detection and Correction.
For many years, people wrote EDAC kernel modules for various chipsets so they could capture and report hardware-related error information. Initially, this work was done outside of the kernel, but starting with kernel 2.6.16 (released March 20, 2006), EDAC was included in the kernel. Starting with kernel 2.6.18, EDAC showed up in the /sys filesystem, typically in /sys/devices/system/edac.
One of the best sources of information about EDAC is the EDAC wiki [6]. The page discusses how to get started and is also a good collection of EDAC resources (bugs, FAQs, mailing list, etc.).
Rather than focusing on getting EDAC working, however, I want to focus on what information it can provide and why it is important. I'll be using a Dell PowerEdge R720 as an example system. It has two processors (Intel E5-2600 series) and 128GB of ECC memory. It was running CentOS 6.2 during the tests.
For the test system, I checked with lsmod to see whether any EDAC modules were loaded:
login2$ /sbin/lsmod
...
sb_edac                12898  0
edac_core              46773  3 sb_edac
...
EDAC was loaded as a module, so I examined the directory /sys/devices/system/edac:
login2$ ls -s /sys/devices/system/edac/
total 0
0 mc
Because I can only see the mc devices, EDAC is only monitoring the memory controller(s). If I probe a little further,
login2$ ls -s /sys/devices/system/edac/mc
total 0
0 mc0  0 mc1
I find two EDAC components for this system, mc0 and mc1. Listing 1 shows what I see when I peer into mc0. The two memory controllers for this system control a number of DIMMs, which are laid out in a "chip-select" row (csrow) and a channel table (chx) (see the EDAC documentation for more details [7]). You can have multiple csrow values and multiple channels. For example, Listing 2 shows a simple ASCII sketch of two csrows and two channels.
Listing 1: Content of a Memory Controller
login2$ ls -s /sys/devices/system/edac/mc/mc0
total 0
0 ce_count         0 csrow1  0 csrow4  0 csrow7   0 reset_counters       0 size_mb
0 ce_noinfo_count  0 csrow2  0 csrow5  0 device   0 sdram_scrub_rate     0 ue_count
0 csrow0           0 csrow3  0 csrow6  0 mc_name  0 seconds_since_reset  0 ue_noinfo_count
Listing 2: csrows and Channels
           Channel 0   Channel 1
  ==============================
  csrow0 | DIMM_A0   | DIMM_B0 |
  csrow1 | DIMM_A0   | DIMM_B0 |
  ==============================

  ==============================
  csrow2 | DIMM_A1   | DIMM_B1 |
  csrow3 | DIMM_A1   | DIMM_B1 |
  ==============================
The number of csrows depends on the electrical loading of a given motherboard and the memory controller and DIMM characteristics.
An Example
For this example node, each memory controller has eight csrows and one channel table. You can get an idea of the layout by looking at the entries for csrowX (X = 0 to 7) in Listing 3.
Listing 3: Memory Controller Layout
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/ch0_dimm_label
CPU_SrcID#0_Channel#0_DIMM#0
login2$ more /sys/devices/system/edac/mc/mc0/csrow1/ch0_dimm_label
CPU_SrcID#0_Channel#0_DIMM#1
login2$ more /sys/devices/system/edac/mc/mc0/csrow2/ch0_dimm_label
CPU_SrcID#0_Channel#1_DIMM#0
login2$ more /sys/devices/system/edac/mc/mc0/csrow3/ch0_dimm_label
CPU_SrcID#0_Channel#1_DIMM#1
login2$ more /sys/devices/system/edac/mc/mc0/csrow4/ch0_dimm_label
CPU_SrcID#0_Channel#2_DIMM#0
login2$ more /sys/devices/system/edac/mc/mc0/csrow5/ch0_dimm_label
CPU_SrcID#0_Channel#2_DIMM#1
login2$ more /sys/devices/system/edac/mc/mc0/csrow6/ch0_dimm_label
CPU_SrcID#0_Channel#3_DIMM#0
login2$ more /sys/devices/system/edac/mc/mc0/csrow7/ch0_dimm_label
CPU_SrcID#0_Channel#3_DIMM#1
This information shows four memory channels (0-3) and two DIMMs for each channel (0 and 1). Note that each csrow subdirectory for each memory controller has several EDAC control and attribute files for that csrow.
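Rather than running more against each label file by hand, a short loop collects all of the labels at once. This is just a sketch: the function takes the memory controller directory as an argument (defaulting to mc0) so it can be pointed at any controller:

```shell
# Print the DIMM label of every csrow under one memory controller.
print_dimm_labels() {
    local mc=${1:-/sys/devices/system/edac/mc/mc0}
    local f
    for f in "$mc"/csrow*/ch0_dimm_label; do
        [ -r "$f" ] || continue
        printf '%s: %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
    done
}

print_dimm_labels
```

On a system without EDAC loaded, the loop simply prints nothing.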
For example, the output for mc0/csrow0 (Listing 4) shows that all are files (no further subdirectories). The definition of each file is:
Listing 4: mc0/csrow0
login2$ ls -s /sys/devices/system/edac/mc/mc0/csrow0
total 0
0 ce_count      0 ch0_dimm_label  0 edac_mode  0 size_mb
0 ch0_ce_count  0 dev_type        0 mem_type   0 ue_count
- ce_count: The total count of correctable errors that have occurred on this csrow (attribute file).
- ch0_ce_count: The total count of correctable errors on this DIMM in channel 0 (attribute file).
- ch0_dimm_label: The control file that labels this DIMM. This can be very useful during panic events for isolating the cause of an uncorrectable error. Note that DIMM labels must be assigned after booting, with information that correctly matches the physical slot to the silk screen label on the board itself.
- dev_type: An attribute file that displays the type of DRAM device being used on this DIMM. Typically this is x1, x2, x4, or x8.
- edac_mode: An attribute file that displays the type of error detection and correction being utilized.
- mem_type: An attribute file that displays the type of memory currently on a csrow.
- size_mb: An attribute file that contains the size (MB) of memory a csrow contains.
- ue_count: An attribute file that contains the total number of uncorrectable errors that have occurred on a csrow.
For the sample system, the values for the attribute and control files are shown in Listing 5. This particular csrow has an 8GB DDR3 unbuffered DIMM with no correctable or uncorrectable errors.
Listing 5: Sample System csrow0 Values
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/ce_count
0
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count
0
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/ch0_dimm_label
CPU_SrcID#0_Channel#0_DIMM#0
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/dev_type
x8
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/edac_mode
S4ECD4ED
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/mem_type
Unbuffered-DDR3
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/size_mb
8192
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/ue_count
0
Some attribute files in /sys/devices/system/edac/mc/mc0/ can be very useful (Listing 6). As with the csrow files, note the control and attribute files, which, although similar to those for the csrows, are for the entire memory controller:
Listing 6: Attribute Files for mc0
login2$ ls -s /sys/devices/system/edac/mc/mc0
total 0
0 ce_count         0 csrow1  0 csrow4  0 csrow7   0 reset_counters       0 size_mb
0 ce_noinfo_count  0 csrow2  0 csrow5  0 device   0 sdram_scrub_rate     0 ue_count
0 csrow0           0 csrow3  0 csrow6  0 mc_name  0 seconds_since_reset  0 ue_noinfo_count
- ce_count: The total count of correctable errors that have occurred on this memory controller (attribute file).
- ce_noinfo_count: The total count of correctable errors on this memory controller for which no information is available about which DIMM slot is experiencing errors (attribute file).
- mc_name: The type of memory controller being utilized (attribute file).
- reset_counters: A write-only control file that zeroes out all of the statistical counters for correctable and uncorrectable errors on this memory controller and resets the timer indicating how long it has been since the last reset (counter zero). The basic command is

echo <anything> > /sys/devices/system/edac/mc/mc0/reset_counters

where <anything> is literally anything (just use a 0 to make things easy).
- sdram_scrub_rate: An attribute file that controls memory scrubbing. The scrubbing rate is set by writing a minimum bandwidth in bytes per second to the attribute file; the driver translates this to an internal value. If the configuration fails or memory scrubbing is not implemented, the value of the attribute file will be -1.
- seconds_since_reset: An attribute file that displays how many seconds have elapsed since the last counter reset. This can be used with the error counters to measure error rates.
- size_mb: An attribute file that contains the size (MB) of memory that this memory controller manages.
- ue_count: An attribute file that contains the total number of uncorrectable errors that have occurred on this memory controller.
- ue_noinfo_count: The total count of uncorrectable errors on this memory controller for which no information is available about which DIMM slot is experiencing errors (attribute file).
For the sample system, the values for the attribute and control files are shown in Listing 7. Notice that I can't read the reset_counters file because it is just a control file for resetting the memory error counters. However, it has been 27,759,752 seconds (7,711 hours or 321 days) since the counters were reset (basically, since the system was booted), and the memory controller is managing 64GB of memory, with no correctable errors (CEs) or uncorrectable errors (UEs) on the system.
Listing 7: Sample System mc0 Values
login2$ more /sys/devices/system/edac/mc/mc0/ce_count
0
login2$ more /sys/devices/system/edac/mc/mc0/ce_noinfo_count
0
login2$ more /sys/devices/system/edac/mc/mc0/mc_name
Sandy Bridge Socket#0
login2$ more /sys/devices/system/edac/mc/mc0/reset_counters
/sys/devices/system/edac/mc/mc0/reset_counters: Permission denied
login2$ more /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
login2$ more /sys/devices/system/edac/mc/mc0/seconds_since_reset
27759752
login2$ more /sys/devices/system/edac/mc/mc0/size_mb
65536
login2$ more /sys/devices/system/edac/mc/mc0/ue_count
0
login2$ more /sys/devices/system/edac/mc/mc0/ue_noinfo_count
0
Also notice that the system is using Sandy Bridge processors (mc_name). Recall that in newer processors, the memory controller is in the processor itself; consequently, each memory controller (mc) is listed by its processor socket.
System Administration Recommendations
The edac module in the sysfs filesystem (i.e., /sys/) has a huge amount of information about memory errors. Normally, you wouldn't expect memory errors, either correctable or uncorrectable, to occur very often. However, as a good administrator, you should periodically scan your systems for memory errors.
Writing a simple script to read the file attributes of the memory errors for a system's memory controllers is not difficult, and you can even store these in a simple database if you like.
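As a sketch of that idea, the following function appends one timestamped CSV row per memory controller. The base directory and logfile are arguments only to keep the sketch flexible, and the default logfile location is my own assumption:

```shell
# Append "epoch,controller,ce_count,ue_count" for each memory controller.
log_edac_counts() {
    local base=${1:-/sys/devices/system/edac/mc}
    local log=${2:-/var/log/edac_counts.csv}   # assumed location
    local mc
    for mc in "$base"/mc*; do
        [ -f "$mc/ce_count" ] || continue
        printf '%s,%s,%s,%s\n' "$(date +%s)" "${mc##*/}" \
            "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")" >> "$log"
    done
}
```

Call it from cron (e.g., log_edac_counts with no arguments) and the CSV file becomes a simple history of the error counters over time.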
I also found a Nagios plugin [8] that should allow you to check for memory errors, although I haven't tested it. It can be run as a simple script and gives you a bit of information; however, I prefer a simple script that tells me whether I have any problems and where they are, so I modified the original slightly (Listing 8).
Listing 8: Modified Nagios Plugin
#!/bin/bash
#
# Original script:
# https://bitbucket.org/darkfader/nagios/src/c9dbc15609d0/check_mk/edac/plugins/edac?at=default
#
# The best stop for all things EDAC is http://buttersideup.com/edacwiki/
# and the edac.txt in the kernel documentation.

# EDAC memory reporting
if [ -d /sys/devices/system/edac/mc ]; then
   # Iterate over all memory controllers
   for mc in /sys/devices/system/edac/mc/* ; do
      ue_total_count=0
      ce_total_count=0
      for csrow in $mc/csrow* ; do
         ue_count=$(cat $csrow/ue_count)
         ce_count=$(cat $csrow/ce_count)
         if [ "$ue_count" -gt 0 ]; then
            ue_total_count=$((ue_total_count + ue_count))
         fi
         if [ "$ce_count" -gt 0 ]; then
            ce_total_count=$((ce_total_count + ce_count))
         fi
      done
      echo " Uncorrectable error count is $ue_total_count on memory controller $mc "
      echo " Correctable error count is $ce_total_count on memory controller $mc "
   done
fi
I could have just queried the CE and UE memory error counts for the memory controller (mc) itself, but I chose to walk through the csrows so that the script can easily be modified to point to the specific DIMM reporting more than zero errors.
You can modify this script to return the UE and CE count values. High-performance computing people can also put this script into something like Ganglia [9] to track memory error counts. A simple cron job could run this script, although I don't think you would want to run it every minute. Running it once an hour at most or maybe once a day is more reasonable. If you start to notice the correctable error count climb slowly, you might want to run the script more often.
Notice that I didn't compute "error rates." Some vendors want to know this number, possibly to judge whether the memory is bad, but I will leave this exercise up to you.
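If you do want a number, one plausible definition is correctable errors per hour since the last counter reset, which you can compute from ce_count and seconds_since_reset. This is only a sketch and assumes the counters have been running for a while (seconds_since_reset > 0):

```shell
# Correctable-error rate (errors/hour) for one memory controller,
# computed from ce_count and seconds_since_reset; awk handles the
# floating-point division.
ce_rate() {
    local mc=${1:-/sys/devices/system/edac/mc/mc0}
    awk -v ce="$(cat "$mc/ce_count")" \
        -v s="$(cat "$mc/seconds_since_reset")" \
        'BEGIN { printf "%.6f\n", ce * 3600 / s }'
}

# Usage: ce_rate /sys/devices/system/edac/mc/mc0
```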
Finally, if you see a correctable error, it does not mean the memory DIMM is bad. However, if you see one, keep checking that DIMM, just in case. If the error count keeps rising, you might want to contact your system vendor. Vendors typically do not publish correctable or uncorrectable error rates, but you can call them and discuss what you are seeing on your system, because there might be a threshold at which they will replace the DIMM (they will usually discuss this with their memory vendor).