Nuts and Bolts: Tracking Memory Errors
Lead image © lightwise, 123RF.com

Finding and recording memory errors

Amnesia

Memory errors are a silent killer of high-performance computers, but you can find and track these stealthy assassins. By Jeff Layton

A recent article in IEEE Spectrum [1] by Al Geist, titled "How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder," reviewed some of the major ways a supercomputer can be killed. The first subject the author discussed was how cosmic rays can cause memory errors, both correctable and uncorrectable. To protect against some of these errors, ECC (error-correcting code) memory [2] can be used.

The general ECC memory used in systems today can detect and correct single-bit errors (changes to a single bit). For example, assume a byte with a value of 156 (10011100) is read from a file on disk; if the second bit from the left is flipped from a 0 to a 1 (11011100), the number becomes 220. A simple flip of one bit in a byte can make a drastic difference in its value. Fortunately, ECC memory can detect and correct the bit flip, so the user does not notice.

Current ECC memory can also detect a double-bit flip, but it cannot correct it. When a double-bit error happens, the memory should raise a machine check exception (MCE) [3], which should crash the system. The corrupted data could be application data, application instructions, or part of the operating system. Rather than risk running with any of these corrupted, the system rightly crashes, indicating the error as best it can.

The Wikipedia article on ECC states that most single-bit flips are due to background radiation, primarily neutrons from cosmic rays. The article reports that error rates from 2007 to 2009 varied quite a bit, ranging from 10^-10 to 10^-17 errors/bit-hour, a difference of seven orders of magnitude. The upper number is just about one error per gigabyte of memory per hour. The lower number indicates roughly one error every 1,000 years per gigabyte of memory.
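The arithmetic behind those two endpoints can be checked with a few lines of Python. This is only a sketch: it works per gigabyte of memory (taken here as 8 x 10^9 bits) and assumes a year of 8,760 hours, and the function and constant names are made up for illustration.

```python
# Convert an error rate in errors per bit-hour into more intuitive units.
# Assumptions (not from the article): 1 GB = 8e9 bits, 1 year = 8,760 hours.
BITS_PER_GB = 8e9
HOURS_PER_YEAR = 24 * 365


def errors_per_gb_hour(rate_per_bit_hour):
    """Expected number of errors per gigabyte of memory per hour."""
    return rate_per_bit_hour * BITS_PER_GB


high = errors_per_gb_hour(1e-10)  # upper end of the reported range
low = errors_per_gb_hour(1e-17)   # lower end of the reported range

# Invert the low rate to get the expected number of years between errors
years_between_errors = 1 / (low * HOURS_PER_YEAR)

print(f"high: {high:.1f} errors per GB per hour")
print(f"low:  one error per GB every {years_between_errors:,.0f} years")
```

The high end works out to a little under one error per gigabyte per hour, and the low end to somewhat more than a thousand years between errors per gigabyte.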

A Linux kernel module called EDAC [4], which stands for error detection and correction, can report ECC memory errors and corrections. EDAC can capture and report error information for hardware errors in the memory or cache, direct memory access (DMA), fabric switches, thermal throttling, HyperTransport bus, and others. One of the best sources of information about EDAC is the EDAC wiki [5].

Important Considerations

Monitoring ECC errors and corrections is an important task for system administrators of production systems. Rather than monitor, log, and perhaps raise alarms on the absolute number of ECC errors or corrections, you should monitor the rate of change of errors and corrections. The unit of measure is up to the system administrator, but the unit reported in the Wikipedia article is errors per gigabyte of memory per hour.

In a previous article [6], I wrote a general introduction to ECC memory, specifically about Linux and memory errors, and how to collect correctable and uncorrectable errors. In typical systems, such as the one examined in the article, you can have more than one memory controller. The example was a two-socket system with two memory controllers, mc0 and mc1. You can get this information with the command:

$ ls -s /sys/devices/system/edac/mc

The memory associated with each memory controller is organized in physical DIMMs, which are laid out in a table of "chip-select" rows (csrow) and channels.

According to the kernel documentation for EDAC [7], typical memory has eight csrows, but it really depends on the layout of the motherboard, the memory controller, and the DIMM characteristics. The number of csrows can be found by examining the /sys entries for a memory controller. For example:

$ ls -s /sys/devices/system/edac/mc/mc0

The elements labeled csrow<X> (where <X> is an integer) are counted to determine the number of csrows for the memory controller (Listing 1). In this case, I had two memory channels per controller and four DIMMs per channel, for a total of eight csrow entries (csrow0 to csrow7).

Listing 1: Attribute Files for mc0

$ ls -s /sys/devices/system/edac/mc/mc0
total 0
0 ce_count         0 csrow1  0 csrow4  0 csrow7   0 reset_counters       0 size_mb
0 ce_noinfo_count  0 csrow2  0 csrow5  0 device   0 sdram_scrub_rate     0 ue_count
0 csrow0           0 csrow3  0 csrow6  0 mc_name  0 seconds_since_reset  0 ue_noinfo_count
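Counting the csrow<X> entries can also be scripted. The following Python sketch is one possible approach, not part of the original tools: the helper names are hypothetical, and the default path simply reuses the mc0 directory shown above.

```python
import os
import re

# Matches names such as csrow0 or csrow12 and captures the number
CSROW_RE = re.compile(r"^csrow(\d+)$")


def csrow_names(entries):
    """Return the csrow<X> names from a directory listing, in numeric order."""
    found = [e for e in entries if CSROW_RE.match(e)]
    return sorted(found, key=lambda e: int(CSROW_RE.match(e).group(1)))


def count_csrows(mc_path="/sys/devices/system/edac/mc/mc0"):
    """Count the csrow<X> entries under one memory controller directory."""
    return len(csrow_names(os.listdir(mc_path)))


# The entries shown in Listing 1, used here instead of a live /sys tree
listing_1 = [
    "ce_count", "ce_noinfo_count", "csrow0", "csrow1", "csrow2", "csrow3",
    "csrow4", "csrow5", "csrow6", "csrow7", "device", "mc_name",
    "reset_counters", "sdram_scrub_rate", "seconds_since_reset", "size_mb",
    "ue_count", "ue_noinfo_count",
]
print(len(csrow_names(listing_1)), "csrows:", csrow_names(listing_1))
```

Sorting on the numeric suffix keeps csrow10 after csrow9 on controllers with more than 10 rows, which a plain alphabetical sort would not.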

The /sys filesystem entry for each csrow contains a number of files with information about the specific DIMM. Listing 2 shows the csrow0 attributes. Each of these entries either holds information that can be used for monitoring or is a control file. Listing 3 shows the values of some of the mc0 attributes from Listing 1 on the test system. Note that reset_counters is a control file that, unsurprisingly, lets you reset the counters.

Listing 2: Content of csrow0

$ ls -s /sys/devices/system/edac/mc/mc0/csrow0
total 0
0 ce_count      0 ch0_dimm_label  0 edac_mode  0 size_mb
0 ch0_ce_count  0 dev_type        0 mem_type   0 ue_count

Listing 3: Attribute Values of mc0

$ more /sys/devices/system/edac/mc/mc0/ce_count
0
$ more /sys/devices/system/edac/mc/mc0/ce_noinfo_count
0
$ more /sys/devices/system/edac/mc/mc0/mc_name
Sandy Bridge Socket#0
$ more /sys/devices/system/edac/mc/mc0/reset_counters
/sys/devices/system/edac/mc/mc0/reset_counters: Permission denied
$ more /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
$ more /sys/devices/system/edac/mc/mc0/seconds_since_reset
27759752
$ more /sys/devices/system/edac/mc/mc0/size_mb
65536
$ more /sys/devices/system/edac/mc/mc0/ue_count
0
$ more /sys/devices/system/edac/mc/mc0/ue_noinfo_count
0

Three of the entries from Listing 3 are key to a system administrator:

size_mb: The amount of memory (MB) this memory controller manages (attribute file).
ce_count: The total count of correctable errors that have occurred on this memory controller (attribute file).
ue_count: The total number of uncorrectable errors that have occurred on this memory controller (attribute file).

From this information, you can compute both the correctable (ce_) and the uncorrectable (ue_) error rates.
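As a rough illustration of that computation, the sketch below turns two readings of a counter into a rate in errors per gigabyte of memory per hour. The function name and its handling of a counter reset are assumptions made for this example; they are not taken from EDAC or from the tools described later.

```python
def error_rate_per_gb_hour(count_start, count_end, seconds_elapsed, size_mb):
    """Errors per gigabyte of memory per hour between two counter readings.

    count_start, count_end: ce_count (or ue_count) at the two scans
    seconds_elapsed:        time between the scans in seconds
    size_mb:                memory covered by the counter, from size_mb
    """
    if seconds_elapsed <= 0 or size_mb <= 0:
        raise ValueError("elapsed time and memory size must be positive")
    new_errors = count_end - count_start
    if new_errors < 0:
        # Assumption: a smaller value means the counters were reset (or the
        # node rebooted), so count_end holds all errors since the reset
        new_errors = count_end
    hours = seconds_elapsed / 3600.0
    gigabytes = size_mb / 1024.0
    return new_errors / (gigabytes * hours)


# Hypothetical readings: 16 new correctable errors in 24 hours on 64GB
rate = error_rate_per_gb_hour(0, 16, 24 * 3600, 65536)
print(f"{rate:.4f} correctable errors per GB per hour")
```

For a 64GB controller whose ce_count rose from 0 to 16 over 24 hours, the result is about 0.0104 errors per gigabyte per hour. The same function works for ue_count and for a single csrow, as long as size_mb matches the scope of the counter.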

Time to Code Some Tools!

Good system administrators will write simple code to compute error rates. Great system administrators will create a database of the values so a history of the error rates can be examined. With this in mind, simple code takes the data entries from the /sys filesystem for the memory controllers and the DIMMs and writes these values to a text file. The text file serves as the "database" of values, from which historical error rates can be computed and examined.

The first of two tools scans the /sys filesystem and writes the values and the time of the scan to a file. Time is written in "seconds since the epoch" [8], which can be converted to any time format desired. The second tool reads the values from the database and creates a list within a list, which prepares the data for analysis, such as plotting graphs and statistical analyses.
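Converting a stored timestamp back into a readable form takes only the Python standard library. In this minimal sketch, the helper name is hypothetical, the output is deliberately in UTC, and the value 1456940649 is just an example timestamp.

```python
from datetime import datetime, timezone


def epoch_to_iso(seconds):
    """Convert seconds since the epoch to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc).isoformat()


print(epoch_to_iso(1456940649))  # prints 2016-03-02T17:44:09+00:00
```

Using UTC avoids ambiguity when nodes in different time zones write to the same file.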

Database Creation Code

The values from the /sys filesystem scan are stored in a text file as comma-separated values (CSV format) [9], so they can be read by a variety of tools and imported into spreadsheets. The code can be applied to any system (host), so each data entry begins with the hostname. A shared filesystem is the preferred location for storing the text file. A simple pdsh command can run the script on all nodes of the system and write to the same text file. Alternatively, a cron job created on each system can run the tool at specified times and write to the text file. With central cluster control, the cron job can be added to the compute node instance or pushed to the nodes.

Before writing any code, you should define the data format. For the sake of completeness, all of the values for each csrow entry (all DIMMs) are written to the text file to allow a deeper examination of the individual DIMMs and to allow the error values to be summed for either each memory controller or the entire host (host-level examination).

The final file format is pretty simple to describe. Each row or line in the file will correspond to one scan of the /sys filesystem, including all of the memory controllers and all of the csrow values. Each row will have the following comma-separated values.

Hostname
Time in seconds since epoch (integer)
For each memory controller 0 to N:
  "mc" + memory controller number (e.g., mc0)
  For each DIMM (csrow) in that memory controller:
    "csrow" + csrow number (e.g., csrow4)
    Memory DIMM size in megabytes (size_mb)
    Correctable error count (ce_count) for the particular csrow
    Uncorrectable error count (ue_count) for the particular csrow

A sample of an entry in the database might be:

login2,1456940649,mc0,csrow0,8192,0,0,csrow1,8192,0,0,...
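To make the layout concrete, here is a hedged Python sketch that turns one such row into a nested dictionary. The function name and the dictionary keys are invented for this example, and the mc1 portion of the sample row is made up to show a second controller.

```python
def parse_scan(fields):
    """Turn one CSV row (a list of strings) into a nested dictionary."""
    scan = {"host": fields[0], "time": int(fields[1]), "controllers": {}}
    current = None  # csrow dictionary of the controller being read
    i = 2
    while i < len(fields):
        token = fields[i]
        if token.startswith("mc"):
            # A new memory controller starts here
            current = scan["controllers"].setdefault(token, {})
            i += 1
        elif token.startswith("csrow"):
            if current is None:
                raise ValueError("csrow field before any memory controller")
            # Each csrow name is followed by size_mb, ce_count, ue_count
            size_mb, ce, ue = (int(v) for v in fields[i + 1:i + 4])
            current[token] = {"size_mb": size_mb, "ce_count": ce, "ue_count": ue}
            i += 4
        else:
            raise ValueError(f"unexpected field {token!r} at position {i}")
    return scan


# Sample row; the mc1 part is hypothetical
sample_row = ("login2,1456940649,mc0,csrow0,8192,0,0,csrow1,8192,0,0,"
              "mc1,csrow0,8192,2,0").split(",")
scan = parse_scan(sample_row)
print(scan["host"], scan["time"], sorted(scan["controllers"]))
print(scan["controllers"]["mc1"]["csrow0"])
```

Because every csrow carries its own name and exactly three values, the parser does not need to know in advance how many controllers or csrows a host has.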

Listing 4 is based on sample code from a previous article [6].

Listing 4: /sys Filesystem Scan

#!/bin/bash
#
#
# Original script:
# https://bitbucket.org/darkfader/nagios/src/
#    c9dbc15609d0/check_mk/edac/plugins/edac?at=default
# The best stop for all things EDAC is
#   http://buttersideup.com/edacwiki/ and
#       edac.txt in the kernel doc.
# EDAC memory reporting
if [ -d /sys/devices/system/edac/mc ]; then
    host_str=$(hostname)
    output_str="$host_str,$(date +%s)"
    # Iterate over the memory controllers (mc0, mc1, ...); the mc* glob
    # skips any other entries in this directory
    for mc in /sys/devices/system/edac/mc/mc* ; do
        [ -d "$mc" ] || continue
        # Label the controller with its directory name (e.g., mc0)
        output_str="$output_str,${mc##*/}"
        ue_total_count=0
        ce_total_count=0
        # Iterate over the csrow entries of this controller
        for csrow in "$mc"/csrow* ; do
            [ -d "$csrow" ] || continue
            # Label the row with its directory name (e.g., csrow4)
            output_str="$output_str,${csrow##*/}"
            ue_count=$(cat "$csrow/ue_count")
            ce_count=$(cat "$csrow/ce_count")
            dimm_size=$(cat "$csrow/size_mb")
            # Accumulate the per-controller totals
            ue_total_count=$((ue_total_count + ue_count))
            ce_total_count=$((ce_total_count + ce_count))
            output_str="$output_str,$dimm_size,$ce_count,$ue_count"
        done
        #echo "  UE count is $ue_total_count on memory controller $mc "
        #echo "  CE count is $ce_total_count on memory controller $mc "
    done
    echo "$output_str" >> /tmp/file.txt
fi

The data is output to a text file in /tmp for testing purposes. This location can be changed; as mentioned earlier, a shared filesystem is recommended.

Analysis Tool

The second tool can be as simple or as complex as desired. A basic function would plot the error rate values for each DIMM for each host and memory controller as a function of time. Additionally, the memory errors could be summed for each host and the memory error rate plotted versus time for each host.

Another use of the tool would be to conduct a statistical analysis of the error rates to uncover trends in the historical data. It could be as simple as computing the average and standard deviation of the error rate over time (looking to see if the error rates are increasing or decreasing) or as complex as examining the error rates as functions of time or location in the data center.
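A first pass at that kind of analysis needs nothing beyond the Python standard library. The sketch below is only an illustration: the function name is hypothetical, the rates are made-up values in errors per gigabyte per hour, and comparing the older half of the samples with the newer half is just one crude way to spot a trend.

```python
from statistics import mean, stdev


def summarize(rates):
    """Summarize a time-ordered list of error rates (at least two samples)."""
    if len(rates) < 2:
        raise ValueError("need at least two samples")
    half = len(rates) // 2
    return {
        "mean": mean(rates),
        "stdev": stdev(rates),             # sample standard deviation
        "older_mean": mean(rates[:half]),  # first half of the history
        "newer_mean": mean(rates[half:]),  # second half of the history
    }


# Hypothetical error rates for one host, oldest first
rates = [0.0, 0.0, 0.01, 0.01, 0.02, 0.04]
summary = summarize(rates)
trend = "increasing" if summary["newer_mean"] > summary["older_mean"] else "not increasing"
print(f"mean={summary['mean']:.4f} stdev={summary['stdev']:.4f} trend={trend}")
```

With these sample values, the newer half has the higher mean, so the sketch reports an increasing error rate.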

The code in Listing 5 is a very simple Python script that reads the CSV file and creates a list of lists (like a 2D array). Although the code is short, it illustrates how easy it is to read the CSV data. From this point, error rates can be computed along with all sorts of statistical analyses and graphing.

Listing 5: Reading the Scanned Data

#!/usr/bin/env python3
import csv

# ===================
# Main Python section
# ===================
#
if __name__ == '__main__':
    # file.txt is the text file written by the scan script; adjust the
    # path if the file is stored elsewhere
    with open('file.txt', newline='') as f:
        reader = csv.reader(f)
        data_list = list(reader)
    # end with
    print(data_list)
# end if
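One possible next step, sketched here under the row format described earlier, is to sum the csrow counters of each scan into a host-level time series. The function name is hypothetical, and the two sample rows are invented for illustration.

```python
from collections import defaultdict


def host_totals(data_list):
    """Map each hostname to a time-sorted list of (time, ce_total, ue_total)."""
    series = defaultdict(list)
    for row in data_list:
        host, when = row[0], int(row[1])
        ce_total = ue_total = 0
        # Fields 0 and 1 are the hostname and the time; scan the rest
        for i in range(2, len(row)):
            if row[i].startswith("csrow"):
                # A csrow name is followed by size_mb, ce_count, ue_count
                ce_total += int(row[i + 2])
                ue_total += int(row[i + 3])
        series[host].append((when, ce_total, ue_total))
    for samples in series.values():
        samples.sort()
    return dict(series)


# Hypothetical scans of one host, taken one hour apart
data_list = [
    ["login2", "1456940649", "mc0", "csrow0", "8192", "0", "0",
     "csrow1", "8192", "1", "0"],
    ["login2", "1456944249", "mc0", "csrow0", "8192", "2", "0",
     "csrow1", "8192", "1", "1"],
]
totals = host_totals(data_list)
print(totals["login2"])
```

Each tuple holds the scan time and the summed correctable and uncorrectable counts, which is the input needed to compute host-level error rates between consecutive scans.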

Parting Words

As mentioned in the article about how to kill a supercomputer, memory errors, either correctable or uncorrectable, can lead to problems. Tracking error rates over time is an important part of monitoring a system.

A huge "thank you" is owed to Dr. Tommy Minyard at the University of Texas Advanced Computing Center (TACC) and to Dr. James Cuff and Dr. Scott Yockel at Harvard University, Faculty of Arts and Sciences Research Computing (FAS RC), for their help with access to systems used for testing.