Tuning your filesystem's cache
Tune-Up
Anyone who has spent time tuning the performance of a system is keenly aware of the bottlenecks between permanent storage on disk and working storage in RAM. As a rule of thumb, a hard drive is three orders of magnitude slower than random-access memory. Because of this disparity, modern operating systems cache the reads and buffer the writes of the storage subsystem.
In Linux, the kernel puts all memory not claimed by any program to generous behind-the-scenes use; I focus on the disk page cache [1] here. When a newly started application increases system load, the kernel evicts cached pages as required to provide working memory for the new process – the illusion of free memory is preserved, even though those chips were most likely hard at work for the kernel all along.
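You can watch this allocation on any Linux box without special tools: the Cached field in /proc/meminfo reports how much RAM the kernel is currently devoting to the page cache. A minimal check:

```shell
# Show total, free, and page-cache memory. "Cached" counts page-cache
# pages (excluding swap cache); on a busy system, MemFree stays low
# while Cached stays high -- the "free" RAM is working for the kernel.
grep -E '^(MemTotal|MemFree|Cached):' /proc/meminfo
```

Start a memory-hungry program and watch Cached shrink as its pages are evicted to make room.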
The disk page cache uses memory pages in RAM to provide fast access to data stored in disk blocks, with the kernel using "temporal locality" as the organizing principle for deciding what to cache: recently accessed blocks of data are assumed to be the most likely to be accessed again. Given that on most systems not all permanent storage can be cached simultaneously, it makes sense to optimize the disk cache for your workload. For example, you can ensure a uniform performance experience for clients by pre-caching the directory tree yourself, rather than letting page cache misses slow down the first load.
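Even before reaching for a dedicated tool, a crude way to pre-warm a directory tree is simply to read every file in it once; the kernel caches the pages as a side effect. A sketch – the scratch directory here stands in for whatever data your clients hit first:

```shell
# Stand-in data directory; substitute the tree your workload reads.
dir=$(mktemp -d)
echo "hello" > "$dir/file.txt"

# Read every regular file once and discard the output; the page cache
# keeps the pages around, so later reads are served from RAM.
find "$dir" -type f -exec cat {} + > /dev/null

echo "warmed $(find "$dir" -type f | wc -l) file(s)"
rm -r "$dir"
```

The catch, as we will see, is that the kernel is free to evict these pages again whenever memory pressure rises.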
Enter Amazon EC2: If you launch new instances often, those new VMs will initially exhibit this uneven performance behavior. Elastic Block Store (EBS) volumes are far from high-performance IOPS devices, which makes the disparity all the more obvious. The team behind the smartphone app Instagram illustrated this in the story of their successful EC2 scaling experience [2].
A good tool for cache control is vmtouch [3], which is at home on most Linux 2.6, FreeBSD 7.x, or Solaris 10 kernels; mileage may vary on other *nix variants. Expect partial functionality on Mac OS X and OpenBSD.
To check out my Ubuntu 11.04 laptop's /bin caching status, I enter:
$ ./vmtouch /bin
Vmtouch crawled the files in /bin and reported that about half of the memory pages for the utilities there were cached. It does this via the mincore(2) system call [4], which was first introduced in 4.4 BSD but is not present in all *nix systems; it reports which pages of a process's address space are resident in core.
This tool provides a quick way to determine core-memory residency, but vmtouch can manipulate that status as well (Listing 1). The -t option "touches" the pages of a file, causing them to be loaded into the page cache; similarly, the -e option evicts a file's pages from memory. For faster access, you can use these options to load a set of files or directories into RAM; however, over time, the kernel might choose to evict your less often used files in favor of other, more recently used files that are not the object of your performance concerns. This is not the behavior you are trying to enforce. Vmtouch comes to the rescue with its most powerful option: -dl, for "daemonize and lock."
Listing 1: Vmtouch Gzip Example
$ ./vmtouch /bin/gzip
           Files: 1
     Directories: 0
  Resident Pages: 0/15  0/60K  0%
         Elapsed: 8.1e-05 seconds
$ ./vmtouch -vt /bin/gzip
/bin/gzip
[OOOOOOOOOOOOOOO] 15/15

           Files: 1
     Directories: 0
   Touched Pages: 15 (60K)
         Elapsed: 0.001102 seconds
$
Daemonize and lock operates just like -t (touch), but it calls mlock(2) on all memory mappings and doesn't close the file descriptors when finished. Vmtouch then goes into daemon mode, sitting in the background to keep those pages loaded in memory, at least until your next reboot. Now you can ensure that files stay permanently in the page cache – use it wisely, because overtaxing a system's RAM could ruin the very performance metric you are tuning.
The Instagram team has released a vmtouch script [5] that dumps the page cache status of an EC2 instance, allowing you to carry over the page-fault state of a running system to a newly spawned one, rather than starting from a fixed set of files and directories.