A closer look at hard drives
Disco Mania
In the previous issue, I covered some history and the basics of hard drive layout and operation as a prelude to diving fully into the subject in this second part of my series. Resuming where I left off, I'll try to find out everything that my laptop knows about its internal drive (Figure 1).
I have an 80GB Intel 320 SSD, which performs remarkably close to its specified sequential read rating of 270MBps [1]. It is the write performance of this second-generation drive, however, that demonstrates the significant benefits of the TRIM [2] extension, which lets an SSD distinguish a true overwrite from a write onto unallocated free space. Because a drive's logic has no insight into filesystem structure, these two operations were previously indistinguishable, needlessly degrading SSD write performance. TRIM lets the filesystem notify the disk of file deletions, which resolves the problem in most configurations; RAID setups are the exception and remain negatively affected.
The /etc/fstab file shows that this partition is formatted with Ubuntu 12.04's default ext4 filesystem, which is indeed capable of issuing TRIM commands to the disk. But, had I not studied Intel's spec sheets, how would I know that the disk can make use of such functionality?
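Whether ext4 actually sends TRIM while the filesystem is in use depends on how it is mounted: online TRIM is requested with the discard mount option (running fstrim periodically is the usual alternative). The following minimal sketch checks fstab-style text for that option. The sample entries and UUIDs are hypothetical placeholders, not the contents of my laptop's file.

```python
def trim_capable_mounts(fstab_text):
    """Return (mount_point, fs_type, has_discard) for each fstab entry."""
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed entry, ignore it
        _device, mount_point, fs_type, options = fields[:4]
        has_discard = "discard" in options.split(",")
        entries.append((mount_point, fs_type, has_discard))
    return entries


# Hypothetical sample; a real check would read /etc/fstab instead.
SAMPLE_FSTAB = """\
# <file system> <mount point> <type> <options> <dump> <pass>
UUID=0000-hypothetical /     ext4 errors=remount-ro,discard 0 1
UUID=1111-hypothetical none  swap sw                        0 0
"""

for mount_point, fs_type, has_discard in trim_capable_mounts(SAMPLE_FSTAB):
    print(mount_point, fs_type, "discard" if has_discard else "no discard")
```

To inspect a real system, pass the contents of /etc/fstab to the function instead of the sample string.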
The hdparm [3] tool exposes all the disk's details to the administrator's prying eye. From the output presented in Listing 1, you can determine that this is indeed a solid state device (line 25) that supports SATA 2 (line 62) but not SATA 3 (Gen3 signaling is not listed), as well as the TRIM (line 67) and SMART (line 37) extensions. Hdparm also provides access to settable parameters, which makes it rather easy to corrupt or even delete a filesystem. A useful example is the secure erase extension listed at line 77; it is not even the most dangerous of the available options, so tread very carefully.
Listing 1: Output of hdparm -I /dev/sda on an Intel 320 SSD
01 /dev/sda:
02
03 ATA device, with non-removable media
04   Model Number:       INTEL SSDSA2M080G2GC
05   Serial Number:      XXXXXXXXXXXXXXXXXX
06   Firmware Revision:  2CV102HD
07   Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6
08 Standards:
09   Used: ATA/ATAPI-7 T13 1532D revision 1
10   Supported: 7 6 5 4
11 Configuration:
12   Logical         max     current
13   cylinders       16383   16383
14   heads           16      16
15   sectors/track   63      63
16   --
17   CHS current addressable sectors:   16514064
18   LBA    user addressable sectors:  156301488
19   LBA48  user addressable sectors:  156301488
20   Logical  Sector size:                   512 bytes
21   Physical Sector size:                   512 bytes
22   device size with M = 1024*1024:       76319 MBytes
23   device size with M = 1000*1000:       80026 MBytes (80 GB)
24   cache/buffer size  = unknown
25   Nominal Media Rotation Rate: Solid State Device
26 Capabilities:
27   LBA, IORDY(can be disabled)
28   Queue depth: 32
29   Standby timer values: spec'd by Standard, no device specific minimum
30   R/W multiple sector transfer: Max = 16  Current = 1
31   DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
32        Cycle time: min=120ns recommended=120ns
33   PIO: pio0 pio1 pio2 pio3 pio4
34        Cycle time: no flow control=120ns  IORDY flow control=120ns
35 Commands/features:
36   Enabled Supported:
37      *    SMART feature set
38           Security Mode feature set
39      *    Power Management feature set
40      *    Write cache
41      *    Look-ahead
42      *    Host Protected Area feature set
43      *    WRITE_BUFFER command
44      *    READ_BUFFER command
45      *    NOP cmd
46      *    DOWNLOAD_MICROCODE
47           SET_MAX security extension
48      *    48-bit Address feature set
49      *    Device Configuration Overlay feature set
50      *    Mandatory FLUSH_CACHE
51      *    FLUSH_CACHE_EXT
52      *    SMART error logging
53      *    SMART self-test
54      *    General Purpose Logging feature set
55      *    WRITE_{DMA|MULTIPLE}_FUA_EXT
56      *    64-bit World wide name
57      *    IDLE_IMMEDIATE with UNLOAD
58      *    WRITE_UNCORRECTABLE_EXT command
59      *    {READ,WRITE}_DMA_EXT_GPL commands
60      *    Segmented DOWNLOAD_MICROCODE
61      *    Gen1 signaling speed (1.5Gb/s)
62      *    Gen2 signaling speed (3.0Gb/s)
63      *    Native Command Queueing (NCQ)
64      *    Phy event counters
65           Device-initiated interface power management
66      *    Software settings preservation
67      *    Data Set Management TRIM supported (limit 8 blocks)
68      *    Deterministic read ZEROs after TRIM
69 Security:
70   Master password revision code = 65534
71           supported
72       not enabled
73       not locked
74       not frozen
75       not expired: security count
76           supported: enhanced erase
77   2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
78 Logical Unit WWN Device Identifier: 500151795934ceda
79   NAA             : 5
80   IEEE OUI        : 001517
81   Unique ID       : 95934ceda
82 Checksum: correct
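The checks I read off Listing 1 by eye can also be scripted. The following minimal sketch assumes the text format shown in the listing, where enabled features carry a leading asterisk. The function name and dictionary keys are my own invention, and the sample is an excerpt of Listing 1 rather than live output. It also recomputes the device size from the sector count, which reproduces the 80026MB figure on line 23.

```python
import re


def summarize_identify(text):
    """Extract a few facts from `hdparm -I` output (listing line numbers are tolerated)."""

    def enabled(feature):
        # hdparm marks enabled features with a leading '*'
        return re.search(r"\*\s+" + re.escape(feature), text) is not None

    sectors = re.search(r"LBA48\s+user addressable sectors:\s*(\d+)", text)
    sector_size = re.search(r"Logical\s+Sector size:\s*(\d+) bytes", text)
    size_bytes = None
    if sectors and sector_size:
        size_bytes = int(sectors.group(1)) * int(sector_size.group(1))

    return {
        "solid_state": "Solid State Device" in text,
        "smart": enabled("SMART feature set"),
        "trim": enabled("Data Set Management TRIM supported"),
        "gen2": enabled("Gen2 signaling speed"),
        "gen3": enabled("Gen3 signaling speed"),
        "size_bytes": size_bytes,
    }


# Excerpt of Listing 1, including the listing's own line numbers.
SAMPLE_IDENTIFY = """\
19 LBA48 user addressable sectors: 156301488
20 Logical Sector size: 512 bytes
25 Nominal Media Rotation Rate: Solid State Device
37 * SMART feature set
61 * Gen1 signaling speed (1.5Gb/s)
62 * Gen2 signaling speed (3.0Gb/s)
67 * Data Set Management TRIM supported (limit 8 blocks)
"""

info = summarize_identify(SAMPLE_IDENTIFY)
print(info)
print(info["size_bytes"] // 10**6, "MB (decimal)")
print(round(info["size_bytes"] / 2**30, 1), "GiB (binary)")
```

On a live system, the same function could be fed the text captured from hdparm -I, which normally requires root privileges.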
Hdparm includes a simple benchmark, which you can use to compare cached [4] and raw disk performance on a simple level:
$ hdparm -t /dev/sda
/dev/sda:
 Timing buffered disk reads: 616 MB in 3.00 seconds = 205.03 MB/sec
$ hdparm -T /dev/sda
/dev/sda:
 Timing cached reads: 6292 MB in 2.00 seconds = 3153.09 MB/sec
If this were a spinning disk, you would also be able to verify the performance degradation of ZCAV encoding [5] with the --offset option, but an SSD will exhibit similar timings regardless of where the disk is tested.
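To compare the two hdparm figures without reading them off the screen, the timing lines can be parsed and divided. This minimal sketch assumes the output format shown above; the regular expression and function name are mine, and the sample text is copied from the two runs in this article.

```python
import re

# Matches lines such as:
#  Timing cached reads: 6292 MB in 2.00 seconds = 3153.09 MB/sec
TIMING = re.compile(
    r"Timing (?P<kind>.+?) reads:\s+(?P<mb>\d+) MB in\s+"
    r"(?P<secs>[\d.]+) seconds =\s+(?P<rate>[\d.]+) MB/sec"
)


def parse_timings(output):
    """Map 'buffered disk' / 'cached' to the MB/sec figure hdparm reported."""
    return {m.group("kind"): float(m.group("rate")) for m in TIMING.finditer(output)}


SAMPLE_TIMINGS = """\
/dev/sda:
 Timing buffered disk reads: 616 MB in 3.00 seconds = 205.03 MB/sec
/dev/sda:
 Timing cached reads: 6292 MB in 2.00 seconds = 3153.09 MB/sec
"""

rates = parse_timings(SAMPLE_TIMINGS)
ratio = rates["cached"] / rates["buffered disk"]
print(f"cached reads are about {ratio:.1f}x faster than reads from the device")
```

For these two runs, the ratio comes out at roughly 15 to 1 in favor of the cache.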
The other tool that can be relied upon to examine a disk's status is smartctl, which provides access to a drive's Self-Monitoring, Analysis, and Reporting Technology (SMART) [6] metrics. Many people have remained skeptical about the actual predictive ability of SMART data, despite a study of more than 100,000 disks at Google [7]. Yet, Reallocated Sector Count (Table 1, attribute 5) remains a key monitoring metric of both solid state and rotational storage media. Table 1 shows all attributes available for my drive – bear in mind that the selection of attributes is different between vendors and changes with storage technology.
Table 1: SMART Attributes Tracked by an Intel 320 SSD
ID | Attribute Name | Hex Flag | Value | Worst | Threshold | Type | Updated | When Failed | Raw Value |
---|---|---|---|---|---|---|---|---|---|
3 | Spin_Up_Time | 0x0020 | 100 | 100 | 000 | Old_age | Offline | – | 0 |
4 | Start_Stop_Count | 0x0030 | 100 | 100 | 000 | Old_age | Offline | – | 0 |
5 | Reallocated_Sector_Ct | 0x0032 | 100 | 100 | 000 | Old_age | Always | – | 0 |
9 | Power_On_Hours | 0x0032 | 100 | 100 | 000 | Old_age | Always | – | 2456 |
12 | Power_Cycle_Count | 0x0032 | 100 | 100 | 000 | Old_age | Always | – | 501 |
192 | Unsafe_Shutdown_Count | 0x0032 | 100 | 100 | 000 | Old_age | Always | – | 40 |
225 | Host_Writes_32MiB | 0x0030 | 200 | 200 | 000 | Old_age | Offline | – | 17343 |
226 | Workld_Media_Wear_Indic | 0x0032 | 100 | 100 | 000 | Old_age | Always | – | 715 |
227 | Workld_Host_Reads_Perc | 0x0032 | 100 | 100 | 000 | Old_age | Always | – | 0 |
228 | Workload_Minutes | 0x0032 | 100 | 100 | 000 | Old_age | Always | – | 4278402805 |
232 | Available_Reservd_Space | 0x0033 | 100 | 100 | 010 | Pre-fail | Always | – | 0 |
233 | Media_Wearout_Indicator | 0x0032 | 099 | 099 | 000 | Old_age | Always | – | 0 |
184 | End-to-End_Error | 0x0033 | 100 | 100 | 099 | Pre-fail | Always | – | 0 |
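Two things can be read out of Table 1 mechanically: whether any attribute has reached its failure threshold, and how much data the host has written so far. The sketch below assumes whitespace-separated rows in the column order of Table 1 and the usual SMART convention that an attribute has failed once its normalized value is less than or equal to its threshold. The helper names are my own, and the rows are copied from the table with a plain hyphen in the When Failed column.

```python
from collections import namedtuple

Attr = namedtuple("Attr", "id name value worst thresh kind raw")


def parse_attributes(rows):
    """Parse rows laid out as: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW."""
    attrs = []
    for row in rows.strip().splitlines():
        f = row.split()
        attrs.append(
            Attr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[6], int(f[9]))
        )
    return attrs


def failing(attrs):
    """Attributes whose normalized value has dropped to or below a nonzero threshold."""
    return [a for a in attrs if a.thresh > 0 and a.value <= a.thresh]


# Four rows taken from Table 1.
SAMPLE_ROWS = """
  5 Reallocated_Sector_Ct   0x0032 100 100 000 Old_age  Always  - 0
225 Host_Writes_32MiB       0x0030 200 200 000 Old_age  Offline - 17343
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always  - 0
233 Media_Wearout_Indicator 0x0032 099 099 000 Old_age  Always  - 0
"""

attrs = parse_attributes(SAMPLE_ROWS)
print("failing attributes:", failing(attrs))

# Attribute 225 counts host writes in units of 32MiB.
host_writes = next(a for a in attrs if a.id == 225)
written_mib = host_writes.raw * 32
print(f"host writes: {written_mib} MiB, about {written_mib / 1024:.0f} GiB")
```

With the raw value of 17343 from Table 1, the drive has absorbed 554,976MiB, or roughly 542GiB, of host writes, and none of the sampled attributes is at or below its threshold.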
The ioping [8] utility is a convenient, simple I/O benchmark, and it's my new favorite way to monitor disk latency in real time:
4096 bytes from /dev/sda (device 74.5 Gb): request=1 time=0.1 ms
4096 bytes from /dev/sda (device 74.5 Gb): request=2 time=0.2 ms
4096 bytes from /dev/sda (device 74.5 Gb): request=3 time=0.2 ms
...
The ioping utility can target a device, a directory, or a file, as appropriate. You can generate a disk load with stress [9] or by compiling a kernel and then watch the effect on your system's I/O latency.
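When watching latency under load, a summary is often easier to compare than a scrolling list of requests. This minimal sketch condenses captured ioping lines into minimum, mean, and maximum latency. It assumes every sample is reported in milliseconds, as in the output shown earlier; other ioping versions may print different units, which this parser would simply skip. The function name is hypothetical.

```python
import re
import statistics


def latencies_ms(ioping_output):
    """Collect the per-request latencies (in ms) from ioping's output lines."""
    return [float(t) for t in re.findall(r"time=([\d.]+) ms", ioping_output)]


# The three requests shown earlier in this article.
SAMPLE_IOPING = """\
4096 bytes from /dev/sda (device 74.5 Gb): request=1 time=0.1 ms
4096 bytes from /dev/sda (device 74.5 Gb): request=2 time=0.2 ms
4096 bytes from /dev/sda (device 74.5 Gb): request=3 time=0.2 ms
"""

samples = latencies_ms(SAMPLE_IOPING)
print(
    f"{len(samples)} requests: "
    f"min {min(samples)} ms, "
    f"mean {statistics.mean(samples):.2f} ms, "
    f"max {max(samples)} ms"
)
```

Capturing one batch of output on an idle system and another while stress is running gives two summaries that can be compared directly.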
A broader system-wide view is produced by iotop [10], which provides a hierarchical ranking of I/O bandwidth usage by process (or thread) as consumed during the sampling interval. Figure 2 shows an artificial load's impact on the system write performance. Since kernel 2.6.13, Linux has allowed a process's I/O class and priority to be set when the CFQ scheduler is in use (the "PRIO" field in iotop's listing); both can be tuned through the ionice [11] command.
Classes include the low-priority "idle" class, which receives disk access only when no other requests are pending, as well as the default "best effort" class. The high-priority "real time" class must be used with care, because its unconditionally immediate disk access may easily starve other I/O processes. A finer-grained priority level between 0 (highest) and 7 (lowest) can also be set for classes other than idle, and it is a safer option in most cases.
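The class and level combine into a single I/O priority value inside the kernel, with the class stored above bit 13 and the level in the low bits. The following minimal sketch shows that packing and builds a matching ionice command line using its -c (class), -n (level), and -p (PID) options. The function names and the PID 1234 are hypothetical, and nothing is executed.

```python
# Kernel I/O priority layout: class in the high bits, level in the low bits.
IOPRIO_CLASS_SHIFT = 13
CLASSES = {"realtime": 1, "best-effort": 2, "idle": 3}


def ioprio_value(io_class, level=0):
    """Pack an I/O class and a 0..7 level into one priority value."""
    class_id = CLASSES[io_class]
    if io_class == "idle":
        level = 0  # the idle class has no priority levels
    elif not 0 <= level <= 7:
        raise ValueError("level must be between 0 (highest) and 7 (lowest)")
    return (class_id << IOPRIO_CLASS_SHIFT) | level


def ionice_argv(pid, io_class, level=None):
    """Build (but do not run) an ionice command line for an existing process."""
    argv = ["ionice", "-c", str(CLASSES[io_class])]
    if io_class != "idle" and level is not None:
        if not 0 <= level <= 7:
            raise ValueError("level must be between 0 (highest) and 7 (lowest)")
        argv += ["-n", str(level)]
    argv += ["-p", str(pid)]
    return argv


# Demote a hypothetical process 1234 to the lowest best-effort level.
print(" ".join(ionice_argv(1234, "best-effort", 7)))
print(ioprio_value("best-effort", 7))
```

Lowering the level within the best-effort class, as in this example, is the gentler option; switching a process to the real-time class deserves the caution described above.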