Ten Top Knoppix Rescue Tools

Lead image: Rafal Olechowski, 123RF.com

Touring the top Knoppix debugging and rescue tools

To the Rescue

Knoppix boots from the DVD drive and comes with a collection of powerful troubleshooting tools. Knoppix creator Klaus Knopper counts down his favorite rescue utilities. By Klaus Knopper

Most "automatic disk repair tools" break more than they recover by shredding already partly defective media to pieces in an attempt to automatically "repair" inconsistencies. Because these programs usually don't know anything about the facts that caused the original problem, they often make false assumptions on how to proceed.

Seasoned sys admins typically prefer to arrive for system rescue scenarios with a collection of simpler, more task-oriented tools.

The Knoppix DVD included with this issue comes with several useful rescue utilities that admins use every day to diagnose and repair broken systems. Because Knoppix is a Live system, it is designed to run from the DVD drive, which means you can use it to troubleshoot computers that won't boot normally because of misconfiguration or disk errors. Built-in NTFS support and other onboard Windows-ready utilities mean that Knoppix is well suited for troubleshooting Windows as well as Linux systems.

We asked Knoppix creator Klaus Knopper to tell us about his top 10 Knoppix rescue tools or tool combinations.

1. lspci, dmesg, and the /proc and /sys Filesystems

The virtual procfs filesystem, which is usually mounted as /proc, not only contains directories for each running process (hence its name), but also system information and interfaces for kernel subsystems such as network and filesystems. You can obtain critical information with the simplest text display program cat, and options can be set with echo.

Table 1 shows some examples of /proc commands. The output of cat /proc/interrupts is shown in Listing 1.

Table 1: /proc Filesystem Commands

cat /proc/cpuinfo	Read CPU information
cat /proc/meminfo	Read memory usage information
cat /proc/interrupts	Read interrupt allocation
cat /proc/partitions	Read detected disk partitions
echo 1 > /proc/sys/net/ipv4/ip_forward	Enable IP forwarding

Listing 1: cat /proc/interrupts Output

           CPU0    CPU1
  0:       1021       0   IO-APIC-edge      timer
  1:     150241       0   IO-APIC-edge      i8042
  9:    4024535       0   IO-APIC-fasteoi   acpi
 12:   13426923       0   IO-APIC-edge      i8042
 14:    1386145       0   IO-APIC-edge      pata_sch
 15:          0       0   IO-APIC-edge      pata_sch
 16:          9       0   IO-APIC-fasteoi   pciehp
 17:    3398067       0   IO-APIC-fasteoi   pciehp, ath9k
 18:          0       0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:    2384012       0   IO-APIC-fasteoi   ehci_hcd:usb1
 20:     199725       0   IO-APIC-fasteoi   uhci_hcd:usb2
 21:    1487680       0   IO-APIC-fasteoi   uhci_hcd:usb3
 22:   22998855       0   IO-APIC-fasteoi   psb@pci:0000:00:02.0
 23:    3084145       0   IO-APIC-fasteoi   hda_intel
 24:          1       0   PCI-MSI-edge      eth0
 ...
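To pick out the busiest interrupt sources from output like Listing 1, a small filter can help. The following is only a sketch: the function name top_irqs is made up for illustration, and it assumes the column layout shown above (IRQ number, one counter per CPU, one controller column, then the device names). Newer kernels may print additional columns.

```shell
top_irqs() {
    # $1 = file in /proc/interrupts format (default: the live file)
    awk '
        NR == 1 { ncpu = NF; next }        # header row: one field per CPU
        $1 ~ /^[0-9]+:$/ {                 # numbered IRQ lines only
            total = 0
            for (i = 2; i <= ncpu + 1; i++) total += $i
            name = $(ncpu + 3)             # first field after the controller column
            for (i = ncpu + 4; i <= NF; i++) name = name " " $i
            printf "%.0f %s %s\n", total, $1, name
        }
    ' "${1:-/proc/interrupts}" | sort -rn | head -n 5
}
```

Called without an argument, top_irqs reads /proc/interrupts and prints the five sources with the highest combined count across all CPUs.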

The sysfs filesystem, mounted at /sys, lets you read your system's hardware configuration and set parameters during run time.

To list hard disks (with names) in your computer, enter the following:

cd /sys/block; for i in sd*; do echo -n "$i: "; cat $i/device/model; done

To turn off the "touchpad off" LED on the Eee PC keyboard, use:

echo 0 > /sys/devices/platform/eeepc/leds/eeepc::touchpad/brightness

If you are looking for a specific feature with a known name, for example the rfkill switch that powers the wireless antenna, you can use a filesystem tool like find to see whether it is there, has a driver loaded, or is active:

find /sys -iname \*rfkill\*

For some (not all) WiFi adapters, writing a 1 or 0 to the rfkill file can power the transmitter on or off, which could be helpful when the designated keyboard hotkeys fail to work.

Loaded kernel modules and their parameters reside in /sys/module/modulename, where you can easily check for special parameters that were set during module load time or compiled-in defaults.
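As a rough sketch of how such a check could look, the helper below prints every readable parameter of one module. The name module_params and the optional second argument (an alternative sysfs root, handy for trying the function on a copied directory tree) are assumptions for illustration, not part of any standard tool.

```shell
module_params() {
    # $1 = module name, $2 = sysfs root (default: /sys)
    dir="${2:-/sys}/module/$1/parameters"
    [ -d "$dir" ] || { echo "no parameters exported for $1" >&2; return 1; }
    for p in "$dir"/*; do
        [ -r "$p" ] || continue            # skip unreadable (e.g. root-only) entries
        printf '%s=%s\n' "$(basename "$p")" "$(cat "$p" 2>/dev/null)"
    done
}
```

For example, module_params loop lists the parameters of the loopback driver, provided the module is loaded and exports any.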

Reading information from /proc and /sys usually does not require root privileges (unless encryption keys or other non-public information is accessed), but if you want to set a parameter, you need root access.

A command like sudo echo -n 1 > /sys/devices/platform/eeepc/cardr (for switching on the internal card reader), issued as an unprivileged (knoppix) user, will NOT work: the shell you are typing in opens the file for the redirection before sudo even starts, and that fails if you are not already root.

Instead, use echo -n 1 | sudo tee /sys/devices/platform/eeepc/cardr, which will copy the echo output to the file using the tee command as root.

The commonly known commands ps (process status) and lspci (list PCI and other devices) are nice front ends to /proc and /sys. The command

lspci -v -k

displays which hardware is installed in your computer and which kernel module is responsible for it.

The dmesg command reveals a lot about the sequence in which hardware has been detected and initialized by the kernel. dmesg also shows errors that are otherwise invisible.

You might detect hard disk errors first with this command. For example, the dmesg output in Listing 2 shows that the error reading the file test.data was actually caused by an unrecoverable physical read error of the hard disk (which means that parts of the disk are unreadable).

Listing 2: dmesg Output

knopper:~# md5sum test.data
md5sum: test.data: Input/output error

knopper:~# dmesg|tail
ide: failed opcode was: unknown
end_request: I/O error, dev hda, sector 119422514
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=119422536, high=7, low=1982024, sector=119422522
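On a long kernel log, it can help to reduce messages like those in Listing 2 to a short list of affected devices and sectors. The filter below is a minimal sketch; bad_sectors is an illustrative name, and the pattern assumes the "I/O error, dev ..., sector ..." wording shown in the listing.

```shell
bad_sectors() {
    # Reads kernel log text on stdin; prints unique "device sector" pairs
    sed -n 's/.*I\/O error, dev \([^,]*\), sector \([0-9][0-9]*\).*/\1 \2/p' | sort -u
}
```

A typical call would be dmesg | bad_sectors. The resulting sector numbers are useful later when deciding which areas of a disk to skip.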

2. rsync

For making backups and copying complex directory trees and mixed filetypes, optionally over a network with encryption and authentication with SSH, rsync is a good choice. The command

rsync -HavP originaldir copy_dir

tries to create an identical copy of a file (or directory), including file type, date, and access rights (option -a).

The -H option makes sure that hard links are recreated on the copy, and -v, together with -P, shows the progress of copying. If interrupted, rsync will pick up the copy where it left off in the previous attempt, making sure that only the parts that have changed are copied.

When you use the syntax username@hostname:filename as source or destination, rsync transfers the files between computers over ssh. Please note that rsync will NOT read the content of a block device file like /dev/sda; rather, it creates a block device special file as the copy.

For reading and writing data to and from block devices, you should use dd or dd_rescue.

3. dd and dd_rescue

Both dd (a standard Unix system command) and dd_rescue (a specialized rescue derivative of dd) copy data block-wise. The main difference, apart from the syntax, is that dd_rescue can skip read errors efficiently and replace them with zeros without truncating the resulting output file.

To copy a "good" disk to an image file, use the command

dd if=/dev/sda of=sda.img bs=1M

which will copy the entire disk content of /dev/sda to the image file sda.img. However, dd quits on read errors; if called with the flag conv=noerror, it carries on but changes the data layout, because the blocks it could not read are missing from the output.
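Because dd treats regular files and devices alike, the copy-and-verify routine can be rehearsed without touching a real disk. The sketch below uses a small scratch file in place of /dev/sda; all file names are placeholders chosen for this example.

```shell
work=$(mktemp -d)

# Scratch "disk": 4 MiB of random data standing in for /dev/sda
dd if=/dev/urandom of="$work/disk.bin" bs=1M count=4 2>/dev/null

# Same invocation as for the real disk, writing the image into the work directory
dd if="$work/disk.bin" of="$work/disk.img" bs=1M 2>/dev/null

# cmp -s is silent and succeeds only if both files are byte-identical
if cmp -s "$work/disk.bin" "$work/disk.img"; then
    copy_ok=yes
else
    copy_ok=no
fi
echo "copy identical: $copy_ok"
```

On a real rescue job, the same cmp check against the source device costs a second full read of the disk, so it only makes sense for media that are known to be healthy.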

dd_rescue is the best tool for reading partially defective disks (Listing 3).

Listing 3: Reading a Partially Defective Disk

dd_rescue -A /dev/sdb sdb.img

dd_rescue: (info): ipos:   1667072.0k, opos:   1667072.0k, xferd:   1667072.0k
                   errs:         0, errxfer:      0.0k, succxfer:   1667072.0k
             +curr.rate: 5270kB/s, avg.rate:  5808kB/s, avg.load:         8.1%

The resulting image, sdb.img, contains what was readable on the original disk, /dev/sdb, with all unreadable data set to zero, so it is still the same size, and (in the best case) it is still mountable via loopback.

In some cases, especially if you lack the time to read the entire disk, using option -r with dd_rescue can be useful. This option reads the disk backward; that is, it reads data from the last block to the first block. Although this option saves time, it doesn't save space; the output file is the size of the source device.

If you know that the partition table of the source device is still good and matches the actual filesystem size, you can also copy just the partition of interest by replacing sdb with sdb1 in the command shown in Listing 3.

If you are concerned about putting too much stress on already defective hard disk read heads, you can manually skip positions by using the -s startposition option to tell dd_rescue to start behind known defective areas.

4. testdisk and gpart

A common rescue case is a wrong or missing partition table, so either the BIOS cannot find the operating system partition at boot time, or data partitions disappear. This can be caused by Windows-related problems such as viruses that try to install themselves in the master boot record or create a hidden partition.

However, the problem could also arise from mistakes during repartitioning or partition resize attempts, or hardware defects just in the first 512 bytes of the disk. In the last case, copying over the hard disk to a file image (as described in the previous section) is a MUST.

In a PC-style partition table, the first four partitions are defined directly in the first 512 bytes and are therefore called primary partitions, whereas extended partitions are defined at different later sectors.

Creating a backup of your /dev/sda disk's primary partition table is simple with the use of:

dd if=/dev/sda of=sda.mbr bs=512 count=1

To restore just the partition table later, without overwriting an existing boot record residing in the first few bytes of the disk, the corresponding command is as follows:

dd if=sda.mbr of=/dev/sda bs=1 count=64 skip=446 seek=446
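Before writing a saved sector back, it is worth checking that the backup really looks like a boot sector. A PC-style MBR ends with the two bytes 0x55 0xAA at offset 510, directly after the 64-byte partition table that starts at offset 446. The sketch below builds a dummy sector for demonstration; mbr_signature is an illustrative helper name, and the dummy file merely stands in for a real sda.mbr backup.

```shell
work=$(mktemp -d)

# Dummy 512-byte sector: all zeros, then the 0x55 0xAA signature at offset 510
dd if=/dev/zero of="$work/sda.mbr" bs=512 count=1 2>/dev/null
printf '\125\252' | dd of="$work/sda.mbr" bs=1 seek=510 conv=notrunc 2>/dev/null

mbr_signature() {
    # Prints the two signature bytes of file $1 as hex ("55aa" for a valid sector)
    dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

mbr_signature "$work/sda.mbr"; echo
```

Note the conv=notrunc flag: when the target of dd is an image file rather than a device, dd would otherwise truncate the file after the last byte written. The same applies if you restore a partition table into a disk image instead of /dev/sda.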

Both testdisk and gpart attempt to find lost partition starts and ends, or partition sizes that differ from what the partition table claims; testdisk is intended to be used interactively, and gpart is controlled by command-line options.

testdisk has a fancy interactive text-mode ncurses interface, and it can optionally display files and directories on found partitions, so you can make an educated guess about whether or not the partition has been identified correctly. However, the limited file listing is sometimes not reliable and can crash testdisk when the filesystem itself contains severe errors.

If you want to work on a file that was copied from the (defective) old disk rather than risk a change of the original, you can use the losetup method described in a later section and use one of the /dev/loop* devices for simulating a real disk.

I usually just write down the partition limits found by testdisk or gpart and do the real partitioning manually with sfdisk or fdisk, after making a backup of the original partition table blocks, as described in section 5.

5. (s)fdisk

sfdisk is often used in shell scripts to check and repartition a disk drive non-interactively. However, it can also create human-readable backups of both primary and extended partition tables. The command

sfdisk -uS -d /dev/sda > sda.sfdisk

will create an ASCII file that looks like the file shown in Listing 4.

Listing 4: sfdisk Output

# partition table of /dev/sda
unit: sectors

/dev/sda1 : start=       63, size=160649937, Id=83, bootable
/dev/sda2 : start=160650000, size=  8032500, Id=82
/dev/sda3 : start=168682500, size=319709565, Id=83
/dev/sda4 : start=        0, size=        0, Id= 0

Listing 4 displays the disk partitioning in sector units. You can easily recreate the partition table by feeding sda.sfdisk as input to sfdisk:

sfdisk /dev/sda < sda.sfdisk
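If you have edited such a dump by hand, for example with values found by testdisk or gpart, a quick plausibility check before writing it to disk can save trouble. The following sketch flags overlapping entries; check_dump is a made-up name, and the check assumes a dump in the "start=, size=" layout of Listing 4 containing primary partitions only (an extended partition would be reported as overlapping the logical partitions inside it).

```shell
check_dump() {
    # stdin: sfdisk dump; stdout: "<name> <first>-<last sector> <ok|OVERLAP>"
    sed -n 's/^\([^ ]*\) *: *start= *\([0-9]*\), *size= *\([0-9]*\),.*/\1 \2 \3/p' |
    sort -k2,2n |
    awk '
        $3 == 0 { next }                   # unused table entry
        {
            status = ($2 < prev_end) ? "OVERLAP" : "ok"
            printf "%s %.0f-%.0f %s\n", $1, $2, $2 + $3 - 1, status
            if ($2 + $3 > prev_end) prev_end = $2 + $3
        }
    '
}
```

Run it as check_dump < sda.sfdisk. For the table in Listing 4, all three partitions follow each other without gaps or overlaps.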

fdisk is probably the best-known command-line tool for interactive partitioning tasks, and it has an output format that is easier to read, as you can see in Listing 5.

Listing 5: fdisk Output

fdisk -l /dev/sda

Disk /dev/sda: 250.1 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xabf319e9

Device Boot         Start         End      Blocks   Id  System
/dev/sda1   *           1       10000    80324968+  83  Linux
/dev/sda2           10001       10500     4016250   82  Linux swap / Solaris
/dev/sda3           10501       30401   159854782+  83  Linux

When called interactively with the fdisk /dev/sda command, fdisk provides a simple command-line menu that allows you to change partition and disk parameters, which are stored back to disk when leaving fdisk with the w command. Leaving with the q command will leave the partition table unchanged, so it's quite safe to first conduct all changes, check the results with p (print partition table), and then decide whether it's alright to save.

Both fdisk and sfdisk can work on files instead of disks, although for files, you need to tell fdisk the desired "disk geometry" first because this information cannot be gathered from a block device when you are just working with a file copy.

6. losetup and cryptoloop

Working with copies of the disk in the form of an image file makes more sense if you can tell Linux to handle the file as if it were a real disk, including the I/O controls, which are used to read geometry and partition sizes. For this feature, the loopback device kernel module (which has nothing to do with the network loopback lo 127.0.0.1) comes in handy.

The loopback feature maps a file to one or more block devices in /dev/loop*. In this way, your system gets to see additional disks and partitions once the file is attached with the losetup command. For example,

losetup /dev/loop0 sda.img

will attach the file sda.img to loopback device /dev/loop0, if it is free. Afterward, tools like testdisk or fdisk will work on the additional disk.

For accessing partitions inside the image (or on a block device that lacks a partition record, if you know where the partitions are located), you can use the offset parameter:

losetup /dev/loop0 sda.img -o offset_in_bytes

Now, to calculate the offset correctly, you need to have access to the partition information.

If, for example, the partition record is good or was obtained from testdisk or gpart, you need to multiply the start location from sfdisk -uS or fdisk -lu by 512 to get the offset in bytes.

In Listing 6, the start of partition 1 is sector 8192, so losetup would map the first partition correctly with the following command:

losetup /dev/loop0 sda.img -o $((512 * 8192))

Listing 6: sfdisk Output

sfdisk -uS -d sda.img

# partition table of sda.img
unit: sectors

sda.img1 : start=     8192, size=   7736320, Id= b
sda.img2 : start=        0, size=         0, Id= 0
...

You can use the $(()) construct of the Bash shell for in-place arithmetic calculations, or you can just calculate the result by yourself.
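For images with several partitions, the multiplication can be left to a small filter. This is a sketch under two assumptions: the sector size is 512 bytes, and the dump uses the layout of Listing 6. The function name partition_offsets is invented for this example.

```shell
partition_offsets() {
    # stdin: "sfdisk -uS -d" dump; stdout: "<name> <byte offset> <byte size>"
    sed -n 's/^\([^ ]*\) *: *start= *\([0-9]*\), *size= *\([0-9]*\),.*/\1 \2 \3/p' |
    while read -r name start size; do
        [ "$size" -gt 0 ] || continue      # skip unused table entries
        echo "$name $((start * 512)) $((size * 512))"
    done
}
```

Piping the dump through it, as in sfdisk -uS -d sda.img | partition_offsets, yields one line per partition; the second column is the value to pass to losetup -o.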

Now it should be possible to mount the partition. The filesystem should know where it starts and ends, so it does not really matter that the losetup-mapped partition spans the rest of the disk:

mount /dev/loop0 /mnt

If you want to specify the filesystem manually and not use auto-detection, use the command

mount -t filesystem-type /dev/loop0 /mnt

The mount command has an interface to loopback, so you can do both steps, attaching the file to a loopback device and mounting the partition, at once:

mount -o loop,offset=$((512 * 8192)) sda.img /mnt

Now, files can be copied from (and to, if you did not mount read-only with -r) the mounted partition.

If mount cannot find a free loop device on its own, you can specify which one to use by replacing

loop

with:

loop=/dev/loop1

This approach also works for Knoppix-like cloop compressed files, if using loop=/dev/cloop1 (or any other free cloop device), in which the image file is a cloop-compressed image. However, the read-only mount option -r might be mandatory, depending on your version of losetup and mount.

The losetup command supports the cryptoloop extension, which is present in the kernel, by way of the following options:

-e – encryption type
-k – key length in bits
-N – use the passphrase as the real key; don't use a hash algorithm

For example:

losetup -e aes -k 256 -N /dev/loop0 partition.img

If your distribution uses a special derivative of losetup from the loop-aes package, the command uses the following options:

-e – encryption type, including key bits
-H – hash algorithm for mangling keyphrases

The example above then looks like this,

losetup -e AES256 -H unhashed2 /dev/loop0 partition.img

letting you mount the same encrypted partition. Mounting is as before,

mount /dev/loop0 /mnt

but now, the data are read and stored on the hard disk in encrypted form, and the mountpoint is a decrypted view of the filesystem.

If losetup produces strange errors only when using encryption, you might have to load the encryption modules first, as in this example:

for m in loop cryptoloop aes_generic aes_i586 cbc; do
 [ -d /sys/module/"$m" ] || modprobe "$m"
done

Don't forget to detach the file from loopback after umount; otherwise, you won't get rid of references, and the loopback device will remain blocked.

umount does this automatically when using the -d option for "loopback detach":

umount -d /mnt

In case you forgot -d when unmounting,

losetup -d /dev/loop0

will detach the file from /dev/loop0.

7. mount

I discussed most of mount's extended features in connection with losetup in the previous section. My last resort, for the unlikely case that neither gpart nor testdisk could find partitions, is the (very slow) Bash script shown in Listing 7, which tries to mount each sector beginning from a "start" offset to the end of the disk, trying all filesystem types known to the kernel.

Listing 7: Looking for Partitions

#!/bin/bash

# This is the "dd" or "dd_rescue"-generated disk image
DEV="disk.img"

# Destination mountpoint, existing unused directory
MNT=/mnt

# Higher values speed up searching, but may miss the actual start of a filesystem
# Minimum is 512, and the value MUST be a multiple of 512, which is the sector size.
SKIP="4096"

# Last offset to try (size of the image in bytes)
final_offset="$(du -b "$DEV" | cut -f 1)"

# A free loop device
LOOP="/dev/loop2"

# Start
offset=0

while [ "$offset" -lt "$final_offset" ]; do
 echo -n -e "\rTrying $offset..."
 if mount -o loop="$LOOP",ro,offset="$offset" "$DEV" "$MNT" >>mount.log 2>&1 ; then
  echo -e "\nFound a filesystem, now mounted on $MNT"
  break
 fi
 # Release the loop device in case the failed mount left it attached
 losetup -d "$LOOP" >/dev/null 2>&1
 offset=$((offset + SKIP))
done

A log is kept in mount.log. Because it uses privileged commands, the script must run with root permissions.

8. foremost and photorec

Some tasks, such as rescuing individual files or finding deleted files, can be handled quickly by looking into the filesystem-independent data.

The foremost command searches for well-known file types by inspecting the data, and it creates new files for matches that were found in a new directory. For example,

foremost sda.img

will search inside the disk image sda.img for all file types known to foremost (a variety of picture formats such as JPEG and PNG, movies, and office document types, all with a header containing the file size) and put all files found into subdirectories of the default output directory named output.

If Unix-ish filesystems are involved, which contain "indirect" blocks (i.e., blocks in which the file is not written in consecutive sectors but spread over the disk with linked blocks), option -d might be helpful. If some files cannot be restored completely because they are partly overwritten, you can try the foremost option -a, which also recovers fragments. The use of this option can multiply the amount of data that is recovered because overlapping parts are duplicated.

photorec does much more than its name suggests. Like foremost, photorec can recover pictures and other multimedia and document formats, but photorec is interactive. Its look and feel resembles testdisk, perhaps because it comes from the same authors, which is also why it is located in the same package with testdisk under Debian.

If you just want to rescue pictures and documents from an accidentally "formatted" flash disk (wherein "format" in most digital cameras just means that the file index is erased), you should try photorec or foremost first.

9. hexedit

Sometimes, none of the tools mentioned so far in this article succeeds in discovering a file because most of the partition's contents are overwritten or unreadable. Still, it might be possible to get at least a few fragments from files that contain important information that you don't want to lose. hexedit is the tool I use to check for known text or byte combinations, to find out whether the disk is still worth investigating for lost data.

hexedit can work on file-based disk-images as well as disks or partitions. It is used interactively, and you can switch between the "binary" (hex) view on the left side, and a textual representation on the right side with the Tab key. This is especially important if you search for text fragments with the / search command. Before searching text, you have to switch to the right (text) side with Tab; otherwise, your search mode will be binary/hex.

Figure 1 shows how an NTFS volume looks when edited with hexedit.

Figure 1: Editing an NTFS volume with hexedit.

In this example, the ASCII search string JFIF is used to search for headers of JPEG pictures.

Searching for text fragments can help you find text files or older backups. Journaling filesystems typically don't overwrite an old version of a file once a new version of the same file is stored, so you might find a lot of backups lying around on the partition, even if the file has meanwhile been deleted.

If a fragment is discovered, you can mark the beginning and end of the data you want to keep and save it to a new file. If you don't know exactly where the file starts and ends, just select a generous range, store everything you need, then cut the file to the right size later.
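The same search-and-save procedure can be approximated non-interactively, which is handy on large images. The two helpers below are a sketch, not a replacement for hexedit: find_marker and carve are made-up names, and the -a, -b, and -o options assume GNU grep.

```shell
find_marker() {
    # $1 = marker text, $2 = image file; prints the byte offset of each hit
    grep -a -b -o -F -- "$1" "$2" | cut -d: -f1
}

carve() {
    # $1 = image, $2 = start offset in bytes, $3 = length in bytes, $4 = output file
    tail -c +"$(($2 + 1))" "$1" | head -c "$3" > "$4"
}
```

For instance, find_marker JFIF sda.img lists candidate positions, and carve sda.img 1048576 2000000 fragment.bin (offset and length chosen arbitrarily here) saves a generous range starting at one of them. In a JFIF file, the marker sits six bytes after the start of the file, so subtract 6 from a reported offset to include the complete header.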

hexedit can also be used to edit internal structures like partition tables in binary mode, and if you know some architecture-specific machine binary code, you can alter the behavior of bootloaders or other programs by directly editing them on-disk then saving back your changes.

10. ntfsprogs and ntfs-3g

Since the availability of ntfs-3g (http://www.tuxera.com/), a free userspace driver for NTFS filesystems, NTFS has become a good choice for exchanging data with Windows users.

NTFS, as opposed to FAT32, supports files larger than 4GB and has a few filesystem features similar to Unix, such as extended file attributes and permissions. When installing the ntfs-3g package (which is already present on Knoppix), the mount command filesystem helpers are extended by a new type, -t ntfs-3g, allowing similar syntax in the mount and ntfs-3g commands. In some distributions, the default for mount -t ntfs is also changed to match the ntfs-3g default. Unlike the kernel's internal NTFS driver, ntfs-3g has full write support. You can mount an NTFS filesystem with:

ntfs-3g /dev/partition /mountpoint

or

mount -t ntfs-3g /dev/partition /mountpoint

ntfs-3g will try to repair some internal structures of NTFS automatically, in case the filesystem is in an inconsistent state, at least to a degree that it can be mounted normally.

ntfsfix is part of the (ntfs-3g-independent) ntfsprogs package. All it does is reset the NTFS journal and mark the filesystem as "to-be-checked" on the next Windows reboot. This step can help Windows fix errors on startup, but mounting the filesystem with ntfs-3g should seldom need this as a requirement for accessing defective NTFS volumes. (I did once have a case where it was necessary, though.)

ntfsclone is a tool that creates a very exact copy of an NTFS filesystem and stores everything to a "sparse" file, which means that only the data parts really used for files are physically copied and allocated space in the destination file. It analyzes the filesystem structure and skips sectors that are not in use.

A simple cp or rsync on a mounted filesystem will basically do the same thing but will not keep the NTFS-specific extended attributes intact, so the copied filesystem might not start correctly in Windows.
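If the term "sparse" is unfamiliar, the effect is easy to observe on any filesystem that supports holes in files. The sketch below creates an empty 100MB file and compares the size it claims with the space it actually occupies; it assumes the GNU versions of truncate and stat, and the file name is a placeholder.

```shell
work=$(mktemp -d)

# Create a 100 MiB file without writing any data blocks
truncate -s 100M "$work/sparse.img"

# Size the file claims to have, in bytes
apparent=$(stat -c %s "$work/sparse.img")

# Space really allocated: number of blocks times the block unit stat reports
allocated=$(( $(stat -c %b "$work/sparse.img") * $(stat -c %B "$work/sparse.img") ))

echo "apparent: $apparent bytes, allocated: $allocated bytes"
```

An image written by ntfsclone behaves the same way: ls -l shows the full filesystem size, whereas du shows only what is really in use. Keep in mind that copying such a file with a tool that is not sparse-aware can inflate it to the full size.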

If ntfs-3g fails to repair and mount a defective NTFS volume, you can still try the read-only kernel-internal ntfs filesystem by adding the -i option to the mount command, which will keep mount from calling mount.ntfs:

mount -r -i -t ntfs /dev/sda1 /mnt

These are my top 10 rescue tools. I hope something here will help you the next time you come up against a broken system.

Troubleshooting with Klaus

My general rules for recovery are:

Know the system you are working on.
Read or write only as many times as really necessary (especially if you think the medium might be defective). When you read a partly defective sector over and over again, disk corruption can get worse, which might lead to losing a complete disk read/write head.
Copy the complete disk, not just the parts you assume are relevant (unless you don't have enough time for a complete copy). Other parts of the disk might provide clues that will make it easier to find lost data.
Always work with a copy when writing back changes. I use disk images (files) rather than writing on a raw disk. Make sure that the filesystem can handle files the size of the complete original disk you want to recover. FAT32, for example, can only handle files of up to 4GB each. ext3 or ReiserFS are OK for storing large files. You might need to buy a fresh disk.
Be patient. Data transfer over USB2 is not very speedy; you can expect about 5MBps, so make sure you have enough time to spend on physically copying over huge amounts of data.
Be sure you use the right source and target device before issuing a command, especially when working as root, which is necessary for commands that send privileged I/O controls to the kernel.
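To get a feeling for how patient you need to be, a rough estimate is quickly done in the shell. The sketch below assumes a constant transfer rate and decimal megabytes; copy_time is an illustrative name, and the example size is the 250GB disk from Listing 5.

```shell
copy_time() {
    # $1 = size in bytes, $2 = sustained rate in MB/s (1 MB = 1,000,000 bytes)
    secs=$(( $1 / ($2 * 1000000) ))
    echo "$((secs / 3600))h $(( (secs % 3600) / 60 ))m"
}

copy_time 250059350016 5    # prints "13h 53m"
```

Read errors slow things down considerably, so treat the result as a lower bound for a defective disk.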