Nuts and Bolts: Linux Filesystems

Lead Image © Ulrich Krellner, 123RF.com
 

Managing Linux filesystems

Rank and File

Linux filesystems range from block-based network filesystems, to temporary filesystems in RAM, to pseudo filesystems. We explain how filesystems are set up and how to manage them. By Georg Schönberger, Werner Fischer

Imagine a filesystem as a library that stores data efficiently and in a structured way. Without filesystems, persistent data would not be possible. Virtually every Linux system has at least one block-based filesystem (e.g., ext4, XFS, Btrfs). Block-based means that an underlying physical data store is involved, such as a hard drive, solid-state drive (SSD), or SD card. Linux has a number of filesystems from which to choose, and the ext2/3/4 series is likely known by everyone. If you work with a current distribution, you have probably met other filesystems, too (Table 1).

Table 1: Standard Filesystems

Distribution                   Filesystem
Debian (from v7.0 wheezy)      ext4
Ubuntu (from v9.04)            ext4
Fedora (from v22)              XFS
SLES (from v12)                Btrfs for the root partition, XFS for data partitions
RHEL 7                         XFS

Most filesystems are very similar and differ only in detail. The concepts introduced in the following sections will help you understand them.

From RAM to Persistent Memory

Random access memory (RAM) has speed advantages over hard drives and SSDs; therefore, the Linux kernel uses a caching mechanism that keeps data in RAM to reduce disk access. This cache is known as the page cache; running the free command reveals its current size (Listing 1). At first glance, 2.7GB of 7.7GB of RAM is available to the system. If the RAM usage for the page cache is deducted, then actually 5.6GB is free. The page cache thus occupies 2.7GB (cached column). The buffers column also belongs to the page cache; buffers is where cached filesystem metadata resides.

Listing 1: Free Space

$ free -h
                      total       used       free       shared       buffers       cached
Mem:                  7.7G        4.9G       2.7G       228M         203M          2.7G
-/+ buffers/cache:    2.1G        5.6G
Swap:                 1.0G          0B       1.0G

The page cache consists of physical pages in RAM, whose data pages are associated with a block device. The page cache size is always dynamic, because it uses any RAM that is not being used by the operating system. If the system suffers from high memory consumption, the page cache size is reduced, freeing up memory for applications.

The page cache is a write-back cache, which means it buffers both read and write data. A read from the block device propagates the data to the cache, which is then passed to the application. A write access lands directly in the page cache and not immediately on the block device. Data pages modified while in the page cache are called "dirty pages," because the modified data has not yet been written to persistent storage. Gradually, the Linux kernel writes data from RAM to the block device.
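You can watch the amount of dirty data waiting for writeback in /proc/meminfo; the values fluctuate constantly depending on write activity:

```shell
# Show data that has been modified in the page cache but not yet
# written back to the block device (values are system dependent)
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

Running the sync command flushes all dirty pages to persistent storage, after which the Dirty value drops back toward zero.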

In addition to periodically writing data through the kernel, ext4 explicitly synchronizes its data and metadata using an interval of five seconds by default. You can change the sync time if necessary with the commit option to the mount command (see the ext4 documentation at kernel.org [2]). In the worst case, the data still in the RAM is lost in a sudden power outage. The longer the commit interval, the greater the risk of data loss.

The use of RAM as a cache provides huge performance advantages for the user. Don't forget, however, that RAM is volatile and not persistent. This fact forced itself into the awareness of many ext4 users recently when the "data corruption caused by unwritten and delayed extents" bug caused a stir [3]. On ext4, ephemeral files may never even reach the block device [4] under certain circumstances because of "delayed allocation."

Unlike ext3, ext4 delays allocating physical write blocks so the filesystem can accumulate data and allocate contiguous blocks later. This method gives the user a speed advantage while the data is read and written in RAM. Because ext4 cannot write unallocated blocks, it depends on the kernel to flush them out, which means data can sit in RAM for minutes instead of five seconds. Ext4 is not the only filesystem that uses this acceleration technique: XFS, ZFS, and Btrfs also use delayed allocation (Table 2).
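If you want to rule out delayed allocation as a factor, ext4 lets you disable it with the nodelalloc mount option; this is a sketch, and the mountpoint /mnt/ext4fs is an assumption:

```shell
# Disable delayed allocation on an already-mounted ext4 filesystem;
# blocks are then allocated at write time, as with ext3
mount -o remount,nodelalloc /mnt/ext4fs
```

Regardless of the mount options, applications that need a hard persistence guarantee should call fsync() on their files.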

Table 2: Overview of Functional Filesystem Differences

Feature                  | ext3                               | ext4                               | XFS                                         | Btrfs
Production-ready         | X                                  | X                                  | X                                           | Partially
Utilities package        | e2fsprogs                          | e2fsprogs                          | xfsprogs                                    | btrfs-progs
Filesystem utilities     | mke2fs, resize2fs, e2fsck, tune2fs | mke2fs, resize2fs, e2fsck, tune2fs | mkfs.xfs, xfs_growfs, xfs_repair, xfs_admin | mkfs.btrfs, btrfs resize, btrfsck, btrfs filesystem
Maximum filesystem size  | 16TiB                              | 1EiB                               | 16EiB                                       | 16EiB
Maximum file size        | 2TiB                               | 1EiB                               | 8EiB                                        | 8EiB
Expand on the fly        | X                                  | X                                  | X                                           | X
Shrink on the fly        |                                    |                                    |                                             | X
Expand offline           | X                                  | X                                  |                                             |
Shrink offline           | X                                  | X                                  |                                             |
Discard (ATA trim) [5]   | X                                  | X                                  | X                                           | X
Metadata CRC [6]         | X                                  | X                                  | X                                           | X
Data CRC                 |                                    |                                    |                                             | X
Snapshots/clones/internal RAID/compression |                  |                                    |                                             | X

ext4

As the successor to ext3, ext4 is one of the most popular Linux filesystems. Although ext3 is slowly reaching its limits, with a maximum filesystem size of 16 tebibytes (TiB; slightly more than 16TB), ext4 provides sufficient space for many years with up to 1 exbibyte (EiB) capacity.

To create a new ext4 filesystem, you need an unused block device. You can simply use a spare partition (e.g., /dev/sdb1 if you have created an unused partition on the second disk) or an LVM logical volume. In the following examples, we use a logical volume (/dev/vg00/ext4fs), which means we can also expand and shrink the filesystem.
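If the logical volume does not exist yet, you can create it first; the volume group name vg00 and the 20GB size are assumptions for this example:

```shell
# Create a 20GB logical volume named ext4fs in volume group vg00
# (the volume group must already exist and have free extents)
lvcreate -L 20G -n ext4fs vg00
```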

With root privileges, run mkfs.ext4 to create the new filesystem:

mkfs.ext4 /dev/vg00/ext4fs

A newly created ext4 filesystem requires that all inode tables and the journal contain no data; the corresponding areas must therefore be reliably overwritten with zeros. This can take a fair amount of time on larger filesystems, especially with hard drives. To let you use a new filesystem as soon as possible, however, the ext4 developers implemented what they refer to as "lazy initialization": the initialization occurs not when you create the filesystem, but in the background when you first mount it. Little wonder, then, that you suddenly notice I/O activity on mounting a new filesystem.

Caution is therefore advised if you want to perform performance tests with a newly created filesystem. In such cases, you should not create the filesystem with lazy initialization; instead, you should use the following parameters:

mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vg00/ext4fs

To mount the filesystem, create an appropriate mountpoint up front and then run the mount command:

mkdir /mnt/ext4fs
mount /dev/vg00/ext4fs /mnt/ext4fs

If you want to mount the new filesystem automatically at boot time, add a corresponding entry in the /etc/fstab file. You can optionally specify the -o parameter for the mount command (e.g., to mount a partition as read-only). For the list of possible options, see the kernel.org ext4 documentation [2]. Once the filesystem is mounted, /proc/mounts shows only a few options (rw,relatime,data=ordered); further options appear there only if they were passed to the mount command or listed in /etc/fstab (e.g., errors=remount-ro):

# cat /proc/mounts | grep ext4
/dev/sda1 / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
/dev/mapper/vg00-ext4fs /mnt/ext4fs ext4 rw,relatime,data=ordered 0 0
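A matching /etc/fstab entry for the example volume could look like this (a sketch; adjust the device path and options to your setup):

```shell
# /etc/fstab: mount the logical volume automatically at boot time
/dev/vg00/ext4fs  /mnt/ext4fs  ext4  defaults,errors=remount-ro  0  2
```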

In addition to these options, however, other standard options are active. Since Linux kernel version 3.4, you can now also view options in the /proc filesystem. Listing 2 shows an example.

Listing 2: /proc Filesystem Info

# cat /proc/fs/ext4/sda1/options
rw
delalloc
barrier
user_xattr
acl
resuid=0
resgid=0
errors=remount-ro
commit=5
min_batch_time=0
max_batch_time=15000
stripe=0
data=ordered
inode_readahead_blks=32
init_itable=10
max_dir_size_kb=0

Filesystem Check

After completing the most important setup steps, the advanced administration activities start with a filesystem check. When you run a check, the corresponding ext4 filesystem must not be mounted. You simply run the check using the e2fsck program; as an alternative, you can also use the symbolic link fsck.ext4. If the filesystem was unmounted cleanly, e2fsck simply reports it clean and skips the full check; you can force a full check with the -f parameter.
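For the example filesystem, a check could look like this (the device path assumes the volume created earlier):

```shell
# The filesystem must not be mounted during the check
umount /mnt/ext4fs

# Run the check; -f forces a full check even if the filesystem is clean
e2fsck -f /dev/vg00/ext4fs
```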

Expanding and Shrinking

Before expanding or shrinking the filesystem, you should make a backup. If problems arise unexpectedly, this reduces the risk of data loss.

You can expand an ext4 filesystem in an LVM environment in a single step using the lvextend command. The prerequisite is that the corresponding LVM volume group still has enough free disk space. If you add the -r option, resize2fs is executed after the LVM logical volume expands, thus growing the underlying filesystem as well. Expansion is also possible on the fly, which means the filesystem can remain mounted.
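Sketched for the example volume, growing the logical volume and the filesystem by 10GB in one step looks like this:

```shell
# Extend the logical volume by 10GB and, thanks to -r,
# grow the ext4 filesystem online at the same time
lvextend -L +10G -r /dev/vg00/ext4fs
```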

You can also shrink an ext4 filesystem in an LVM environment in a single step using the lvreduce command. Again the -r switch causes resize2fs to be run before shrinking the LVM logical volume:

umount /mnt/ext4fs
lvreduce -L -10G -r /dev/vg00/ext4fs

Shrinking is only possible offline; the filesystem must not be mounted.

Customizing ext4

With the ext4 tune2fs tool, you can view (with -l switch) and tweak all the adjustable parameters of an ext4 filesystem. A point of interest is the reserved block count, which indicates how many blocks of the filesystem are reserved for files belonging to the root user. On a filesystem that is filling up, for example, this ensures that logfiles can still be written. By default, the reserve is five percent of the filesystem size, which is reasonable for the root filesystem. For other uses, a smaller reserve is fine, and for a filesystem for backups, you can set the reserve to zero percent:

tune2fs -m 0 /dev/mapper/vg00-ext4fs

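You can verify the change with the -l switch; filtering the output for the reserved block count is a convenient sketch:

```shell
# Show the reserved block count after tuning; 0 means no blocks
# remain reserved for the root user
tune2fs -l /dev/mapper/vg00-ext4fs | grep -i 'reserved block count'
```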

Filesystem in Userspace

The previous sections dealt with block-based filesystems, but the Linux storage stack also offers other types of filesystems. Filesystem in Userspace (FUSE) is an interesting option that lets users implement filesystems in userspace without root privileges or kernel code.

For a long time, this simple approach to managing a filesystem in userspace was not possible. Until FUSE hit the scene, filesystems had to be implemented in the kernel, with all the complexity that entails. This prompted the development of a kernel module (fuse.ko) that fields virtual filesystem requests and a library (libfuse) that passes them on to the filesystem in userspace.

If you have an innovative idea for a new filesystem, your best bet is to use FUSE. For easier programming in userspace, you have countless API bindings, from C, through Perl and Python, to Ruby, at your disposal. Before you start a new FUSE project, however, take a look at existing FUSE implementations, including prominent representatives such as NTFS-3G, EncFS, SSHFS, or ZFS.

SSHFS for Remote Filesystems

Mounting a filesystem locally via SSH is not rocket science with sshfs. You don't even need root privileges because it is a Filesystem in Userspace. As shown in Listing 3, a call to sshfs is sufficient to mount the /home/tktest directory locally on server 192.168.56.105. You can work with the target directory as with any other normal directory, even though it is on the remote server.

Listing 3: Mount /home/tktest Directory

$ sshfs tktest@192.168.56.105:/home/tktest ./sshdir/
tktest@192.168.56.105's password:
$ cat /proc/mounts | grep ssh
tktest@192.168.56.105:/home/tktest /home/user/tmp/sshdir fuse.sshfs rw,nosuid,nodev,relatime,user_id=1000,group_id=1000 0 0
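When you are done, you can detach the SSHFS directory again without root privileges using fusermount:

```shell
# Unmount the FUSE filesystem; -u stands for unmount
fusermount -u ./sshdir
```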

Conclusion

As an alternative to the simple block devices featured so far, you can create stackable block devices that offer additional features, such as the Logical Volume Manager (LVM) already discussed, software RAID (MD/RAID), a distributed replicated block device (DRBD), or device mapper targets.