Managing Linux filesystems
Rank and File
Imagine a filesystem as a library that stores data efficiently and in a structured way. Without filesystems, data could not be stored persistently in any organized form. Virtually every Linux system has at least one block-based filesystem (e.g., ext4, XFS, Btrfs). Block-based means that an underlying physical data store is involved, such as a hard drive, solid-state drive (SSD), or SD card. Linux offers a number of filesystems to choose from, of which the ext2/3/4 series is probably the best known. If you work with a current distribution, you have probably met other filesystems, too (Table 1).
Table 1: Standard Filesystems
Distribution | Filesystem |
---|---|
Debian (from v7.0 wheezy) | ext4 |
Ubuntu (from v9.04) | ext4 |
Fedora (from v22) | XFS |
SLES (from v12) | Btrfs for the root partition, XFS for data partitions |
RHEL 7 | XFS |
Most filesystems are very similar and differ only in detail. The following terms will help you understand them:
- Superblock: Stores metadata about a filesystem, such as the total number of blocks and inodes, block sizes, UUIDs, and timestamps.
- Inode: An index node, which comprises metadata associated with a file, such as permissions, owners, timestamps, and so on. In addition to this descriptive information, an inode can contain direct extents (data) or refer to another inode.
- Extents: An area of storage reserved for a file. Older filesystems used direct and indirect blocks to reference blocks of data, whereas modern filesystems use a more efficient method with extents [1]. Extent mapping is a more efficient way to map logical filesystem blocks to physical blocks.
- Journaling: A method of tracking changes that have not yet been committed to the filesystem. A journal comes into its own in exceptional situations, such as during the recovery of filesystems that have crashed (e.g., because of a sudden power failure). Journaling ensures filesystem consistency, because operations recorded in the journal are either performed in full or not at all. With this information, you can get back to a consistent state faster without having to go through a lengthy filesystem check.
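The inode metadata described above can be inspected from the shell. The following is a minimal sketch, assuming GNU coreutils stat and a throwaway file created with mktemp; the function name inode_info is hypothetical and chosen only for this illustration.

```shell
#!/bin/sh
# Print inode number, permissions, owner, size, and modification time of a file.
# Assumes GNU coreutils stat (the -c format option).
inode_info() {
    stat -c 'inode=%i mode=%A owner=%U size=%s mtime=%y' "$1"
}

tmpfile=$(mktemp)            # scratch file, for illustration only
echo "hello" > "$tmpfile"
inode_info "$tmpfile"

# Filesystem-wide figures as reported by statfs; block-based filesystems
# derive them from the superblock (block size, total and free inodes).
stat -f -c 'blocksize=%S inodes=%c free_inodes=%d' "$tmpfile"
rm -f "$tmpfile"
```

The first call shows per-file metadata held in the inode; the second shows totals for the filesystem that contains the file.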
From RAM to Persistent Memory
Random access memory (RAM) has speed advantages over hard drives and SSDs; therefore, the Linux kernel uses a caching mechanism that keeps data in RAM to reduce disk access. This cache is known as the page cache; running the free command reveals its current size (Listing 1). At first glance, only 2.7GB of the 7.7GB of RAM is free. However, the kernel can reclaim memory held by the page cache whenever applications need it, so once the page cache is counted as available, 5.6GB is actually free. The page cache itself occupies 2.7GB (cached column). The buffers column also belongs to the page cache; buffers is where cached filesystem metadata resides.
Listing 1: Free Space
$ free -h
             total       used       free     shared    buffers     cached
Mem:          7.7G       4.9G       2.7G       228M       203M       2.7G
-/+ buffers/cache:       2.1G       5.6G
Swap:         1.0G         0B       1.0G
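The 5.6GB figure in the -/+ buffers/cache row is simply the sum of the free, buffers, and cached columns. As a small worked example, the sketch below redoes that sum with the rounded values from Listing 1; the helper name reclaimable_gib is hypothetical.

```shell
#!/bin/sh
# Memory that is free or reclaimable: free + buffers + cached.
# Arguments: free (GiB), buffers (MiB), cached (GiB), as printed in Listing 1.
reclaimable_gib() {
    awk -v f="$1" -v b="$2" -v c="$3" \
        'BEGIN { printf "%.1f\n", f + b / 1024 + c }'
}

reclaimable_gib 2.7 203 2.7   # prints 5.6
```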
The page cache consists of physical pages in RAM, whose data pages are associated with a block device. The page cache size is always dynamic, because it uses any RAM that is not being used by the operating system. If the system suffers from high memory consumption, the page cache size is reduced, freeing up memory for applications.
The page cache is a write-back cache, which means it buffers both read and write data. A read from the block device propagates the data to the cache, which is then passed to the application. A write access lands directly in the page cache and not immediately on the block device. Data pages modified while in the page cache are called "dirty pages," because the modified data has not yet been written to persistent storage. Gradually, the Linux kernel writes data from RAM to the block device.
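The amount of dirty data currently waiting in the page cache is exposed by the kernel in /proc/meminfo. The following sketch reads the Dirty field; the function name dirty_kib and its optional file argument (useful for trying it against a saved copy) are assumptions made for this example.

```shell
#!/bin/sh
# Print the size of all dirty pages in KiB.
# Reads /proc/meminfo by default; another file can be passed instead.
dirty_kib() {
    awk '$1 == "Dirty:" { print $2 }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "Dirty pages: $(dirty_kib) KiB"
fi
```

Running the sync command, which asks the kernel to write dirty data out, and then calling the function again should typically show a smaller value.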
In addition to the kernel's periodic writeback, ext4 explicitly synchronizes its data and metadata every five seconds by default. You can change this interval if necessary with the commit option to the mount command (see the ext4 documentation at kernel.org [2]). In the worst case, data still in RAM is lost in a sudden power outage. The longer the commit interval, the greater the risk of data loss.
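As a sketch of the syntax, the helper below assembles a remount command that sets the commit interval. It only prints the command, because running it requires root privileges; the function name commit_remount_cmd, the 15-second value, and the mountpoint /mnt/ext4fs are example assumptions.

```shell
#!/bin/sh
# Build the remount command that sets the ext4 commit interval (in seconds).
# The command is only printed; executing it requires root privileges.
commit_remount_cmd() {
    case $1 in
        ''|*[!0-9]*)
            echo "interval must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    printf 'mount -o remount,commit=%s %s\n' "$1" "$2"
}

commit_remount_cmd 15 /mnt/ext4fs
```

According to the ext4 documentation, commit=0 has the same effect as leaving the interval at its default of five seconds.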
The use of RAM as a cache provides huge performance advantages for the user. Don't forget, however, that RAM is volatile and not persistent. This fact forced itself into the awareness of many ext4 users recently when the "data corruption caused by unwritten and delayed extents" bug caused a stir [3]. On ext4, ephemeral files may never even reach the block device [4] under certain circumstances because of "delayed allocation."
Unlike ext3, ext4 delays allocating physical blocks for written data so that the filesystem can accumulate data and allocate contiguous blocks later. This method gives the user a speed advantage when reading and writing data that is still in RAM. Because ext4 cannot write out data whose blocks have not yet been allocated, such data depends on the kernel's regular writeback to reach the disk, which can mean minutes in RAM instead of five seconds. Ext4 is not the only filesystem that uses this optimization: XFS, ZFS, and Btrfs also use delayed allocation (Table 2).
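Applications that cannot tolerate this window can ask the kernel to flush a file explicitly. The sketch below is one possible way to do so from the shell, using dd with conv=fsync, which calls fsync() on the output file before dd exits; the function name durable_write and the scratch file are assumptions for illustration.

```shell
#!/bin/sh
# Write standard input to a file and force the data to the storage device
# before returning (conv=fsync makes dd call fsync() on the output file).
durable_write() {
    dd of="$1" conv=fsync 2>/dev/null
}

target=$(mktemp)             # scratch file, for illustration only
echo "important data" | durable_write "$target"
cat "$target"
rm -f "$target"
```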
Table 2: Overview of Functional Filesystem Differences
Feature | ext3 | ext4 | XFS | Btrfs |
---|---|---|---|---|
Production-ready | X | X | X | Partially |
Utilities package | | | | |
Filesystem utilities | | | | |
Maximum filesystem size | 16TiB | 1EiB | 16EiB | 16EiB |
Maximum file size | 2TiB | 16TiB | 8EiB | 8EiB |
Expand on the fly | X | X | X | X |
Shrink on the fly | – | – | – | X |
Expand offline | X | X | – | – |
Shrink offline | X | X | – | – |
Discard (ATA trim) [5] | X | X | X | X |
Metadata CRC [6] | X | X | X | X |
Data CRC | – | – | – | X |
Snapshots/clones/internal RAID/compression | – | – | – | X |
ext4
As the successor to ext3, ext4 is one of the most popular Linux filesystems. Although ext3 is slowly reaching its limits, with a maximum filesystem size of 16 tebibytes (TiB; about 17.6TB), ext4 provides sufficient space for many years with up to 1 exbibyte (EiB) capacity.
To create a new ext4 filesystem, you need an unused block device. You can simply use a spare partition (e.g., /dev/sdb1 if you have created an unused partition on the second disk) or an LVM logical volume. In the following examples, we use a logical volume (/dev/vg00/ext4fs), which means we can also expand and shrink the filesystem.
With root privileges, run mkfs.ext4 to create the new filesystem:
mkfs.ext4 /dev/vg00/ext4fs
A newly created ext4 filesystem requires that all inode tables and the journal do not contain data. The corresponding areas must therefore be reliably overwritten with zeros. This may take a fair amount of time for larger filesystems, especially with hard drives; however, to let you use a new filesystem as soon as possible, the ext4 developers have implemented what they refer to as "lazy initialization," or initialization that occurs not when you create a filesystem, but in the background when you first mount the filesystem. Little wonder then that you suddenly notice I/O activity on mounting a new filesystem.
Caution is therefore advised if you want to perform performance tests with a newly created filesystem. In such cases, you should not create the filesystem with lazy initialization; instead, you should use the following parameters:
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vg00/ext4fs
To mount the filesystem, create an appropriate mountpoint up front and then run the mount command:
mkdir /mnt/ext4fs
mount /dev/vg00/ext4fs /mnt/ext4fs
If you want to mount the new filesystem automatically at boot time, add a corresponding entry to the /etc/fstab file. You can optionally pass the -o parameter to the mount command (e.g., to mount a partition read-only). For the list of possible options, see the kernel.org ext4 documentation [2]. Once the filesystem is mounted, /proc/mounts shows only a few options: some defaults (rw, relatime, data=ordered) plus any options that were passed to the mount command or set in /etc/fstab (e.g., errors=remount-ro):
# cat /proc/mounts | grep ext4
/dev/sda1 / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
/dev/mapper/vg00-ext4fs /mnt/ext4fs ext4 rw,relatime,data=ordered 0 0
In addition to these options, however, other default options are active. Since Linux kernel version 3.4, you can view these options, too, in the /proc filesystem. Listing 2 shows an example.
Listing 2: /proc Filesystem Info
# cat /proc/fs/ext4/sda1/options
rw
delalloc
barrier
user_xattr
acl
resuid=0
resgid=0
errors=remount-ro
commit=5
min_batch_time=0
max_batch_time=15000
stripe=0
data=ordered
inode_readahead_blks=32
init_itable=10
max_dir_size_kb=0
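To get a quick overview across several filesystems, the options files can be scanned in a loop. The sketch below prints the delayed allocation flag and the commit interval for every ext4 filesystem it finds; the function name ext4_options and the overridable base directory are assumptions added for this example.

```shell
#!/bin/sh
# Print delalloc and commit settings for each ext4 filesystem that exposes
# an options file. The base directory defaults to /proc/fs/ext4.
ext4_options() {
    base=${1:-/proc/fs/ext4}
    for f in "$base"/*/options; do
        [ -r "$f" ] || continue          # no ext4 filesystems or no access
        dev=$(basename "$(dirname "$f")")
        opts=$(grep -E '^(commit=|delalloc$)' "$f" | paste -s -d ' ' -)
        echo "$dev: $opts"
    done
}

ext4_options
```

For the filesystem in Listing 2, this would report sda1 with delalloc and commit=5.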
Filesystem Check
After completing the most important setup steps, the advanced administration activities start with a filesystem check. When you run a check, the corresponding ext4 filesystem must not be mounted. You simply run the check using the e2fsck program; as an alternative, you can also use the symbolic link fsck.ext4. If the filesystem is marked clean (i.e., it was properly unmounted), e2fsck exits without performing a full check; you can force a full check with the -f parameter.
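A forced check can be tried out safely without touching a real volume. The following sketch creates a small ext4 filesystem inside a regular file and checks it, which needs no root privileges; the 64MB scratch image and the function name check_image are assumptions for illustration and stand in for the logical volume used elsewhere in this article.

```shell
#!/bin/sh
# Create a small ext4 filesystem in a regular file and check it with e2fsck.
PATH=$PATH:/sbin:/usr/sbin   # e2fsprogs often lives outside a user's PATH

check_image() {
    truncate -s 64M "$1" || return 1
    mkfs.ext4 -q -F "$1" || return 1
    # -f forces a full check even though the filesystem is marked clean;
    # -n opens it read-only and answers "no" to all questions.
    e2fsck -f -n "$1"
}

if command -v mkfs.ext4 >/dev/null 2>&1; then
    img=$(mktemp)
    check_image "$img"
    echo "e2fsck exit status: $?"
    rm -f "$img"
else
    echo "e2fsprogs not installed; skipping"
fi
```

An exit status of 0 means that e2fsck found no errors.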
Expanding and Shrinking
Before expanding or shrinking the filesystem, you should make a backup. If problems arise unexpectedly, this reduces the risk of data loss.
You can expand an ext4 filesystem in an LVM environment in a single step using the lvextend command. The prerequisite is that the corresponding LVM volume group still has enough free space. If you add the -r option, the ext4 resize2fs command is executed after the logical volume has been expanded, thus growing the filesystem on it as well. Expansion is also possible on the fly, which means the filesystem can remain mounted.
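The single-step expansion might look like the following sketch. The size of 10GB is an arbitrary example value, and the run helper with its DRY_RUN switch is an assumption added here so that the command is only printed unless you explicitly enable execution as root.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the command;
# set DRY_RUN=0 and run as root to execute it.
DRY_RUN=${DRY_RUN:-1}

run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

# Grow the logical volume by 10GB and resize the ext4 filesystem on it.
run lvextend -r -L +10G /dev/vg00/ext4fs
```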
You can also shrink an ext4 filesystem in an LVM environment in a single step using the lvreduce command. Again, the -r switch causes resize2fs to be run before the LVM logical volume is shrunk:
umount /mnt/ext4fs
lvreduce -L -10G -r /dev/vg00/ext4fs
Shrinking is only possible offline; the filesystem must not be mounted.
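To see what an offline shrink does to the filesystem itself, you can experiment on a scratch image instead of a logical volume. The sketch below is an illustration under that assumption: it checks an unmounted 64MB image, shrinks it to 32MB with resize2fs, and compares the block count stored in the superblock. The function names fs_blocks and shrink_fs are hypothetical.

```shell
#!/bin/sh
# Offline shrink of an ext4 filesystem, demonstrated on a scratch image file.
PATH=$PATH:/sbin:/usr/sbin   # e2fsprogs often lives outside a user's PATH

# Print the filesystem's block count as recorded in the superblock.
fs_blocks() {
    dumpe2fs -h "$1" 2>/dev/null |
        awk -F: '/^Block count/ { gsub(/ /, "", $2); print $2 }'
}

# Check the unmounted filesystem, then shrink it to the given size.
shrink_fs() {
    e2fsck -f -p "$1" >/dev/null || return 1
    resize2fs "$1" "$2" >/dev/null 2>&1
}

if command -v resize2fs >/dev/null 2>&1; then
    img=$(mktemp)
    truncate -s 64M "$img"
    mkfs.ext4 -q -F "$img"
    echo "blocks before: $(fs_blocks "$img")"
    shrink_fs "$img" 32M
    echo "blocks after:  $(fs_blocks "$img")"
    rm -f "$img"
else
    echo "e2fsprogs not installed; skipping"
fi
```

On a logical volume, the volume itself still has to be reduced afterward, which is the part that lvreduce takes care of.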
Customizing ext4
With the ext4 tune2fs tool, you can view (with the -l switch) and tweak all the adjustable parameters of an ext4 filesystem. A point of interest is the reserved block count, which indicates how many blocks of the filesystem are reserved for files belonging to the root user. On a filesystem that is filling up, for example, this ensures that logfiles can still be written. By default, the reserve is five percent of the filesystem size, which is reasonable for the root filesystem. For other uses, a smaller reserve is fine, and for a filesystem for backups, you can set the reserve to zero percent with the -m option:
tune2fs -m 0 /dev/mapper/vg00-ext4fs
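The effect of changing the reserve can be observed on a scratch image without root privileges. The following sketch assumes a 64MB image file in place of a real volume; the function name reserved_blocks is hypothetical.

```shell
#!/bin/sh
# Show the reserved block count before and after setting the reserve to 0%.
PATH=$PATH:/sbin:/usr/sbin   # e2fsprogs often lives outside a user's PATH

# Extract the "Reserved block count" field from the tune2fs -l output.
reserved_blocks() {
    tune2fs -l "$1" |
        awk -F: '/^Reserved block count/ { gsub(/ /, "", $2); print $2 }'
}

if command -v tune2fs >/dev/null 2>&1; then
    img=$(mktemp)
    truncate -s 64M "$img"
    mkfs.ext4 -q -F "$img"
    echo "reserved blocks (default):    $(reserved_blocks "$img")"
    tune2fs -m 0 "$img" >/dev/null
    echo "reserved blocks (after -m 0): $(reserved_blocks "$img")"
    rm -f "$img"
else
    echo "e2fsprogs not installed; skipping"
fi
```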
Filesystem in Userspace
The previous sections dealt with block-based filesystems, but the Linux storage stack also offers other types of filesystems. Filesystem in Userspace (FUSE) is an interesting option that lets users create filesystems in userspace without root privileges and without writing kernel code.
For a long time, this simple approach to managing a filesystem in userspace was not possible. Until FUSE hit the scene, filesystems had to be implemented in the kernel, with all the complexity that entails. This prompted the development of a kernel module (fuse.ko) that fields virtual filesystem requests and a library (libfuse) that passes them on to the filesystem in userspace.
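Whether the running kernel currently offers FUSE can be checked in /proc/filesystems, which lists fuse once the module is loaded or built in. The sketch below is a minimal test; the function name fuse_supported and its optional file argument are assumptions for this example.

```shell
#!/bin/sh
# Return success if "fuse" is listed as a filesystem type known to the kernel.
# -w matches the whole word only, so fuseblk and fusectl do not count.
fuse_supported() {
    grep -qw fuse "${1:-/proc/filesystems}"
}

if fuse_supported; then
    echo "FUSE is available"
else
    echo "FUSE is not available (module not loaded?)"
fi
```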
If you have an innovative idea for a new filesystem, your best bet is to use FUSE. For easier programming in userspace, you have countless API bindings, from C, through Perl and Python, to Ruby, at your disposal. Before you start a new FUSE project, however, take a look at existing FUSE implementations, including prominent representatives such as NTFS-3G, EncFS, SSHFS, or ZFS.
SSHFS for Remote Filesystems
Mounting a remote filesystem locally via SSH is not rocket science with sshfs. You don't even need root privileges because it is a Filesystem in Userspace. As shown in Listing 3, a single call to sshfs is sufficient to mount the /home/tktest directory on server 192.168.56.105 locally. You can work with the mounted directory as with any other local directory, even though it resides on the remote server.
Listing 3: Mount /home/tktest Directory
$ sshfs tktest@192.168.56.105:/home/tktest ./sshdir/
tktest@192.168.56.105's password:
$ cat /proc/mounts | grep ssh
tktest@192.168.56.105:/home/tktest /home/user/tmp/sshdir fuse.sshfs rw,nosuid,nodev,relatime,user_id=1000,group_id=1000 0 0
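To find out which SSHFS mounts are currently active, you can filter /proc/mounts by the fuse.sshfs type shown in Listing 3. The following sketch prints the mountpoints; the function name sshfs_mounts and its optional file argument are assumptions for illustration.

```shell
#!/bin/sh
# List the mountpoints of all active SSHFS mounts.
# Reads /proc/mounts by default; another file can be passed instead.
sshfs_mounts() {
    awk '$3 == "fuse.sshfs" { print $2 }' "${1:-/proc/mounts}"
}

if [ -r /proc/mounts ]; then
    sshfs_mounts
fi
```

A mount listed this way can be released again without root privileges using fusermount -u followed by the mountpoint (e.g., fusermount -u ./sshdir/).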
Conclusion
As an alternative to the simple block devices featured so far, you can create stackable block devices that offer additional features, such as the Logical Volume Manager (LVM) already discussed, software RAID (MD/RAID), a distributed replicated block device (DRBD), or device mapper targets.