RAID and filesystems

From Linux Raid Wiki

Raid layout

In order to work efficiently, file systems need to understand the disk structure they are running on. File system tuning is somewhat of an arcane art, and the different variants of mkfs usually contain code to optimise the file system layout to the underlying disks. If the file system is over LVM and/or RAID, the code looks at the setup and optimises the layout.

The important figures to note are the stripe unit and the stripe width. The stripe unit (which md calls the chunk size) is the amount of data written to one disk before moving on to the next. It is usually expressed as a multiple of 512 bytes, as that was historically the size of a single disk block; a single disk block is now often 4K. A typical stripe unit is much larger than either, from several hundred kilobytes up to the order of a megabyte. The stripe width is the number of stripe units that hold data in one stripe, i.e. the number of data disks.

For raids 5 & 6, the stripe width is disks-minus-one or disks-minus-two respectively, because one or two chunks in every stripe hold parity rather than data.

For raid 1, the stripe width is 1.

For raid 10, the picture seems rather more complicated.
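The rules above can be put into a small helper. This is only a minimal sketch in Python: the function name `data_disks` is hypothetical, and raid 0 (where every disk holds data) is included for completeness although the text above does not mention it. Raid 10 is deliberately left out, since its result depends on the layout.

```python
def data_disks(level: int, disks: int) -> int:
    """Return the stripe width, i.e. the number of data chunks per stripe."""
    # Number of chunks per stripe that hold parity instead of data.
    parity = {0: 0, 5: 1, 6: 2}
    if level == 1:
        # Every mirror holds the same data, so there is one data chunk.
        return 1
    if level not in parity:
        raise ValueError(f"no simple rule for raid level {level}")
    width = disks - parity[level]
    if width < 1:
        raise ValueError("not enough disks for this raid level")
    return width


print(data_disks(5, 4))  # 3 data disks in a 4-disk raid 5
print(data_disks(6, 6))  # 4 data disks in a 6-disk raid 6
```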

Optimal Stripe Unit

Choosing the optimal stripe unit is a trade-off. Smaller stripe units mean a read can be split across multiple drives, with the resulting increase in bandwidth. But this then collides with read-ahead, where the OS or drive may retrieve more than was requested in the expectation that it will be requested soon. Larger stripe units may help writes by reducing the number of disks that need to be written to.

Many file systems have a block size which they use to allocate disk space. Files tend to accumulate at the start of these blocks, so you want to choose a stripe unit such that the stripe width and block sizes do not fit neatly into each other. This then ensures that the files are spread evenly across the disks and reduces the risk of "hot spots" where data accumulates on some drives and empty space on others.
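A toy model can illustrate the effect. The sketch below assumes a simplified array in which consecutive stripe units go to consecutive data disks with no parity rotation, and it treats the file system's allocation block as a fixed-size group whose start is where files accumulate. The function names and the sizes are illustrative assumptions, not values taken from any particular file system.

```python
from collections import Counter


def disk_for_offset(offset_kib: int, stripe_unit_kib: int, data_disks: int) -> int:
    """Data disk that holds the given offset (simplified: no parity rotation)."""
    return (offset_kib // stripe_unit_kib) % data_disks


def start_distribution(group_kib: int, stripe_unit_kib: int,
                       data_disks: int, groups: int = 12) -> Counter:
    """Count how many allocation groups start on each data disk."""
    return Counter(
        disk_for_offset(i * group_kib, stripe_unit_kib, data_disks)
        for i in range(groups)
    )


# 512 KiB stripe unit, 4 data disks: one full stripe is 2 MiB.
# A 128 MiB group is an exact multiple of the stripe, so every group
# starts on the same disk.
print(start_distribution(128 * 1024, 512, 4))

# Offsetting the group size by one stripe unit spreads the starts out.
print(start_distribution(128 * 1024 + 512, 512, 4))
```

In the first case all twelve group starts land on disk 0; in the second they are shared equally between the four disks, which is the even spread the paragraph above is after.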

File System impact

BTRFS

btrfs has the "btrfs balance" command (documented in the btrfs-balance man page). I can't make out whether this will remove hot spots when the underlying raid device changes, but a plain rebalance will rewrite the entire file system, so it looks like it will.

When reading the man pages, be careful: there are various references to raid, but these refer to btrfs's internal raid levels, not to an underlying MD device. Be warned - parity raid in btrfs is apparently still experimental (2018) and prone to eat data.

EXT

The mount command for ext4 has the "stripe" option. This is the number of data disks times the number of blocks per chunk, i.e. the size of a stripe in file system blocks. There are no equivalent mount options for optimising earlier ext versions to raid.

However, mkfs.ext2 and tune2fs have the extended options "stride" and "stripe_width", which are passed with -E. "stride" is the size (in file system blocks) of a chunk on a disk, and "stripe_width" is the number of blocks across the array, i.e. stride times number of data disks.
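The arithmetic is easy to get wrong by hand, so here is a minimal sketch that works out both values from the chunk size, the file system block size and the number of data disks. The function name `ext_raid_options` and the device `/dev/md0` are hypothetical; the resulting string is meant to be passed after -E to mkfs.ext2 or tune2fs.

```python
def ext_raid_options(chunk_kib: int, block_bytes: int, data_disks: int) -> str:
    """Build the stride/stripe_width extended options for an ext file system."""
    chunk_bytes = chunk_kib * 1024
    if chunk_bytes % block_bytes:
        raise ValueError("chunk size must be a multiple of the block size")
    # stride: file system blocks in one chunk on one disk.
    stride = chunk_bytes // block_bytes
    # stripe_width: file system blocks in one full stripe of data.
    stripe_width = stride * data_disks
    return f"stride={stride},stripe_width={stripe_width}"


# 512 KiB chunks, 4 KiB blocks, 3 data disks (e.g. a 4-disk raid 5).
print("mkfs.ext4 -E " + ext_raid_options(512, 4096, 3) + " /dev/md0")
# prints: mkfs.ext4 -E stride=128,stripe_width=384 /dev/md0
```

The stripe_width figure (384 here) is also the value the ext4 "stripe" mount option expects, since both are measured in file system blocks.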

Given that as much as possible is automated, it is likely that mkfs detects that the underlying device is MD, and sets these automatically on file system creation. It's up to you to update them when the array changes.

When a raid is resized, these should all obviously be updated, which will result in all new writes being optimised, but of course existing data on disk is likely to suffer from hotspots.

XFS

XFS is a high-performance file system. It is strongly recommended not to reshape the raid; creating a new array with the same number of data disks and adding that with LVM is the recommended approach.

Both mkfs and mount allow you to specify or override the stripe unit and stripe width used by the file system. These are usually used with mkfs on a hardware raid (where the geometry is not visible to mkfs), and with mount when the geometry has been altered, for example by a reshape. It is not clear to me whether using these options with mount updates the default or whether they have to be specified every time. But specifying them in fstab isn't a problem.
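One thing to watch is that the two tools take different units: mkfs.xfs accepts "su" as a size and "sw" as a count of data disks, while the mount options "sunit" and "swidth" are both given in 512-byte units. The following is a minimal sketch of the conversion; the function names are hypothetical and the sizes are only an example.

```python
def xfs_mkfs_options(chunk_kib: int, data_disks: int) -> str:
    """Options for 'mkfs.xfs -d': su is a size, sw a number of data disks."""
    return f"su={chunk_kib}k,sw={data_disks}"


def xfs_mount_options(chunk_kib: int, data_disks: int) -> str:
    """Options for 'mount -o': both values are in 512-byte units."""
    sunit = chunk_kib * 1024 // 512
    swidth = sunit * data_disks
    return f"sunit={sunit},swidth={swidth}"


# 512 KiB chunks across 4 data disks.
print(xfs_mkfs_options(512, 4))   # su=512k,sw=4
print(xfs_mount_options(512, 4))  # sunit=1024,swidth=4096
```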

XFS does not have a mechanism at present for rebalancing the layout, so if you reshape an array underneath it you will almost certainly create hotspots and damage performance. Updating stripe unit and stripe width ensures that new writes are optimised to the new layout so new hotspots will not be created, and as the data is rewritten existing hot spots will be removed, but performance will be reduced unless and until the file system is completely rewritten.
