A guide to mdadm

From Linux Raid Wiki
Back to Converting an existing system Forward to Scrubbing the drives

This page is an overview of mdadm. It is NOT intended as a replacement for the man pages - anything covered in detail there will be skimmed over here. This is meant to provide examples that you can adapt for yourselves.



mdadm has replaced all the previous tools for managing raid arrays. It manages nearly all the user space side of raid. There are a few things that need to be done by writing to the /proc filesystem, but not much.
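
For example, the kernel's view of all the arrays can be read from /proc/mdstat, and a few actions, such as kicking off a check, are requested by writing to the md files under /sys rather than through mdadm. Assuming an array called md0:

cat /proc/mdstat
echo check > /sys/block/md0/md/sync_action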

Getting mdadm

This is a pretty standard part of any distro, so you should use your standard distro software management tool. If, however, you are having any problems it does help to be running the absolute latest version, which can be downloaded with

git clone git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git

or from


In the absence of any other preference, it belongs in the /usr/local/src directory. As a Linux-specific program there is no autoconf configure step - just follow the instructions in the INSTALL file.
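
Assuming the usual build tools are installed, a typical build and install from the cloned tree looks like this (run the install step as root):

cd mdadm
make
make install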


Modes

mdadm has seven modes. You will normally only use a few of them. They are as follows:-


Assemble

This is probably the mode that is used most, but you won't be using it much - it happens in the background. Every time the system is booted, this needs to run. It scans the drives, looking for superblocks, and rebuilds all the arrays for you. This is why you need an initramfs when booting off a raid array - because mdadm is a user-space program, if root is on an array then we have a catch-22 - we can't boot until we have root, and we can't have root until we've booted and can run mdadm.
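
If you ever need to do this by hand - say from a rescue system - a typical invocation (device names here are only examples) is:

mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1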


Create

This is the first of the two modes you will use a lot. As the name implies, it creates arrays, and writes the superblocks for arrays that have them. It also fires off initialisation - making sure that the disks of a mirror are identical, or that on a parity array the parities are correct. This is why raids 5&6 are created in degraded mode - if they weren't then any check of the raid would spew errors for areas that hadn't been written to.


Grow

A bit of a misnomer, this mode takes care of all operations that change the size or shape of an array, such as changing the raid level, changing the number of active devices, and so on.


Manage

This is the default mode, and is used primarily to add and remove devices. It can be confusing, as several options (such as --add) are also used in grow mode, most typically when adding a device at the same time as changing the number of devices.

Follow or Monitor


Build

This is a relic of when superblocks didn't exist. It is used to (re)create an array, and should not be used unless you know exactly what you are doing. Because there are no superblocks, or indeed any array metadata, the system just has to assume you got everything right, because it has no way of checking up on you.
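
As an illustration only (device names are examples), building a two-device mirror with no superblocks looks like this - note that mdadm has no metadata with which to check that you have given it the right devices in the right order:

mdadm --build /dev/md0 --level=mirror --raid-devices=2 /dev/sda1 /dev/sdb1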

[TODO: Can you use this mode to create a temporary mirror, for the purpose of backing up your live data?]


Misc

This contains all the bits that don't really fit anywhere else.

Array internals and how they affect mdadm

The first arrays did not have a superblock, and were declared in mdadm.conf. This obviously could lead to disaster if drives were moved between machines, as they were identified only by their partition type. Adding a new drive could also shuffle the device names around, and because assembly relied on mdadm.conf, the array could be put together with the wrong drives in the wrong places.

Presumably to fix this, the version 0.9 superblock was defined, stored at the end of the device. It is also referred to as the version 0 superblock, the 0 referring to the internals of the superblock. However, it lacks support for most of the modern features of mdadm and is now obsolete. It is also ambiguous, which leads to sysadmin confusion - never a good idea. Arrays with a v0.9 superblock can be auto-assembled by the kernel but again, if drives were moved between machines, the consequences could be disastrous, as the array could be assembled incorrectly.

To fix this, a new version of the superblock was defined, version 1. The layout is common across all subversions, 1.0, 1.1 and 1.2. Autoassembly by the kernel was abandoned at this point - you must run an initramfs to boot off any v1 superblock other than a v1.0 mirror. v1 includes an array id which references the machine that created the array (eg ashdown:0) thus preventing drives being assembled into the wrong array (absent some pathological configuration).

Version 1.0 is also stored at the end of the device. This means that 0.9 can be upgraded to 1.0. It also means that, now that raid assembly is no longer supported in the kernel, the only supported way to boot from raid without an initramfs is to use a v1.0 mirror.

Version 1.1 is stored at the start of the device. This is not the best of places, as a wayward fdisk (or other programs) sometimes writes to the start of a disk and could destroy the superblock.

Version 1.2 is stored 4K from the start of the device.

Both 1.1 and 1.2 use the same algorithms to calculate the spare space left at the start of the device. Version 1.0 will leave 128K at the end of each device if it is large enough - "large enough" being defined as 200GB.

mdadm is unable to move the superblock, so there is no way of converting between the different version 1s.
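
You can see which superblock version an existing member carries with --examine, and, if you have a specific reason, request a particular version when creating an array (1.2 is the default on any recent mdadm; device names here are examples):

mdadm --examine /dev/sda1 | grep Version
mdadm --create /dev/md0 --metadata=1.0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1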

There are also two other superblock formats, ddf and imsm. These are "industry standard", not linux specific, and aren't being covered in the 2016 rewrite.

These superblocks also define a "data offset" - the gap between the start of the device and the start of the data. This means that v1.2 will always have at least 4K of such space per device. This space can be used for all sorts of things, typically the write-intent bitmap, the bad blocks log, and the backup area when reshaping an array. It is usually calculated automatically, but can be overridden.
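
The data offset of an existing member is shown in the --examine output, and on reasonably recent versions of mdadm it can be set explicitly at creation time with --data-offset (the value and device names below are just examples):

mdadm --examine /dev/sda1 | grep 'Data Offset'
mdadm --create /dev/md0 --level=5 --raid-devices=3 --data-offset=128M /dev/sda1 /dev/sdb1 /dev/sdc1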

All operations that involve moving data around are called reshapes, and require some form of backup. One of the reasons for the v1 superblock and the data offset is to provide spare space for this. Typically, the new layout will move the data offset, pushing the data block up or down. This means that each new stripe in turn is written into spare space, and then the watermark is moved so the new data is used and the old copy is freed. If the reshape is interrupted, the write that was in progress can safely be abandoned and restarted.

If the offset cannot be moved, then a backup file must be provided, on a different partition. This has a whole host of downsides, not least that the reshape cannot restart automatically if interrupted, and that it requires twice as much disk I/O, as each stripe in turn needs to be backed up. At worst, it's easy to lose the array, because if the backup file is accidentally put on the array being reshaped, you get a catch-22. In order to mount the partition and access the backup file, the array needs to be running. But you can't start the array until after the reshape has restarted. And in order to restart the reshape, you need access to the backup file.
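
Where a backup file is needed, it is passed to the grow command, and again on assembly if the reshape was interrupted and has to be restarted by hand - something along these lines, with an example path that must NOT be on the array being reshaped:

mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/root/md0-grow.backup
mdadm --assemble /dev/md0 --backup-file=/root/md0-grow.backup /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1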

When reshaping an array, mdadm will lock the stripe being updated and rewrite it. Any reads or writes to the array check whether they are above or below the point being reshaped, and use the appropriate (old or new) layout. If they hit the stripe being rewritten, they have to wait. When adding a device, the reshape starts at the beginning of the array, because after the first few stripes a gap opens up on disk and the new stripe no longer overwrites the old one. When removing a device, it starts at the end and works backwards, because that way it starts with a gap which shrinks as it goes, and only at the very end is there any risk of overwriting the stripe being rewritten.


Assembling your arrays

mdadm --assemble --scan

This is the command that runs in the background at boot, and assembles and runs all your arrays (unless something goes wrong, in which case you usually end up with a partially assembled array - a right pain if you don't realise that's what's happened).
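
If you suspect that is what has happened, /proc/mdstat will show the array sitting there inactive. A common approach (device names are examples) is to stop the half-assembled array and try again, only reaching for --force once you understand why assembly failed:

cat /proc/mdstat
mdadm --stop /dev/md0
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1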

Creating an array

Creating a mirror raid

The simplest example of creating an array is creating a mirror.

mdadm --create /dev/md/name /dev/sda1 /dev/sdb1 --level=1 --raid-devices=2

This will copy the contents of sda1 to sdb1 and give you a clean array. There is no reason why you can't use the array while it is copying (resyncing). The resync can be suppressed with the "--assume-clean" option, but you should only do this if you know the partitions have been wiped to nulls beforehand. Otherwise, the unused space will not be mirrored, and any check command will moan blue murder.
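
You can watch the resync progress with, for example:

cat /proc/mdstat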

Creating a parity raid

Now let's create a more complicated example.

mdadm --create /dev/md/name /dev/sda1 /dev/sdb1 /dev/sdc1 --level=5 --raid-devices=3 --bitmap=internal

This, unsurprisingly, creates a raid 5 array. When creating the array, you must give it exactly the number of devices it expects, ie 3 here. So two of the drives will be assembled into a degraded array, and the third drive will be resync'd to fix the parity. If you want to add a fourth drive as a spare, this must be done later. An internal bitmap has been declared, so the array will keep track of which regions have been written and may be out of sync. This means that if a drive gets kicked for some reason, it can be re-added without needing a total resync.
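
Adding that fourth drive as a spare later is a simple manage-mode operation (the device name is an example):

mdadm /dev/md/name --add /dev/sdd1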

A bitmap will be created by default if the array is over 100GB in size. Note that this is a fairly recent change, and if you are running on an old kernel you may have to delete the bitmap if you wish to use many of the "grow" options.
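
On such a kernel the bitmap can be removed before the grow operation and put back afterwards:

mdadm --grow /dev/mdN --bitmap=none
mdadm --grow /dev/mdN --bitmap=internal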

By default the raid will be created in degraded mode and will then resync. This is because, unless all your drives are blank, any integrity check would otherwise (just as with a mirror) moan blue murder that the unused parts of your array contain garbage and the parity is wrong.

Growing an array


You should not lose data - mdadm is designed to fail safe, and even when things go completely pear-shaped, the array should still assemble and run, letting you recover if the worst comes to the worst.

Note also that, if you do not have a modern kernel, these commands may fail with an error "Bitmap must be removed before size/shape/level can be changed".

Adding a drive to a mirror

This will add a new drive to your mirror. The "--grow / --raid-devices" part is optional: if you increase the number of raid devices, the new drive will become an active part of the array and the existing data will be mirrored across to it. If you don't increase the number of raid devices, the new drive will be a spare, and will only become part of the active array if one of the other drives fails.

mdadm [--grow] /dev/md/mirror --add /dev/sdc1 [--raid-devices=3]

Upgrading a mirror raid to a parity raid

The following commands will convert a two-disk mirror into a degraded two-disk raid5, and then add the third disk for a fully functional raid5 array. Note that the first command will fail if run on the array we've just grown in the previous section if you changed the number of raid-devices to three. If you have two devices and a spare the first command on its own will work. I haven't investigated whether more than two active devices plus a spare will work.

mdadm --grow /dev/md/mirror --level=5
mdadm --grow /dev/md/mirror --add /dev/sdc1 --raid-devices=3

Removing a disk from an array

This will convert the mirror from the first section into a degraded three-disk mirror, and then into a healthy two-disk mirror. Note that using OpenSUSE Leap 42 I had problems reducing the device count to 2.

mdadm /dev/md/mirror --fail /dev/sdc1 --remove /dev/sdc1
mdadm --grow /dev/md/mirror --raid-devices=2

Adding more space without adding another device

You might not have sufficient SATA ports to add any more disk drives, but you want to add more space to your array. The procedure is pretty much the same as for replacing a dodgy device.

If you have a three-disk mirror, you can just fail a drive and replace it with a larger drive, and repeat until all your drives have been replaced with bigger drives. If you've only got a two-drive mirror, the earlier advice to get an add-in SATA PCI card or somesuch applies. Either add the new drive then fail the old one, or put the old one in a USB cage on a temporary basis and --replace it.

If you have a parity raid then with a raid-6 you can, again, just fail a drive and then add in a new one, but this is not the best idea. If you don't want to place your data at risk, then for a raid-5 you *must*, and for a raid-6 you *should*, use a spare SATA port to do a --replace, or swap your old drive into a USB cage so you can do a --replace that way.
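
With the new drive attached on the spare port (or the old drive in a USB cage), the replacement is done in place, keeping full redundancy throughout. Assuming a reasonably recent mdadm that supports --replace, and with example device names, it looks something like this:

mdadm /dev/md0 --add /dev/sde1
mdadm /dev/md0 --replace /dev/sdb1 --with /dev/sde1

Once the copy has finished, the old drive can be removed from the array with --remove (failing it first if mdadm has not already marked it faulty).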

Once you've increased the size of all the underlying devices, you can increase the size of the array, and then increase the size of the filesystem or partitions in the array.

Managing an array

Changing the size of the array

Adding a new device to a mirror won't change the size of the array - the new device will store an extra copy of the data and there won't be any extra space. But adding a device to a parity array will normally increase the space available. Assuming we have a 3 by 4TB raid-5, that gives us an 8TB array. Adding a fourth 4TB will give us a 12TB array.

Adding a 3TB device, assuming it were possible, would only give us an extra 1TB, as the formula for raid-5 capacity is (devices - 1) times the size of the smallest device. It is, however, unlikely to be possible: you would probably have to reduce the size of the array to 6TB (--size=3TB) first, before adding the new drive (which would then increase capacity to 9TB).

Adding that fourth drive has increased the size of the array, which we can take advantage of by resizing the filesystem on it (see the next section). But what happens if we've added a 5TB drive to our three 4TB drives? Unfortunately, that extra 1TB is wasted because this drive is larger than the others. So let's assume that, rather than adding a fourth (5TB) drive, we've actually replaced all the 4TB drives to give us a 3 x 5TB array. Unfortunately, we've still only got an 8TB array. We need to tell mdadm about the extra space available with

mdadm --grow /dev/mdN --size=max

The --size option tells the array how much of each disk to use. When an array is created this defaults to the size of the smallest drive, but when, as here, we have replaced all the drives with larger ones, we need to tell mdadm explicitly about the extra space. The "max" argument tells it, once again, to default to the size of the smallest drive currently in the array.

Note that there have been odd reports of --size=max not working. In those cases the problem was that the kernel did not know about the new disk size - make sure it does; you may have to re-hot-add the drive or reboot the computer.

Using the new space

If the array was originally partitioned, the new space will now be available to resize existing partitions or add new ones. If the array had a filesystem on it, the filesystem can now be expanded. Different filesystems have different commands, for example

resize2fs /dev/mdN
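
or, for an XFS filesystem, which is resized via its mount point (the path here is an example):

xfs_growfs /mnt/array
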
Back to Converting an existing system Forward to Scrubbing the drives