A guide to mdadm

Revision as of 21:31, 18 October 2017

Back to Converting an existing system Forward to Scrubbing the drives

This page is an overview of mdadm. It is NOT intended as a replacement for the man pages - anything covered in detail there will be skimmed over here. This is meant to provide examples that you can adapt for yourselves.



mdadm has replaced all the previous tools for managing raid arrays. It manages nearly all the user space side of raid. There are a few things that need to be done by writing to the /proc filesystem, but not much.

== Getting mdadm ==

This is a pretty standard part of any distro, so you should use your standard distro software management tool. If, however, you are having any problems it does help to be running the absolute latest version, which can be downloaded with

git clone git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git

or from


In the absence of any other preferences, it belongs in the /usr/local/src directory. As a Linux-specific program there is none of the autoconf stuff - just follow the instructions in the INSTALL file.


== Modes ==

mdadm has seven modes. You will normally only use a few of them. They are as follows:-


=== Assemble ===

This is probably the mode that is used most, but you won't be using it much - it happens in the background. Every time the system is booted, this needs to run. It scans the drives, looking for superblocks, and rebuilds all the arrays for you. This is why you need an initramfs when booting off a raid array - because mdadm is a user-space program, if root is on an array then we have a catch-22 - we can't boot until we have root, and we can't have root until we've booted and can run mdadm.


=== Create ===

This is the first of the two modes you will use a lot. As the name implies, it creates arrays, and writes the superblocks for arrays that have them. It also fires off initialisation - making sure that the disks of a mirror are identical, or that on a parity array the parities are correct. This is why raids 5 & 6 are created in degraded mode - if they weren't, then any check of the raid would spew errors for areas that hadn't been written to.


=== Grow ===

A bit of a misnomer, this mode takes care of all operations that change the size or shape of an array, such as changing the raid level, changing the number of active devices, and so on.


=== Manage ===

This is the default mode, and is used primarily to add and remove devices. It can be confusing, as several options (such as --add) are also used in grow mode, most typically when adding a device at the same time as changing the number of devices.

=== Follow or Monitor ===

This mode watches one or more running arrays and reports or reacts to events such as a drive failure, typically by mailing the administrator or running a nominated program.


=== Build ===

This is a relic of when superblocks didn't exist. It is used to (re)create an array, and should not be used unless you know exactly what you are doing. Because there are no superblocks, or indeed any array metadata, the system just has to assume you got everything right because it has no way of checking up on you.

[TODO: Can you use this mode to create a temporary mirror, for the purpose of backing up your live data?]


=== Misc ===

This contains all the bits that don't really fit anywhere else.

== Array internals and how it affects mdadm ==

=== Superblocks and Raid versions ===

The first arrays did not have a superblock, and were declared in "raidtab", which was managed by the now-defunct "raid-tools". This obviously could lead to a disaster if drives were moved between machines, as they were identified by their partition type. Adding a new drive could move the device names around and, relying on "raidtab", when the array was assembled it could get the wrong drives in the wrong place. Only linear and raid-0 were supported at this time, so there was no redundancy to recover from. mdadm still supports this sort of array with the --build option.

Presumably to fix this, the 0.9 version superblock was defined, stored at the end of the device. It's also referred to as the version 0 superblock, 0 referring to the internals of the superblock. However, it lacks support for most of the modern features of mdadm, and is now obsolete. It is also ambiguous, leading to sysadmin confusion, never a good idea. Arrays with a v0.9 superblock can be assembled by the kernel but again, if drives were moved between machines, the consequences could be disastrous as the array would be assembled incorrectly.

To fix this, a new version of the superblock was defined, version 1. The layout is common across all subversions, 1.0, 1.1 and 1.2. Autoassembly by the kernel was abandoned at this point - you must run an initramfs to boot off any v1 superblock other than a v1.0 mirror. v1 includes an array id which references the machine that created the array (eg ashdown:0) thus preventing drives being assembled into the wrong array (absent some pathological configuration).

Version 1.0 is also stored at the end of the device. This means that 0.9 can be upgraded to 1.0. It also means that, now that raid assembly is no longer supported in the kernel, the only supported way to boot from raid without an initramfs is to use a v1.0 mirror.

Version 1.1 is stored at the start of the device. This is not the best of places, as a wayward fdisk (or other programs) sometimes writes to the start of a disk and could destroy the superblock.

Version 1.2 is stored 4K from the start of the device.

Both 1.1 and 1.2 use the same algorithms to calculate the spare space left at the start of the device. Version 1.0 will leave 128K at the end of each device if it is large enough - "large enough" being defined as 200GB.

mdadm is unable to move the superblock, so there is no way of converting between the different version 1s.

There are also two other superblock formats, ddf and imsm. These are "industry standard", not linux specific, and aren't being covered in the 2016 rewrite.

These superblocks also define a "data offset". This is the gap between the start of the device, and the start of the data. This means that v1.2 must always have at least 4K per device, although it's normally several megabytes. This space can be used for all sorts of things, typically the write-intent bitmap, the bad blocks log, and a buffer space when reshaping an array. It is usually calculated automatically, but can be over-ridden.

All operations that involve moving data around are called reshapes, and require a temporary region (window) that can be written to without corrupting the array. One of the reasons for the v1 superblock and the data offset is to provide spare space for this. The preferred mechanism is to relocate the data up or down a few stripes, reading data from the live area and writing it into the window. Every few stripes, the superblock is updated, making the window "live" and freeing up the space that has just been relocated as the new window. This results in the data offset being shifted a few megabytes in either direction (which is one reason why you can NOT safely recreate a live array). If the system crashes or is halted during a reshape, it is assumed the window is corrupt, and the reshape can continue.

If the offset cannot be moved because there is no room or the superblock is v0.9, then a backup file must be provided, on a different partition. [TODO: Will changing the data offset prevent booting off a v1.0 mirror?] This has a whole host of downsides, not least that the reshape cannot restart automatically if interrupted, and that it requires twice as much disk io, as each stripe in turn needs to be backed up. At worst, it's easy to lose the array, because if the backup file is accidentally put on the array being reshaped, you get a catch-22. In order to mount the partition and access the backup file, the array needs to be running. But you can't start the array until after the reshape has restarted. And in order to restart the reshape, you need access to the backup file. When restarting after a crash [TODO: does it close cleanly in the event of a shutdown?] mdadm assumes the stripe being processed is corrupt and restores it from backup before proceeding.

=== Blocks, chunks and stripes ===

Changing these can be used to optimise the array, but the main tunable is the chunk size: the stripe size is controlled by the number of devices in the array, and the block size is dictated by the hardware.

Getting the correct block size is crucial. The block size had been 512 bytes (half a kilobyte) since the days of 8" floppies, if not before, but as of now (2017) the standard block size is 4096 bytes (4K). Some drives are 4K-only, and some have a 512-to-4096 emulation layer. Getting this wrong is one of the biggest performance killers at the block layer - linux normally works in 4K blocks, and if linux thinks the drive starts counting 512-byte sectors at 0 while the drive's emulation starts counting at 1 (or vice versa), every time linux writes a 4K block the emulation has to read two 4K blocks, modify both, and write both back.
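To make the alignment point concrete, here is a minimal sketch: a partition is 4K-aligned when its start sector (counted in 512-byte units) is a multiple of eight. The start sector below is an assumed example value, not read from a real disk - in practice take it from the output of fdisk -l.

```shell
# A sector is 512 bytes, so 4096 / 512 = 8 sectors per 4K block.
# start_sector is an assumed example value; read the real one from fdisk -l.
start_sector=2048
if [ $(( start_sector % 8 )) -eq 0 ]; then
  echo "partition is 4K-aligned"
else
  echo "partition is misaligned - expect read-modify-write penalties"
fi
```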

The chunk size is the number of consecutive blocks written to each drive. It's a multiple of the linux 4K block size.

The stripe size is the chunk size multiplied by the number of drives. The stripe includes parity and/or mirror information, so the data stored per stripe is usually less than the size of the stripe.
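As a worked example (the values here are assumptions for illustration, not mdadm output): a three-device raid-5 with 512K chunks gives a 1536K stripe, of which 1024K holds data.

```shell
# Assumed example: a 3-device raid-5 with a 512K chunk size.
chunk_kib=512
devices=3
stripe_kib=$(( chunk_kib * devices ))        # whole stripe, parity included
data_kib=$(( chunk_kib * (devices - 1) ))    # usable data per stripe (raid-5)
echo "stripe: ${stripe_kib}K, data per stripe: ${data_kib}K"
# -> stripe: 1536K, data per stripe: 1024K
```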

=== Near, Far, and offset layouts ===

These layouts apply primarily to raid 10, but the concept also applies to other raid layouts. Raid 10 stores the same chunk repeatedly across several drives. The near layout writes N copies of each chunk consecutively across X drives, so we get

      | Device #1 | Device #2 | Device #3 |
 0x00 |     0     |     0     |     1     |
 0x01 |     1     |     2     |     2     |
   :  |     :     |     :     |     :     |
   :  |     :     |     :     |     :     |

What is notable about this layout is that an array can be interpreted as several different variants of raid. If N == X, then this is also a raid-1 mirror. But if N == X == 2, we also have a raid-4 or raid-5 array with one data drive and one parity drive.
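The near placement can be described with a little modular arithmetic. These formulas are a reading of the table above, not taken from the md driver source: with N copies on X drives, copy k of chunk c occupies slot c*N + k, which lands on device (c*N + k) mod X.

```shell
# Near layout as in the table above: X=3 devices, N=2 copies.
# Formulas inferred from the table, not from the md driver source.
X=3; N=2
c=1
dev0=$(( (c * N) % X ))        # first copy of chunk 1
dev1=$(( (c * N + 1) % X ))    # second copy of chunk 1
echo "chunk $c: devices #$(( dev0 + 1 )) and #$(( dev1 + 1 ))"
# -> chunk 1: devices #3 and #1, matching the table
```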

Offset layout is similar, except that instead of repeating chunks, it repeats stripes, with the second and subsequent copies shifted:

      | Device #1 | Device #2 | Device #3 | Device #4 |
 0x00 |     0     |     1     |     2     |     3     |
 0x01 |     1     |     2     |     3     |     0     |
 0x02 |     2     |     3     |     0     |     1     |
 0x03 |     4     |     5     |     6     |     7     |
 0x04 |     5     |     6     |     7     |     4     |
 0x05 |     6     |     7     |     4     |     5     |
   :  |     :     |     :     |     :     |     :     |
   :  |     :     |     :     |     :     |     :     |

Like the near layout, the offset layout can also be reshaped. Any attempt to reshape a far array, however, will fail immediately. The difficulty with a far array is that because the copies are spread all over the array, it is difficult to retain redundancy while reshaping. The array is effectively divided into as many parts as copies are required, and then each part holds one copy, suitably offset to ensure that any individual chunk is not duplicated on the same disk.

      | Device #1 | Device #2 | Device #3 | Device #4 |
 0x00 |     0     |     1     |     2     |     3     |
 0x01 |     4     |     5     |     6     |     7     |
 0x02 |     8     |     9     |    10     |    11     |
   :  |     :     |     :     |     :     |     :     |
   :  |     :     |     :     |     :     |     :     |
 0x00 |     1     |     2     |     3     |     0     |
 0x01 |     5     |     6     |     7     |     4     |
 0x02 |     9     |    10     |    11     |     8     |
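The far placement in these tables can likewise be expressed as arithmetic (again inferred from the tables, not from the md source): the array is divided into N sections, section k holding copy k of every chunk, with chunk c landing on device (c - k) mod X within its section.

```shell
# Far layout as in the tables above: X=4 devices, N=2 copies.
# Formulas inferred from the tables, not from the md driver source.
X=4
for c in 0 1 2 3; do
  d0=$(( c % X ))               # copy 0, first section
  d1=$(( (c - 1 + X) % X ))     # copy 1, second section, shifted one device
  echo "chunk $c: copy 0 on device #$(( d0 + 1 )), copy 1 on device #$(( d1 + 1 ))"
done
```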

== Cookbook ==

=== Assembling your arrays ===

 mdadm --assemble --scan

This is the command that runs in the background at boot, and assembles and runs all your arrays. If something goes wrong you usually end up with a partially assembled array, which can be a right pain if you don't realise that's what's happened.

=== Creating an array ===

==== Creating a mirror raid ====

The simplest example of creating an array is creating a mirror.

 mdadm --create /dev/md/name /dev/sda1 /dev/sdb1 --level=1 --raid-devices=2

This will copy the contents of sda1 to sdb1 and give you a clean array. There is no reason why you can't use the array while it is copying (resyncing). The resync can be suppressed with the "--assume-clean" option, but you should only do this if you know the partitions have been wiped to null beforehand. Otherwise, the dead space will not be mirrored, and any check command will moan blue murder.

==== Creating a parity raid ====

Now let's create a more complicated example.

 mdadm --create /dev/md/name /dev/sda1 /dev/sdb1 /dev/sdc1 --level=5 --raid-devices=3 --bitmap=internal

This, unsurprisingly, creates a raid 5 array. When creating the array, you must give it exactly the number of devices it expects, ie 3 here. So two of the drives will be assembled into a degraded array, and the third drive will be resync'd to fix the parity. If you want to add a fourth drive as a spare, this must be done later. An internal bitmap has been declared, so the superblock will keep track of which blocks have been updated and which blocks need to be updated. This means that if a drive gets kicked for some reason, it can be re-added without needing a total resync.

A bitmap will be created by default if the array is over 100GB in size. Note that this is a fairly recent change, and if you are running on an old kernel you may have to delete the bitmap if you wish to use many of the "grow" options.

The raid by default will be created in degraded mode and will resync. This is because, unless all your drives are blank (just like creating a mirror) any integrity check will moan blue murder that the unused parts of your array contain garbage and the parity is wrong.

=== Growing an array ===


You should not lose data - mdadm is designed to fail safe, and even when things go completely pear-shaped, the array should still assemble and run, letting you recover if the worst comes to the worst. 

Note also that, if you do not have a modern kernel, these commands may fail with an error "Bitmap must be removed before size/shape/level can be changed".

==== Adding a drive to a mirror ====

This will add a new drive to your mirror. The "--grow / --raid-devices" pair is optional: if you increase the number of raid devices, the new drive will become an active part of the array and the existing drives will mirror across to it. If you don't increase the number of raid devices, the new drive will be a spare, and will only become part of the active array if one of the other drives fails.

 mdadm [--grow] /dev/md/mirror --add /dev/sdc1 [--raid-devices=3]

==== Upgrading a mirror raid to a parity raid ====

The following commands will convert a two-disk mirror into a degraded two-disk raid5, and then add the third disk for a fully functional raid5 array. Note that the first command will fail on the array we grew in the previous section if you increased the number of raid devices to three. If you have two active devices and a spare, the first command on its own will work. I haven't investigated whether more than two active devices plus a spare will work.

 mdadm --grow /dev/md/mirror --level=5
 mdadm --grow /dev/md/mirror --add /dev/sdc1 --raid-devices=3

==== Removing a disk from an array ====

This will convert the mirror from the first section into a degraded three-disk mirror, and then into a healthy two-disk mirror. Note that using OpenSUSE Leap 42 I had problems reducing the device count to 2.

 mdadm /dev/md/mirror --fail /dev/sdc1 --remove /dev/sdc1
 mdadm --grow /dev/md/mirror --raid-devices=2

If you have already removed a disk from a three-disk mirror, leaving it running as a degraded two-disk mirror, use only the second (grow) line to fix the degraded mode (tested on openSUSE 42.2). It can be verified with:

 mdadm --detail /dev/mdxxx

==== Adding more space without adding another device ====

You might not have sufficient SATA ports to add any more disk drives, but you want to add more space to your array. The procedure is pretty much the same as for replacing a dodgy device.

If you have a three-disk mirror, you can just fail a drive and replace it with a larger drive, and repeat until all your drives have been replaced with bigger drives. If you've only got a two-drive mirror, the earlier advice to get an add-in SATA PCI card or somesuch applies. Either add the new drive then fail the old one, or put the old one in a USB cage on a temporary basis and --replace it.

If you have a parity raid, then with raid-6 you can again just fail a drive and add in a new one, but this is not the best idea. If you don't want to place your data at risk, then for a raid-5 you *must*, and for a raid-6 you *should*, use a spare SATA port to do a --replace, or swap your old drive into a USB cage so you can do a --replace that way.

Once you've increased the size of all the underlying devices, you can increase the size of the array, and then increase the size of the filesystem or partitions in the array.

=== Managing an array ===

==== Changing the size of the array ====

Adding a new device to a mirror won't change the size of the array - the new device will store an extra copy of the data and there won't be any extra space. But adding a device to a parity array will normally increase the space available. Assuming we have a 3 by 4TB raid-5, that gives us an 8TB array. Adding a fourth 4TB will give us a 12TB array.

Assuming it's possible, adding a 3TB device would only give us an extra 1TB, as the formula for raid-5 capacity is devices-1 times size-of-smallest-device. This is, however, unlikely to be possible. To do so, you would probably have to reduce the size of the array to 6TB (--size=3TB) before adding the new drive (which would then increase capacity to 9TB).
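The capacity arithmetic above can be sketched in the shell. The sizes are the example values from the text (three 4TB drives plus the hypothetical 3TB addition), in TB:

```shell
# Raid-5 capacity: (number of devices - 1) * size of the smallest device.
# The sizes below are the example values from the text, in TB.
sizes="4 4 4 3"
smallest=""; count=0
for s in $sizes; do
  count=$(( count + 1 ))
  if [ -z "$smallest" ] || [ "$s" -lt "$smallest" ]; then
    smallest=$s
  fi
done
echo "capacity: $(( (count - 1) * smallest ))TB"
# -> capacity: 9TB
```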

Adding that fourth drive has increased the size of the array, which we can take advantage of by resizing the file system on it (see the next section). But what happens if we've added a 5TB drive to our three 4TB drives? Unfortunately, that extra 1TB is wasted because this drive is larger than the others. So let's assume that, rather than adding a fourth 5TB drive, we've actually replaced all the 4TB drives with 5TB ones, giving us three 5TB drives. Unfortunately, we've still only got an 8TB array. We need to tell mdadm about the extra space available with

 mdadm --grow /dev/mdN --size=max

The --size option tells the array how much of each disk to use. When an array is created it defaults to the size of the smallest drive but when, as here, we have replaced all the drives with larger drives we need to explicitly tell mdadm about the space. The "max" argument tells it, once again, to default to the size of the smallest drive currently in the array.

Note that there have been occasional reports of --size=max not working. Make sure the kernel knows about the new disk size, as this was the cause - you may have to re-hot-add the drive or reboot the computer.

==== Using the new space ====

If the array was originally partitioned, the new space will now be available to resize existing partitions or add new ones. If the array had a filesystem on it, the filesystem can now be expanded. Different filesystems have different commands, for example

 resize2fs /dev/mdN
