A guide to mdadm

Revision as of 22:21, 19 October 2017


This page is an overview of mdadm. It is NOT intended as a replacement for the man pages - anything covered in detail there will be skimmed over here. This is meant to provide examples that you can adapt for yourselves.

Overview

mdadm has replaced all the previous tools for managing raid arrays. It manages nearly all the user space side of raid. A few things still need to be done by writing to the /proc filesystem, but not many.

Getting mdadm

This is a pretty standard part of any distro, so you should use your standard distro software management tool. If, however, you are having any problems it does help to be running the absolute latest version, which can be downloaded with

git clone git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git

or from

https://kernel.org/pub/linux/utils/raid/mdadm/

In the absence of any other preferences, it belongs in the /usr/local/src directory. As a linux-specific program there is none of this autoconf stuff - just follow the instructions as per the INSTALL file.

Modes

mdadm has seven modes. You will normally only use a few of them. They are as follows:-

Assemble

This is probably the mode that is used most, but you won't be using it much - it happens in the background. Every time the system is booted, this needs to run. It scans the drives, looking for superblocks, and rebuilds all the arrays for you. This is why you need an initramfs when booting off a raid array - because mdadm is a user-space program, if root is on an array then we have a catch-22 - we can't boot until we have root, and we can't have root until we've booted and can run mdadm.

Create

This is the first of the two modes you will use a lot. As the name implies, it creates arrays, and writes the superblocks for arrays that have them. It also fires off initialisation - making sure that the disks of a mirror are identical, or that on a parity array the parities are correct. This is why raids 5&6 are created in degraded mode - if they weren't then any check of the raid would spew errors for areas that hadn't been written to.

Grow

A bit of a misnomer, this mode takes care of all operations that change the size or shape of an array, such as changing the raid level, changing the number of active devices, etc.

Manage

This is the default mode, and is used primarily to add and remove devices. It can be confusing, as several options (such as --add) are also used in grow mode, most typically when adding a device at the same time as changing the number of devices.

Follow or Monitor

Build

This is a relic of when superblocks didn't exist. It is used to (re)create an array, and should not be used unless you know exactly what you are doing. Because there are no superblocks, or indeed, any array metadata, the system just has to assume you got everything right because it has no way of checking up on you.

[TODO: Can you use this mode to create a temporary mirror, for the purpose of backing up your live data?]

Misc

This contains all the bits that don't really fit anywhere else.

Array internals and how it affects mdadm

Superblocks and Raid versions

The first arrays did not have a superblock, and were declared in "raidtab", which was managed by the now-defunct "raid-tools". This obviously could lead to a disaster if drives were moved between machines, as they were identified by their partition type. Adding a new drive could move the device names around and, relying on "raidtab", when the array was assembled it could get the wrong drives in the wrong place. Only linear and raid-0 were supported at this time, so there was no redundancy to recover from. mdadm still supports this sort of array with the --build option.

Presumably to fix this, the 0.9 version superblock was defined, stored at the end of the device. It's also referred to as the version 0 superblock, 0 referring to the internals of the superblock. However, it lacks support for most of the modern features of mdadm, and is now obsolete. It is also ambiguous, leading to sysadmin confusion, never a good idea. Arrays with a v0.9 superblock can be assembled by the kernel but again, if drives were moved between machines, the consequences could be disastrous as the array would be assembled incorrectly.

To fix this, a new version of the superblock was defined, version 1. The layout is common across all subversions, 1.0, 1.1 and 1.2. Autoassembly by the kernel was abandoned at this point - you must run an initramfs to boot off any v1 superblock other than a v1.0 mirror. v1 includes an array id which references the machine that created the array (eg ashdown:0) thus preventing drives being assembled into the wrong array (absent some pathological configuration).

Version 1.0 is also stored at the end of the device. This means that 0.9 can be upgraded to 1.0. It also means that, now that raid assembly is no longer supported in the kernel, the only supported way to boot from raid without an initramfs is to use a v1.0 mirror.

Version 1.1 is stored at the start of the device. This is not the best of places, as a wayward fdisk (or other programs) sometimes writes to the start of a disk and could destroy the superblock.

Version 1.2 is stored 4K from the start of the device.

Both 1.1 and 1.2 use the same algorithms to calculate the spare space left at the start of the device. Version 1.0 will leave 128K at the end of each device if it is large enough - "large enough" being defined as 200GB.

mdadm is unable to move the superblock, so there is no way of converting between the different version 1s.

There are also two other superblock formats, ddf and imsm. These are "industry standard", not linux specific, and aren't being covered in the 2016 rewrite.

These superblocks also define a "data offset". This is the gap between the start of the device, and the start of the data. This means that v1.2 must always have at least 4K per device, although it's normally several megabytes. This space can be used for all sorts of things, typically the write-intent bitmap, the bad blocks log, and a buffer space when reshaping an array. It is usually calculated automatically, but can be over-ridden.

All operations that involve moving data around are called reshapes, and require a temporary region (window) that can be written to without corrupting the array. One of the reasons for the v1 superblock and the data offset is to provide spare space for this. The preferred mechanism is to relocate the data up or down a few stripes, reading data from the live area and writing it into the window. Every few stripes, the superblock is updated, making the window "live" and freeing up the space that has just been relocated as the new window. This results in the data offset being shifted a few megabytes in either direction (which is one reason why you can NOT safely recreate a live array). If the system crashes or is halted during a reshape, it is assumed the window is corrupt, and the reshape can continue.

If the offset cannot be moved because there is no room or the superblock is v0.9, then a backup file must be provided, on a different partition. [TODO: Will changing the data offset prevent booting off a v1.0 mirror?] This has a whole host of downsides, not least that the reshape cannot restart automatically if interrupted, and that it requires twice as much disk io, as each stripe in turn needs to be backed up. At worst, it's easy to lose the array, because if the backup file is accidentally put on the array being reshaped, you get a catch-22. In order to mount the partition and access the backup file, the array needs to be running. But you can't start the array until after the reshape has restarted. And in order to restart the reshape, you need access to the backup file. When restarting after a crash [TODO: does it close cleanly in the event of a shutdown?] mdadm assumes the stripe being processed is corrupt and restores it from backup before proceeding.

Blocks, chunks and stripes

Changing these can be used to optimise the array, but the main way of doing so is to adjust the chunk size. The stripe size is controlled by the number of devices in the array, the linux block size is 4K, and the device block size is dictated by the hardware.

While not technically anything to do with raid, matching the linux and device block size is crucial. The device block size has been 512 bytes (half a kilobyte) since the days of 8" floppies, if not before. But as of now (2017) the standard device block size is 4096 bytes (4K). Some drives are 4K only, and some have a 512-4096 conversion layer. Getting this wrong is one of the biggest performance killers at the device block layer - linux works in 4K blocks, and if linux thinks the drive starts counting 512-byte blocks at 0, while the drive emulation starts counting at 1 (or vice versa), every time linux writes a 4K block, the emulation has to read two 4K blocks, modify both, and write both back.
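
As a rough illustration of the alignment problem, this shell sketch checks whether a starting offset, counted in 512-byte sectors, falls on a 4K boundary. The function name is made up for this example. On a running system the start sector of a partition can usually be read from /sys/block/<disk>/<partition>/start, but that is an assumption about your setup rather than anything mdadm does for you.

```sh
# Hypothetical helper: succeeds if an offset given in 512-byte sectors
# lies on a 4096-byte boundary (8 sectors of 512 bytes make one 4K block).
is_4k_aligned() {
    start_sector=$1
    [ $(( start_sector % 8 )) -eq 0 ]
}

# Sector 2048 is a multiple of 8, sector 63 is not.
for s in 2048 63; do
    if is_4k_aligned "$s"; then
        echo "sector $s: aligned"
    else
        echo "sector $s: misaligned"
    fi
done
```

A partition that starts on a misaligned sector makes every 4K write straddle two physical blocks, which is the read-modify-write penalty described above.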

The chunk size is the number of consecutive blocks written to each drive. It's a multiple of the linux 4K block size. Note that some raids restrict your choice of chunk size. For raid-5 it must be a power of two. Raid-0 doesn't care. Other raids may vary.

The stripe size is the chunk size multiplied by the number of drives. The stripe includes parity and/or mirror information so the data stored per stripe is usually less than the size of the stripe.
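
The arithmetic can be sketched in shell as below. The function names are invented for this example, the sizes are in KiB, and the 512K chunk on four drives is just a sample figure, not a recommendation.

```sh
# Hypothetical helpers; $1 = chunk size in KiB, $2 = number of drives.

# Full stripe: one chunk on every drive.
stripe_kib() {
    echo $(( $1 * $2 ))
}

# Data held by one raid-5 stripe: one chunk per stripe is parity.
raid5_data_per_stripe_kib() {
    echo $(( $1 * ($2 - 1) ))
}

stripe_kib 512 4                  # prints 2048
raid5_data_per_stripe_kib 512 4   # prints 1536
```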

You need to be careful when using these terms - just as block size may refer to linux or the device, the word "stripe" is also used in other ways. Inside the raid5 code it usually means a page per drive. The word "strip" is also used to mean one page per device. So just make sure you are aware of context as it may alter the meaning.

Near, Far, and offset layouts

These layouts apply primarily to raid 10, but the concept also applies to other raid layouts. Raid 10 stores the same chunk repeatedly across several drives. The near layout writes N copies of each chunk consecutively across X drives, so with N = 2 and X = 3 we get

      | Device #1 | Device #2 | Device #3 |
 ------------------------------------------
 0x00 |     0     |     0     |     1     |
 0x01 |     1     |     2     |     2     |
   :  |     :     |     :     |     :     |
   :  |     :     |     :     |     :     |

What is notable about this layout is that an array can be interpreted as several different variants of raid. If N == X, then this is also a raid-1 mirror. But if N == X == 2, we also have a raid-4 or raid-5 array with one data drive and one parity drive.
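
The near-layout table above can be reproduced with a few lines of shell. This is only a sketch of the numbering scheme as the table shows it, not a copy of what the kernel code does, and the function name is made up.

```sh
# Hypothetical helper: print the chunk number that lands on each device
# for one row of a raid-10 "near" layout.
#   $1 = row number (from 0), $2 = number of drives X, $3 = copies N
near_row() {
    row=$1; drives=$2; copies=$3
    out=""
    dev=0
    while [ "$dev" -lt "$drives" ]; do
        slot=$(( row * drives + dev ))    # position counted along the rows
        out="$out $(( slot / copies ))"   # each chunk fills N consecutive slots
        dev=$(( dev + 1 ))
    done
    echo "${out# }"
}

near_row 0 3 2   # prints: 0 0 1
near_row 1 3 2   # prints: 1 2 2
```

With N equal to X every slot in a row holds the same chunk, which is the raid-1 case mentioned above.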

Offset layout is similar, except instead of repeating chunks, it repeats stripes with the second and subsequent copies shifted

      | Device #1 | Device #2 | Device #3 | Device #4 |
 ------------------------------------------------------
 0x00 |     0     |     1     |     2     |     3     |
 0x01 |     1     |     2     |     3     |     0     |
 0x02 |     2     |     3     |     0     |     1     |
 0x03 |     4     |     5     |     6     |     7     |
 0x04 |     5     |     6     |     7     |     4     |
 0x05 |     6     |     7     |     4     |     5     |
   :  |     :     |     :     |     :     |     :     |
   :  |     :     |     :     |     :     |     :     |

Like the near layout, this layout too can be reshaped. Looking at the code, though, it appears that adding new drives is not permitted, although this should not be too difficult to implement.

Any attempt to reshape a far array, however, will fail immediately. The far layout looks like this:

      | Device #1 | Device #2 | Device #3 | Device #4 |
 ------------------------------------------------------
 0x00 |     0     |     1     |     2     |     3     |
 0x01 |     4     |     5     |     6     |     7     |
 0x02 |     8     |     9     |    10     |    11     |
   :  |     :     |     :     |     :     |     :     |
   :  |     :     |     :     |     :     |     :     |
 0x10 |     1     |     2     |     3     |     0     |
 0x11 |     5     |     6     |     7     |     4     |
 0x12 |     9     |    10     |    11     |     8     |

The difficulty with a far array is that because the copies are all over the array, it can be difficult to retain redundancy while reshaping. The array is effectively divided into as many parts as copies are required, and then each part holds one copy, suitably offset to ensure that any individual block is not duplicated on the same disk. It will probably be very tricky to program a reshape, because it is very difficult to ensure that a change to one part of the array does not stomp over another part of the array.

"Grow"ing non-parity raid arrays

Raid 0

Raid-0 basically glues disks together. Adding a new disk to a linear array just extends the array - the new space is tacked on the end. Adding a new drive to a striped array is more complicated - because the data is spread across all the drives, the array needs to be reshaped. With modern kernels and arrays, no backup file is needed as the new space will be used to store any necessary backup stripes. Also backup stripes will probably not be necessary as the data will be shifted up or down slightly within the devices that contain the array.

Removing a drive from a linear array is currently not possible. It would require moving all the data on or above the drive being removed out of the way to create a hole the size of the disk being removed.

For a striped array the data will need to be re-striped to move it off the disk being removed.

Raid-0 can be converted to raid-10, or to raid-4/5/6.

Raid 1

Raid 1 stores identical copies across all drives. As such, this has the simplest procedure for adding and removing drives. When a new drive is added, the data is copied across from one of the existing drives. When removing a drive, the drive can simply be deleted from the array.

Conversion to raid 4/5/6 is supported because a two-drive raid-1 is effectively a 1+parity raid-4/5, and raids 4, 5 and 6 differ solely in the number and placement of their parity blocks. The reverse transition is also possible, from a two-drive raid-5 to raid-1.

Raid 1+0

This is a raid 0, that is built up from raid-1 mirrors. As such, the individual raid elements can be grown or shrunk as above. There is also the inverted version, raid 0+1 which is a raid-1 mirror made from raid-0 arrays. Again the individual elements can be grown or shrunk as above.

Raid 10

Raid 10 is a special linux mode which stores multiple copies of the data across several drives. There must be at least as many drives as there are copies - if the two are equal it is equivalent to raid-1.

Far mode raid-10 is unique in that it scatters blocks all over the array, so it is very difficult to reshape and currently any attempt to do so will be rejected. It could be done if someone cares to code it.

Near and offset mode can have drives added and taken away. The array can also be swapped between near and offset modes, and the chunk size can be changed if it is within acceptable limits.

Conversion between raid-0 and raid-10 is supported - to convert to any other raid you will have to go via raid-0 - backup, BACKUP, BACKUP!!! The code does not appear to support conversion between raid-1 and raid-10 which should be easy for a two-device array.

"Grow"ing parity raid arrays

Raid 4

Raid 4 is rarely used. It really only makes sense if you have one fast drive (which you do not mind using for parity) and several slow ones. Each stripe consists of one or more data chunks followed by a parity chunk. The parity chunk is calculated using the XOR algorithm. If the two bits being compared are identical, XOR returns 0; if they are different it returns 1.

      | Device #1 | Device #2 | Device #3 | Device #4 |
 ------------------------------------------------------
 0x00 |     0     |     1     |     2     |    P1     |
 0x01 |     3     |     4     |     5     |    P2     |
 0x02 |     6     |     7     |     8     |    P3     |
   :  |     :     |     :     |     :     |     :     |
   :  |     :     |     :     |     :     |     :     |

Raid 4 stores all the parity chunks on one device. Here P1 = 0 XOR 1 XOR 2, and P2 = 3 XOR 4 XOR 5, where each number stands for the contents of that chunk. Because XOR'ing any value with itself gives 0, we know that P1 XOR 0 XOR 1 XOR 2 = 0, and likewise P2 XOR 3 XOR 4 XOR 5 = 0. This holds however many devices are in the array.

This means that, should we lose any one chunk, we can easily recreate it by XOR'ing all the remaining chunks together.
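
A quick way to convince yourself is to try it with shell arithmetic. The three byte values below are made up and stand in for whole chunks, and the function name is hypothetical. Real chunks are XOR'd in exactly the same way, just over many more bits.

```sh
# Hypothetical helper: XOR all the arguments together.
xor_all() {
    acc=0
    for v in "$@"; do
        acc=$(( acc ^ v ))
    done
    echo "$acc"
}

# Made-up contents for data chunks 0, 1 and 2.
d0=165; d1=60; d2=15

p1=$(xor_all "$d0" "$d1" "$d2")
echo "parity:    $p1"                                # prints parity:    150

# Pretend the drive holding chunk 1 has died: rebuild it from the rest.
echo "recovered: $(xor_all "$p1" "$d0" "$d2")"       # prints recovered: 60

# Data and parity XOR'd together always cancel out to zero.
echo "check:     $(xor_all "$p1" "$d0" "$d1" "$d2")" # prints check:     0
```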

It also means that, should we only have two devices, the data and parity chunks are identical, which gives us an array indistinguishable from a raid-1 mirror. It also gives us an array indistinguishable from a two-drive raid-5.

And mdadm will convert (with minor limitations) between raids 4, 5, and 6 without any difficulty.

Raid 5

Raid 5 is almost identical to raid-4. The difference is that the parity information is spread across all drives, not stored on just one. This spreads the load making the array more responsive and reducing wear on the parity drive.

      | Device #1 | Device #2 | Device #3 | Device #4 |
 ------------------------------------------------------
 0x00 |     0     |     1     |     2     |    P1     |
 0x01 |     3     |     4     |    P2     |     5     |
 0x02 |     6     |    P3     |     7     |     8     |
   :  |     :     |     :     |     :     |     :     |
   :  |     :     |     :     |     :     |     :     |
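
The rotation in the table above can be sketched in shell. This reproduces the placement exactly as drawn here, with the parity moving one device to the left on each row and the data chunks filling the remaining devices in order. Linux raid supports several raid-5 layouts, so treat this as an illustration of the idea rather than a description of what your array uses. The function name is invented.

```sh
# Hypothetical helper: print one row of the raid-5 table above.
#   $1 = row number (from 0), $2 = number of devices
raid5_row() {
    row=$1; ndev=$2
    pdev=$(( (ndev - 1) - (row % ndev) ))   # parity starts on the last device
    chunk=$(( row * (ndev - 1) ))           # first data chunk in this row
    out=""
    dev=0
    while [ "$dev" -lt "$ndev" ]; do
        if [ "$dev" -eq "$pdev" ]; then
            out="$out P$(( row + 1 ))"
        else
            out="$out $chunk"
            chunk=$(( chunk + 1 ))
        fi
        dev=$(( dev + 1 ))
    done
    echo "${out# }"
}

raid5_row 0 4   # prints: 0 1 2 P1
raid5_row 1 4   # prints: 3 4 P2 5
raid5_row 2 4   # prints: 6 P3 7 8
```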

Raid 6

The principle behind raid-6 is the same as that behind raids 5 and 4. The details, however, are markedly different. Because raid-6 uses more than one parity block, we cannot use the XOR algorithm as it does not provide the necessary mathematical independence. Instead, we use a mathematical construct called a Galois Field. This is explained very well in a blog by Igor Ostrovsky [1].

The mathematics allows the calculation of an arbitrary number of parity blocks, but linux raid only uses two, which we call P and Q. As with raid-5, the parity blocks are scattered amongst the devices, but there are six ways of scattering them. There is a four-way combination of left or right, symmetric or asymmetric, or there is parity_0 or parity_n.

      | Device #1 | Device #2 | Device #3 | Device #4 | Device #5 |
 ------------------------------------------------------------------
 0x00 |     0     |     1     |     2     |    P1     |    Q1     |
 0x01 |     3     |     4     |    P2     |    Q2     |     5     |
 0x02 |     6     |    P3     |    Q3     |     7     |     8     |
   :  |     :     |     :     |     :     |     :     |     :     |
   :  |     :     |     :     |     :     |     :     |     :     |

Because we have two mathematically independent parities, that means we can recover our data if we ever suffer two independent errors. In other words, if we lose two disks we can recalculate the two missing blocks. Or, if we suffer random corruption, if only one block is affected we can calculate which block it is, and what the correct value is. Every extra parity block enables us to recover an extra piece of damaged information.

Raid 6 is the only raid level where mdadm may complain that it cannot carry out a conversion directly between levels 4, 5 and 6. This is down to the way the parity blocks are scattered amongst the devices. The reshape is almost certainly possible, but it may require doing it in a couple of stages rather than directly.

Cookbook

Assembling your arrays

mdadm --assemble --scan

This is the command that runs in the background at boot, and assembles and runs all your arrays. If something goes wrong, you usually end up with a partially assembled array, which can be a right pain if you don't realise that's what's happened.
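
One way to spot such an array is to look for entries the kernel reports as inactive in /proc/mdstat. The helper below is a sketch with a made-up name. It reads mdstat-style text on stdin, and the sample input is invented for the example.

```sh
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# the name of every array listed as inactive.
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}

# Invented sample input; on a live system use: inactive_arrays < /proc/mdstat
inactive_arrays <<'EOF'
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md1 : inactive sdc1[0](S)
      104320 blocks
unused devices: <none>
EOF
# prints: md1
```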

Creating an array

Creating a mirror raid

The simplest example of creating an array is creating a mirror.

mdadm --create /dev/md/name /dev/sda1 /dev/sdb1 --level=1 --raid-devices=2

This will copy the contents of sda1 to sdb1 and give you a clean array. There is no reason why you can't use the array while it is copying (resyncing). This can be suppressed with the "--assume-clean" option, but you should only do this if you know the partitions have been wiped to null beforehand. Otherwise, the dead space will not be a mirror, and any check command will moan blue murder.

Creating a parity raid

Now let's create a more complicated example.

mdadm --create /dev/md/name /dev/sda1 /dev/sdb1 /dev/sdc1 --level=5 --raid-devices=3 --bitmap=internal

This, unsurprisingly, creates a raid 5 array. When creating the array, you must give it exactly the number of devices it expects, ie 3 here. So two of the drives will be assembled into a degraded array, and the third drive will be resync'd to fix the parity. If you want to add a fourth drive as a spare, this must be done later. An internal bitmap has been declared, so the superblock will keep track of which blocks have been updated and which blocks need to be updated. This means that if a drive gets kicked, for some reason, it can be re-added without needing a total resync.

A bitmap will be created by default if the array is over 100GB in size. Note that this is a fairly recent change, and if you are running on an old kernel you may have to delete the bitmap if you wish to use many of the "grow" options.

The raid by default will be created in degraded mode and will resync. This is because, unless all your drives are blank (just like creating a mirror) any integrity check will moan blue murder that the unused parts of your array contain garbage and the parity is wrong.

Growing an array

BACK UP. BACK UP !! BACK UP !!!!

You should not lose data - mdadm is designed to fail safe, and even when things go completely pear-shaped, the array should still assemble and run, letting you recover if the worst comes to the worst.

Note also that, if you do not have a modern kernel, these commands may fail with an error "Bitmap must be removed before size/shape/level can be changed".

Adding a drive to a mirror

This will add a new drive to your mirror. The "--grow" and "--raid-devices" options are optional. If you increase the number of raid devices, the new drive will become an active part of the array and the existing drives will mirror across. If you don't increase the number of raid devices, the new drive will be a spare, and will only become part of the active array if one of the other drives fails.

mdadm [--grow] /dev/md/mirror --add /dev/sdc1 [--raid-devices=3]

Upgrading a mirror raid to a parity raid

The following commands will convert a two-disk mirror into a degraded two-disk raid5, and then add the third disk for a fully functional raid5 array. Note that the first command will fail if it is run on the array we grew in the previous section and you changed the number of raid-devices to three. If you have two devices and a spare the first command on its own will work. The code will reject any attempt to change the level of a mirror with more than two active devices.

mdadm --grow /dev/md/mirror --level=5
mdadm --grow /dev/md/mirror --add /dev/sdc1 --raid-devices=3

Removing a disk from an array

This will convert the mirror from the first section into a degraded three-disk mirror, and then into a healthy two-disk mirror. Note that using OpenSUSE Leap 42 I had problems reducing the device count to 2.

mdadm /dev/md/mirror --fail /dev/sdc1 --remove /dev/sdc1
mdadm --grow /dev/md/mirror --raid-devices=2

If you have already removed a disk from a three-disk mirror, leaving it degraded, use only the second line (grow) to turn it into a healthy two-disk mirror (tested on openSUSE 42.2). The result can be verified with:

mdadm --detail /dev/mdxxx

Adding more space without adding another device

You might not have sufficient SATA ports to add any more disk drives, but you want to add more space to your array. The procedure is pretty much the same as for replacing a dodgy device.

If you have a three-disk mirror, you can just fail a drive and replace it with a larger drive, and repeat until all your drives have been replaced with bigger drives. If you've only got a two-drive mirror, the earlier advice to get an add-in SATA PCI card or somesuch applies. Either add the new drive then fail the old one, or put the old one in a USB cage on a temporary basis and --replace it.

If you have a raid-6, you can again just fail a drive and then add in a new one, but this is not the best idea. If you don't want to place your data at risk, then for a raid-5 you *must*, and for a raid-6 you *should*, use a spare SATA port to do a --replace, or swap your old drive into a USB cage so you can do a --replace that way.

Once you've increased the size of all the underlying devices, you can increase the size of the array, and then increase the size of the filesystem or partitions in the array.

Managing an array

Changing the size of the array

Adding a new device to a mirror won't change the size of the array - the new device will store an extra copy of the data and there won't be any extra space. But adding a device to a parity array will normally increase the space available. Assuming we have a 3 by 4TB raid-5, that gives us an 8TB array. Adding a fourth 4TB will give us a 12TB array.

Assuming it's possible, adding a 3TB device would only give us an extra 1TB, as the formula for raid-5 is devices-1 times size-of-smallest-device. This is, however, unlikely to be possible. To do so, you would probably have to reduce the size of the array to 6TB (--size=3TB) before adding the new drive (which would then increase capacity to 9TB).
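
The raid-5 formula above is easy to play with in shell. The function name is made up, and the sizes just need to be whole numbers in the same unit (TB here).

```sh
# Hypothetical helper: usable raid-5 size = (devices - 1) x smallest device.
# Arguments are the sizes of the member devices, all in the same unit.
raid5_capacity() {
    count=$#
    smallest=$1
    for size in "$@"; do
        if [ "$size" -lt "$smallest" ]; then
            smallest=$size
        fi
    done
    echo $(( (count - 1) * smallest ))
}

raid5_capacity 4 4 4     # prints 8
raid5_capacity 4 4 4 4   # prints 12
raid5_capacity 4 4 4 3   # prints 9
```

The last line shows why the smaller drive is a poor deal: every member is limited to the size of the smallest one.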

Adding that fourth drive has increased the size of the array, which we can take advantage of by resizing the file system on it (see the next section). But what happens if we've added a 5TB drive to our three 4TB drives? Unfortunately, that extra 1TB is wasted because this drive is larger than the others. So let's assume that rather than adding a fourth - 5TB - drive, we've actually replaced all three 4TB drives with 5TB drives. Unfortunately, we've still only got an 8TB array. We need to tell mdadm about the extra space available with

mdadm --grow /dev/mdN --size=max

The --size option tells the array how much of each disk to use. When an array is created it defaults to the size of the smallest drive but when, as here, we have replaced all the drives with larger drives we need to explicitly tell mdadm about the space. The "max" argument tells it, once again, to default to the size of the smallest drive currently in the array.

Note that there has been the odd report of --size=max not working. Make sure the kernel knows about the new disk size, as this was the problem - you may have to re-hot-add the drive or reboot the computer.

Using the new space

If the array was originally partitioned, the new space will now be available to resize existing partitions or add new ones. If the array had a filesystem on it, the filesystem can now be expanded. Different filesystems have different commands, for example

resize2fs /dev/mdN