Recovering a failed software RAID

From Linux Raid Wiki

Software RAID in Linux is well tested, but even with well-tested software, a RAID can fail.

In the following it is assumed that you have a software RAID where more disks have failed than the redundancy can cover.

So your /proc/mdstat looks something like this:

 md0 : active raid6 sdn1[6](S) sdm1[5] sdk1[3](F) sdj1[2] sdh1[1](F) sdg1[0](F)
     305664 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/2] [__U_U]

Here is a RAID6 that has lost 3 harddisks.


What happened?

This article will deal with the following case. It starts out as a perfect RAID6 (state 1):

 md0 : active raid6 sdn1[6](S) sdm1[5] sdk1[3] sdj1[2] sdh1[1] sdg1[0]
     305664 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]

For some unknown reason /dev/sdk1 fails and rebuild starts on the spare /dev/sdn1 (state 2):

 md0 : active raid6 sdn1[6] sdm1[5] sdk1[3](F) sdj1[2] sdh1[1] sdg1[0]
     305664 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/4] [UUU_U]
     [===>.................]  recovery = 16.0% (16744/101888) finish=1.7min speed=797K/sec

During the rebuild /dev/sdg1 fails, too. Now all redundancy is lost, and losing another data disk will fail the RAID. The rebuild on /dev/sdn1 continues (state 3):

 md0 : active raid6 sdn1[6] sdm1[5] sdk1[3](F) sdj1[2] sdh1[1] sdg1[0](F)
     305664 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/3] [_UU_U]
     [===========>.........]  recovery = 59.0% (60900/101888) finish=0.6min speed=1018K/sec

Before the rebuild finishes, yet another data harddisk (/dev/sdh1) fails, thus failing the RAID. The rebuild on /dev/sdn1 cannot continue, so /dev/sdn1 reverts to its status as spare (state 4):

 md0 : active raid6 sdn1[6](S) sdm1[5] sdk1[3](F) sdj1[2] sdh1[1](F) sdg1[0](F)
     305664 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/2] [__U_U]

This is the situation we are going to recover from. The goal is to get back to state 3 with minimal data loss.


Tools

We will be using the following tools:

GNU Parallel - http://www.gnu.org/software/parallel/ If it is not packaged for your distribution, install it with:

 wget -O - pi.dk/3 | bash

GNU ddrescue - http://www.gnu.org/software/ddrescue/ddrescue.html


Identifying the RAID

We will need the UUID of the array to identify the harddisks. This is especially important if you have multiple RAIDs connected to the system. Take the UUID from one of the non-failed harddisks (here /dev/sdj1):

 $ UUID=$(mdadm -E /dev/sdj1|perl -ne '/Array UUID : (\S+)/ and print $1')
 $ echo $UUID
 ef1de98a:35abe6d9:bcfa355a:d30dfc24

The harddisks have by now been kicked out of the array by the kernel and are no longer visible, so you need to make the kernel re-discover the devices. That can be done by re-seating the harddisks (if they are hotswap) or by rebooting. After the re-seating/rebooting the failed harddisks will often be given different device names.
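
If the harddisks sit on a SCSI or SATA controller, you can sometimes make the kernel rescan the bus without re-seating or rebooting. A sketch, assuming your controller supports rescanning (run as root; the host numbers depend on your system):

 for host in /sys/class/scsi_host/host*; do echo "- - -" > $host/scan; done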

We use the $UUID to identify the new device names:

 $ DEVICES=$(cat /proc/partitions | parallel --tagstring {5} --colsep ' +' mdadm -E /dev/{5} |grep $UUID | parallel --colsep '\t' echo /dev/{1})
 {5}     mdadm: cannot open /dev/{5}: No such file or directory
 sda1    mdadm: No md superblock detected on /dev/sda1.
 sdb1    mdadm: No md superblock detected on /dev/sdb1.
 $ echo $DEVICES
 /dev/sdj1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1
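
As a sanity check, the number of devices found should match the number of RAID members plus spares (6 in this example):

 $ echo $DEVICES | wc -w
 6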

Stop the RAID

You should now stop the RAID as that may otherwise cause problems later on:

 mdadm --stop /dev/md0

If you cannot stop the RAID (because it is mounted), note down the RAID ID and reboot.
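
After stopping the RAID (or after the reboot), check that md0 is no longer listed in /proc/mdstat. Note that after a reboot the system may auto-assemble the array again, in which case you must stop it once more:

 cat /proc/mdstat          # md0 should not be listed
 mdadm --stop /dev/md0     # only needed if it was auto-assembled again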

Check your hardware

Harddisks fall off a RAID for all sorts of reasons. Some of them are intermittent, so first we need to check if the harddisks are OK.

We do that by reading every sector on every single harddisk.

 parallel -j0 dd if={} of=/dev/null ::: $DEVICES
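
Reading a whole harddisk can take hours, so it pays to see immediately which harddisk (if any) fails. A hedged variant that tags the output per device and reports read errors explicitly:

 parallel -j0 --tag 'dd if={} of=/dev/null bs=1M || echo READ ERROR on {}' ::: $DEVICES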


Hardware error

If the reading fails for a harddisk, you need to copy that harddisk to a new harddisk. Do that using GNU ddrescue. ddrescue can read forwards (fast) and backwards (slow). This is useful because you can sometimes only read a sector if you read it from "the other side". If you give ddrescue a log-file, it will skip the parts that have already been copied successfully, so it is OK to reboot your system if the copying makes it hang: the copying will continue where it left off.

 ddrescue -r 3 /dev/old /dev/new my_log
 ddrescue -R -r 3 /dev/old /dev/new my_log

where /dev/old is the harddisk with errors and /dev/new is the new empty harddisk.

Re-test that you can now read all sectors from /dev/new using 'dd', and remove /dev/old from the system. Then recompute $DEVICES so it includes /dev/new:

 UUID=$(mdadm -E /dev/sdj1|perl -ne '/Array UUID : (\S+)/ and print $1')
 DEVICES=$(cat /proc/partitions | parallel --tagstring {5} --colsep ' +' mdadm -E /dev/{5} |grep $UUID | parallel --colsep '\t' echo /dev/{1})

Making the harddisks read-only using an overlay file

When trying to fix a broken RAID we may cause more damage, so we need a way to revert to the current situation. One way is to make a full harddisk-to-harddisk image of every harddisk. This is slow and requires a full set of empty disks.

A faster solution is to overlay every device with a file. All changes will be written to the file, and the actual device is left untouched. We need to make sure each file is big enough to hold all changes, but 'fsck' normally does not change a lot, so overlay files totalling around 1% of the used space in the RAID should fit in your local file system. If your filesystem supports big, sparse files, you can simply make a sparse overlay file for each harddisk, the same size as the harddisk.
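
A quick way to check whether your local filesystem supports big, sparse files is to create one and compare its apparent size with the space it actually uses (the file name is just an example):

 truncate -s4000G sparse-test
 ls -lh sparse-test   # apparent size: 4.0T
 du -h sparse-test    # actual usage: close to 0 if sparse files are supported
 rm sparse-test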

Each overlay file will need a loop-device, so create that:

 parallel 'test -e /dev/loop{#} || mknod -m 660 /dev/loop{#} b 7 {#}' ::: $DEVICES

Now create an overlay file for each device. Here it is assumed that your filesystem supports big, sparse files and the harddisks are 4TB. If that fails, create a smaller file (usually 1% of the harddisk capacity is sufficient):

 parallel truncate -s4000G overlay-{/} ::: $DEVICES

Set up the loop-devices and the overlay devices:

 parallel 'size=$(blockdev --getsize {}); loop=$(losetup -f --show -- overlay-{/}); echo 0 $size snapshot {} $loop P 8 | dmsetup create {/}' ::: $DEVICES

Now the overlay devices are in /dev/mapper/*. You can check their disk usage using:

 dmsetup status
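
For the remaining steps you want to work on the overlay devices instead of the raw harddisks, so it is convenient to collect them in a variable. A sketch, matching the names dmsetup created above:

 OVERLAYS=$(parallel echo /dev/mapper/{/} ::: $DEVICES)
 echo $OVERLAYS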

Reset overlay file

You may later need to reset the overlays to go back to the original situation. You do that by:

 parallel 'dmsetup remove {/}; rm overlay-{/}' ::: $DEVICES 
 parallel losetup -d ::: /dev/loop*

Notes below - must be fleshed out

Identify which drives are used for what

Identify the currently active harddisks and the last failing (but fully synced) hard disk.

 parallel --tag mdadm -E ::: $DEVICES |grep Upd

 /dev/loop1   Update Time : Fri May 3 12:50:04 2013
 /dev/loop10  Update Time : Fri May 3 12:50:04 2013
 /dev/loop11  Update Time : Tue Apr 30 13:20:42 2013   - first to fail
 /dev/loop12  Update Time : Fri May 3 12:48:42 2013    - second
 /dev/loop13  Update Time : Fri May 3 12:48:51 2013    - third
 /dev/loop14  Update Time : Fri May 3 12:49:49 2013    - fourth
 /dev/loop15  Update Time : Fri May 3 12:49:57 2013    - fifth
 /dev/loop16  Update Time : Fri May 3 12:50:04 2013
 /dev/loop17  Update Time : Fri May 3 12:50:04 2013
 /dev/loop18  Update Time : Fri May 3 12:50:04 2013
 /dev/loop19  Update Time : Fri May 3 12:50:04 2013

The harddisks with the newest Update Time were still active when the RAID failed; the ones with older Update Times failed earlier, in the order of their timestamps. The last harddisk to fail (here the one marked "fifth") is the most up to date of the failed ones, which matches our scenario of an active disk failing during the rebuild onto a spare.
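
The Events counter in the superblocks gives the same information and can be easier to compare: harddisks that dropped out earlier have a lower event count. A hedged alternative check:

 parallel --tag mdadm -E ::: $DEVICES | grep Events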

Force assembly

Set ACTIVE to the overlay devices of the harddisks that never failed, and LAST_FAIL to the overlay device of the last harddisk to fail (it is the most up to date of the failed ones). Then force the assembly:

 mdadm --assemble --force /dev/md0 $ACTIVE $LAST_FAIL

Do not use --assemble --scan here, as it may pick the real harddisks instead of the overlay devices.
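
A concrete sketch with made-up names, assuming /dev/mapper/sdj1, /dev/mapper/sdm1 and /dev/mapper/sdq1 are the overlays of the harddisks that never failed, and /dev/mapper/sdp1 is the overlay of the last harddisk to fail:

 ACTIVE="/dev/mapper/sdj1 /dev/mapper/sdm1 /dev/mapper/sdq1"
 LAST_FAIL="/dev/mapper/sdp1"
 mdadm --assemble --force /dev/md0 $ACTIVE $LAST_FAIL
 cat /proc/mdstat              # the array should now be up, in degraded mode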

File system check

Some file systems (xfs, for example) want to be mounted before fsck, so that the journal can be replayed. The procedure is therefore: mount (for xfs), umount (for xfs), fsck, mount again, and then check that your files are there.

If the files are not there, reset the overlay, set up the overlay again, and try different options. For xfs_repair that can include -L to zero the log. A sketch of the sequence is shown below.
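
A sketch of that sequence for xfs, assuming the assembled (overlay) array is /dev/md0 and /mnt/raid is an existing, empty mount point (both names are examples):

 mount /dev/md0 /mnt/raid      # xfs replays its log on mount
 umount /mnt/raid
 xfs_repair /dev/md0           # only add -L if xfs_repair refuses to run otherwise
 mount /dev/md0 /mnt/raid
 ls /mnt/raid                  # spot-check that your files are there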

If everything is OK, then do the same mdadm --assemble, mount, fsck procedure, but this time on the real harddisks without the overlay files.
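
A sketch of that final run, again for xfs, where ACTIVE_REAL and LAST_FAIL_REAL are hypothetical variables holding the same harddisks as before, but now the real devices from $DEVICES instead of the overlays:

 # stop the test assembly that used the overlay devices
 mdadm --stop /dev/md0
 # remove the overlays (as in "Reset overlay file" above)
 parallel 'dmsetup remove {/}; rm overlay-{/}' ::: $DEVICES
 parallel losetup -d ::: /dev/loop*
 # assemble, repair and mount on the real harddisks
 mdadm --assemble --force /dev/md0 $ACTIVE_REAL $LAST_FAIL_REAL
 mount /dev/md0 /mnt/raid && umount /mnt/raid    # replay the xfs log
 xfs_repair /dev/md0
 mount /dev/md0 /mnt/raid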
