Recovering a failed software RAID
Software RAID in Linux is well tested, but even with well-tested software, a RAID can fail.
In the following it is assumed that you have a software RAID where more disks have failed than the redundancy can tolerate.
So your /proc/mdstat looks something like this:
md3 : active raid6 loop3[10](S) loop39[9] loop38[8] loop37[7] loop36[6] loop35[5] loop34[4] loop33[3](F) loop32[2](F) loop31[1](F) loop30[0]
7168 blocks super 1.2 level 6, 128k chunk, algorithm 2 [10/7] [U___UUUUUU]
This is a RAID6 that has lost 3 harddisks, one more than RAID6 can tolerate.
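To see quickly how many members each array is missing, the [expected/active] field can be pulled out of /proc/mdstat. The following is a minimal sketch: the function name md_missing is made up for this example, and it assumes the status line format shown above.

```shell
# Print "<array> <expected> <active> <missing>" for every md array.
# Usage: md_missing [FILE]   (FILE defaults to /proc/mdstat)
md_missing() {
  awk '
    /^md[0-9]+ :/ { name = $1 }                # remember the array name
    match($0, /\[[0-9]+\/[0-9]+\]/) {          # the [expected/active] field
      split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
      print name, n[1], n[2], n[1] - n[2]
    }
  ' "${1:-/proc/mdstat}"
}
```

For the example above this prints "md3 10 7 3": 10 devices expected, 7 active, 3 missing.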
Check your hardware
Harddisks fall off a RAID for all sorts of reasons. Some of them are intermittent, so first we need to check if the harddisks are OK. We do that by reading every single harddisk.
DEVICES="/dev/sdc /dev/sdd /dev/sdf"
parallel -j0 dd if={} of=/dev/null ::: $DEVICES
If you do not have GNU Parallel, it can be installed by:
wget -O - pi.dk/3 | bash
Hardware error
If the reading fails for a drive, you need to copy that drive to a new drive. Do that using GNU ddrescue. ddrescue can read forwards (fast) and backwards (slow). This is useful since you can sometimes read a sector if you read it from "the other side". If you give ddrescue a log file, it will skip the parts that have already been copied successfully. This makes it safe to reboot your system if the copying makes it hang.
ddrescue -r 3 /dev/old /dev/new my_log
ddrescue -R -r 3 /dev/old /dev/new my_log
where /dev/old is the harddisk with errors and /dev/new is the new empty harddisk.
Re-test that you can now read all sectors using 'dd', and remove /dev/old from the system.
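If you do not want to use GNU Parallel for the re-test, a plain loop around dd does the same job sequentially. This is only a sketch: read_test is a made-up helper name, and bs=1M is an assumption chosen for speed.

```shell
# Read every given device from start to end and report the ones that fail.
# Returns non-zero if at least one device could not be read completely.
read_test() {
  local dev failed=0
  for dev in "$@"; do
    if dd if="$dev" of=/dev/null bs=1M 2>/dev/null; then
      echo "OK   $dev"
    else
      echo "FAIL $dev"
      failed=1
    fi
  done
  return "$failed"
}

# Example (device names are placeholders):
# read_test /dev/new /dev/sdc /dev/sdd
```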
Notes below - must be fleshed out
Making the devices read-only
http://unix.stackexchange.com/questions/67678/gnu-linux-overlay-block-device-stackable-block-device
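One way to build such an overlay is a device-mapper snapshot whose copy-on-write store is a sparse file: everything written through the overlay ends up in the file, not on the original disk. The sketch below assumes root, dmsetup, losetup and blockdev. The helper names, the 4G default size and the device names in the example are assumptions made for illustration.

```shell
# Build a device-mapper "snapshot" table line.
# $1 = origin device, $2 = copy-on-write device, $3 = origin size in 512-byte sectors
snapshot_table() {
  echo "0 $3 snapshot $1 $2 P 8"   # P = persistent store, 8 = chunk size in sectors
}

# Create /dev/mapper/<name> as a writable overlay on top of <dev>.
# Usage: make_overlay DEV NAME COW_FILE [COW_SIZE]
make_overlay() {
  local dev=$1 name=$2 cow=$3 cow_size=${4:-4G} loop sectors
  truncate -s "$cow_size" "$cow"          # sparse file, uses space only when written
  loop=$(losetup -f --show "$cow")        # attach the file to a free loop device
  sectors=$(blockdev --getsz "$dev")      # size of the original in 512-byte sectors
  dmsetup create "$name" --table "$(snapshot_table "$dev" "$loop" "$sectors")"
}

# Example (placeholders):
# make_overlay /dev/sdc sdc /tmp/overlay-sdc
# -> use /dev/mapper/sdc instead of /dev/sdc from now on
```

To go back to the unmodified state, remove the mapping (dmsetup remove), detach the loop device (losetup -d), delete the file and create the overlay again.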
Identify which drives are used for what
Identify the currently active harddisks and the last failing (but fully synced) hard disk.
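A common way to do this is to compare the event counters stored in the md superblocks: the active members share the highest count, and the disk that failed last has the count closest to theirs. The sketch below assumes that mdadm --examine prints a line of the form "Events : N", as it does for version 1.2 superblocks. The helper names are made up.

```shell
# Extract the event counter from `mdadm --examine` output read on stdin.
event_count() {
  awk -F: '/^ *Events/ { gsub(/ /, "", $2); print $2; exit }'
}

# Print "<events> <device>" for every given device, lowest count first.
# Needs root, because mdadm has to read the superblocks.
list_events() {
  local dev
  for dev in "$@"; do
    echo "$(mdadm --examine "$dev" | event_count) $dev"
  done | sort -n
}

# Example (placeholders):
# list_events /dev/sdc /dev/sdd /dev/sdf
```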
Force assembly
mdadm --assemble --force with the devices listed explicitly (--scan may give incorrect results due to the overlay devices).
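As a sketch of what this could look like, the helper below builds the mdadm command from an explicit device list. It only prints the command unless RUN=1 is set, so you can check the device list before anything is touched. The helper name and the device names in the example are assumptions. The array must not be running when you assemble it, so stop it first if needed (mdadm --stop).

```shell
# Assemble an array from explicitly listed members.
# By default the command is only printed; set RUN=1 to execute it (needs root).
# Usage: force_assemble MD_DEVICE MEMBER...
force_assemble() {
  local md=$1
  shift
  set -- mdadm --assemble --force "$md" "$@"
  if [ "${RUN:-0}" = 1 ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Example with overlay devices (placeholders):
# force_assemble /dev/md3 /dev/mapper/sdc /dev/mapper/sdd /dev/mapper/sdf
# RUN=1 force_assemble /dev/md3 /dev/mapper/sdc /dev/mapper/sdd /dev/mapper/sdf
```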
File system check
Some file systems (xfs) want to be mounted before fsck.
mount (for xfs)
umount (for xfs)
fsck
mount
Check that the files are there. If not, reset the overlays to their unmodified state and try different options. For xfs_repair that can include -L to zero (discard) the log.
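The sequence above could be wrapped like this. It is a sketch under assumptions: the helper names run and fs_check are made up, the file system type is passed in by hand, and nothing is executed unless RUN=1 is set (the commands are printed instead).

```shell
# Print a command, or execute it when RUN=1 is set.
run() {
  if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# Check the file system on DEV and mount it on MNT afterwards.
# Usage: fs_check DEV MNT FSTYPE
fs_check() {
  local dev=$1 mnt=$2 fstype=$3
  if [ "$fstype" = xfs ]; then
    # Mounting and unmounting lets XFS replay its log before the repair.
    run mount "$dev" "$mnt" && run umount "$mnt" && run xfs_repair "$dev"
  else
    run fsck "$dev"
  fi && run mount "$dev" "$mnt"
}

# Example (placeholders):
# fs_check /dev/md3 /mnt xfs
```

If a step fails when run for real, the chain stops there, and you can reset the overlays before trying other options.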