If you have read the rest of this HOWTO, you should already have a pretty good idea about what reconstruction of a degraded RAID involves. Let us summarize:
- Power down the system
- Replace the failed disk
- Power up the system once again
- Use raidhotadd /dev/mdX /dev/sdX (or, with mdadm, mdadm /dev/mdX --add /dev/sdX) to re-insert the disk in the array
- Have coffee while you watch the automatic reconstruction running
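While you have that coffee, the kernel reports the rebuild progress in /proc/mdstat. The following is a minimal Python sketch of how the progress line can be picked apart. The sample text (device names and numbers) and the helper name parse_progress are made up for illustration; on a live system you would read /proc/mdstat instead of the sample string.

```python
import re

# Illustrative sample of /proc/mdstat during a rebuild.
# Device names and numbers are invented for this example.
SAMPLE = """\
md0 : active raid1 sdb1[2] sda1[0]
      292945152 blocks [2/1] [U_]
      [==>..................]  recovery = 12.6% (37043392/292945152) finish=127.5min speed=33440K/sec
"""

PROGRESS = re.compile(
    r"(?P<kind>recovery|resync)\s*=\s*(?P<percent>[\d.]+)%\s*"
    r"\((?P<done>\d+)/(?P<total>\d+)\)\s*"
    r"finish=(?P<finish>[\d.]+)min\s*speed=(?P<speed>\d+)K/sec"
)

def parse_progress(text):
    """Return the rebuild progress found in mdstat-style text, or None."""
    match = PROGRESS.search(text)
    if match is None:
        return None  # no rebuild is running (or the format differs)
    return {
        "kind": match.group("kind"),
        "percent": float(match.group("percent")),
        "done": int(match.group("done")),
        "total": int(match.group("total")),
        "finish_min": float(match.group("finish")),
        "speed_k_per_s": int(match.group("speed")),
    }

progress = parse_progress(SAMPLE)
print(progress)
```

On a real machine, parse_progress(open("/proc/mdstat").read()) would be the obvious way to use it, assuming your kernel prints the line in this form.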
In some cases, the "good" disk does not have a boot block, as might happen if the failed disk is the "first" one, e.g. the hda or sda device. In this case you might not be able to boot the system. Try to reconstruct the MBR with the boot loader of your choice. The installation disk of your Linux distribution might have a rescue mode that can assist you in this task. There is also a bootable tool called "Super Grub Disk" (http://www.supergrubdisk.org/) which boots stray Linux installations in seconds.
And that's it.
Well, it usually is, unless you're unlucky and your RAID has been rendered unusable because more disks failed than the array has redundancy for. This can actually happen if a number of disks reside on the same bus, and one disk takes the bus with it as it crashes. The other disks, however fine, will be unreachable to the RAID layer, because the bus is down, and they will be marked as faulty. On a RAID-5, where you can spare only one disk, losing two or more disks can be fatal.
Recovery from a multiple disk failure
This section is the explanation that Martin Bene gave to me, and describes a possible recovery from the scary scenario outlined above. It involves using the failed-disk directive in your /etc/raidtab (so for people running patched 2.2 kernels, this will only work on kernels 2.2.10 and later).
The scenario is:
- A controller dies and takes two disks offline at the same time,
- All disks on one SCSI bus can no longer be reached if a disk dies,
- A cable comes loose...
In short: quite often you get a temporary failure of several disks at once; afterwards the RAID superblocks are out of sync and you can no longer init your RAID array.
If using mdadm, you could first try to run:
mdadm --assemble --force
If not, there's one thing left: rewrite the RAID superblocks with mkraid --force.
To get this to work, you'll need an up-to-date /etc/raidtab. If it doesn't EXACTLY match the devices and ordering of the original disks, this will not work as expected and will most likely completely obliterate whatever data you used to have on your disks.
Look at the syslog produced by trying to start the array, and you'll see the event count for each superblock; usually it's best to leave out the disk with the lowest event count, i.e. the oldest one.
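As a small illustration of that choice, here is a Python sketch that picks the disk to leave out. The device names and event counts are invented, and the helper name stalest_device is hypothetical; you would copy the real numbers from your own syslog by hand.

```python
def stalest_device(event_counts):
    """Return the device whose superblock has the lowest event count."""
    return min(event_counts, key=event_counts.get)

# Hypothetical values, as they might be read from the syslog messages.
event_counts = {
    "/dev/sda1": 412,
    "/dev/sdb1": 412,
    "/dev/sdc1": 398,  # fell behind the others: the oldest superblock
    "/dev/sdd1": 411,
}

leave_out = stalest_device(event_counts)
print("candidate for failed-disk:", leave_out)
```

This only automates the comparison; deciding whether the result makes sense for your array is still up to you.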
If you mkraid without failed-disk, the recovery thread will kick in immediately and start rebuilding the parity blocks - not necessarily what you want at that moment.
With failed-disk you can specify exactly which disks you want to be active and perhaps try different combinations for best results. BTW, only mount the filesystem read-only while trying this out... This has been successfully used by at least two guys I've been in contact with.
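The device entries for one such attempt can be sketched as follows. This Python snippet is only an illustration: the device names are hypothetical, the helper raidtab_devices is not part of raidtools, and it emits only the device lines. The rest of the raiddev stanza (raid level, chunk size and so on) is not generated here and must still match the original array exactly.

```python
def raidtab_devices(devices, failed):
    """Render device entries for /etc/raidtab.

    devices: device names in their ORIGINAL array order.
    failed:  set of device names to mark with failed-disk.
    """
    lines = []
    for index, dev in enumerate(devices):
        keyword = "failed-disk" if dev in failed else "raid-disk"
        lines.append(f"    device {dev}")
        lines.append(f"    {keyword} {index}")
    return "\n".join(lines)

# Hypothetical four-disk array; /dev/sdc1 is the one we leave out.
devices = ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]
entries = raidtab_devices(devices, failed={"/dev/sdc1"})
print(entries)
```

Trying a different combination then amounts to changing the failed set and regenerating the entries, while keeping the device order untouched.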
Recovery and resync
The following is a recollection of what Neil Brown and others have written on the linux-raid mailing list.
"resync" and "recovery" are handled very differently in raid10. "check" and "repair" are special cases of "resync".
The purpose of the recovery process is to fill a new disk with the relevant information from a running array.
The assumption is that all data on the new disk needs to be written, and that the other data on the running array is correct.
"recovery" walks addresses from the start to the end of the component drives. Thus only data for the specific component drive is addressed.
At each address, it considers each drive which is being recovered and finds a place on a different device to read the block for the current (drive,address) from. It schedules a read and when the read request completes it schedules the write.
On an f2 layout, this will read one drive from halfway to the end, then from the start to halfway, and will write the other drive sequentially.
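The read pattern described above can be reproduced with a toy model. The Python sketch below assumes a simplified two-drive f2 layout in which the first half of each drive holds a striped copy of the data and the second half holds the other drive's chunks; the function names and the tiny drive size are made up for illustration and this is not the kernel's code.

```python
def f2_layout(half):
    """Toy 2-drive far-2 layout: for each drive, the array chunk stored at each slot.

    The first `half` slots of each drive hold the striped data, the second
    `half` slots hold the copy of the chunks that live on the other drive.
    """
    dev0 = [2 * i for i in range(half)] + [2 * i + 1 for i in range(half)]
    dev1 = [2 * i + 1 for i in range(half)] + [2 * i for i in range(half)]
    return [dev0, dev1]

def recovery_reads(layout, target):
    """Walk the target drive from start to end; return where each read lands
    on the surviving drive."""
    source = 1 - target
    return [layout[source].index(chunk) for chunk in layout[target]]

layout = f2_layout(4)                  # 8 chunk slots per drive
reads = recovery_reads(layout, target=0)
writes = list(range(len(layout[0])))   # the new drive is written sequentially
print("reads :", reads)
print("writes:", writes)
```

In this model the reads on the surviving drive run from the middle to the end and then from the start to the middle, while the writes are strictly sequential.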
The purpose of resync is to ensure that all data on the array is synchronized.
There is an assumption that most, if not all, of the data is already OK.
For raid10 "resync" walks the addresses from the start to end of the array. (For all other raid types "resync" follows the component drives).
At each address it reads every device block which stores that array block. When all the reads complete the results are compared. If they are not all the same, the "first" block is written out to the others.
Here "first" means (I think) the block with the earliest device address, and if there are several of those, the block with the least device index.
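The compare-and-rewrite step can be sketched like this in Python. It is a model of the description above, not the md driver's code: the record layout, the name resync_block and the tie-breaking rule (which follows the guess about what "first" means) are all assumptions.

```python
def resync_block(copies):
    """Decide which writes a resync would issue for one array block.

    copies: list of dicts with 'dev' (device index), 'addr' (device address)
            and 'data' (the bytes read from that copy).
    Returns a list of (dev, addr, data) writes; empty if all copies agree.
    """
    # "first" = earliest device address, ties broken by lowest device index.
    first = min(copies, key=lambda c: (c["addr"], c["dev"]))
    if all(c["data"] == first["data"] for c in copies):
        return []  # everything matches, nothing to do
    # Otherwise the "first" copy is written out to all the others.
    return [(c["dev"], c["addr"], first["data"]) for c in copies if c is not first]

# Hypothetical block with two copies that disagree.
copies = [
    {"dev": 0, "addr": 4, "data": b"new"},
    {"dev": 1, "addr": 0, "data": b"old"},
]
writes = resync_block(copies)
print(writes)
```

Note that the copy which wins is chosen by position only, not by any notion of which data is newer or more likely to be correct.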
So for f2, this will read from both the start and the middle of both devices. It will read 64K (the chunk size) at a time, so you should get at least a 32K read at each position before a seek (more with a larger chunk size).
Clearly this won't be fast.
The reason this algorithm was chosen was that it makes sense for every possible raid10 layout, even though it might not be optimal for some of them.
Were I to try to make it fast for f2, I would probably shuffle the bits in each request so that it did all the 'odd' chunks first, then all the even chunks. For example, map the first sequence below to the second:
0 1 2 3 4 5 6 7 8 ...
0 1 4 5 8 9 ..... 2 3 6 7 10 11 ....
(assuming a chunk size of '2').
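The mapping shown above can be generated with a few lines of Python. This is only a sketch of the reordering idea; the function name reordered and the block count are made up, and nothing like this exists in the md driver as far as the text says.

```python
def reordered(blocks, chunk):
    """Return block numbers 0..blocks-1 with every other chunk visited first.

    The first, third, fifth, ... chunks come first, followed by the
    second, fourth, sixth, ... chunks.
    """
    chunks = [list(range(start, min(start + chunk, blocks)))
              for start in range(0, blocks, chunk)]
    first_pass = chunks[0::2]
    second_pass = chunks[1::2]
    return [block for c in first_pass + second_pass for block in c]

order = reordered(12, chunk=2)
print(order)
```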
The problem with this is that if you shut down while partway through a resync, and then boot into a kernel which uses a different sequence, it would finish the resync checking the wrong blocks. This is annoying but should not be insurmountable.
This way we leave the basic algorithm the same, but introduce variations in the sequence for different specific layouts.
Another idea would be to read a number of chunks from one part of the f2 mirror, say 10 MB, and then read the corresponding 10 MB from the other half of the f2 array. With current disk technology (80 MB/s), this would mean 125 ms spent reading, and then 8 ms spent moving the heads.
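The arithmetic behind these figures is easy to check. The Python sketch below uses the numbers from the text (a 10 MB window, 80 MB/s, 8 ms per head move); the function name and the idea of expressing the result as a fraction of time spent transferring are additions for illustration.

```python
def read_time_ms(window_mb, throughput_mb_per_s):
    """Time in milliseconds to read window_mb at the given sequential rate."""
    return window_mb / throughput_mb_per_s * 1000.0

read_ms = read_time_ms(10, 80)   # 10 MB window at 80 MB/s
seek_ms = 8.0                    # head movement, figure taken from the text
transfer_fraction = read_ms / (read_ms + seek_ms)

print(f"{read_ms:.0f} ms reading, {seek_ms:.0f} ms seeking, "
      f"{transfer_fraction:.0%} of the time spent transferring data")
```

With a smaller window the seek time would weigh more heavily, which is the reason for reading a fairly large amount before moving the heads.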
raid1 does resync simply by reading one device and writing all the others, and this is conceptually easiest.
When repairing, there is no "good" block - if they are different, then all are wrong. md/raid just tries to return a consistent value, and leave it up to the filesystem to find and correct any errors. md/raid does not try to take advantage of information on failed CRC on disk hardware, should that info be available to the kernel.
If any inconsistency is found during a resync of raid4/5/6 the parity blocks are changed to remove the inconsistency. This may not be "right", but it is least likely to be "wrong".
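For the single-parity case (raid4/5), where the parity block is the XOR of the data blocks in a stripe, the idea can be sketched in Python as below. This is an illustration under that assumption only: the function names are made up, the stripe is a toy, and the second syndrome used by raid6 is not modeled.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def resync_stripe(data_blocks, stored_parity):
    """Return (parity_to_keep, was_inconsistent) for one stripe.

    The data blocks are trusted as they are; only the parity is changed
    so that the stripe becomes consistent again.
    """
    expected = xor_blocks(data_blocks)
    return expected, expected != stored_parity

# Toy stripe with three data blocks and a stale parity block.
data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
parity, fixed = resync_stripe(data, stored_parity=b"\x00\x00")
print(parity.hex(), fixed)
```

The sketch never touches the data blocks, which mirrors the statement that the parity is changed to remove the inconsistency.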