Assemble Run

From Linux Raid Wiki
Revision as of 19:54, 6 November 2016 by Anthony Youngman (Talk | contribs)

Back to Replacing a failed drive Forward to Recovering a damaged RAID

Has the raid really failed?

You are now looking at the serious possibility of data loss. But don't lose hope yet. Do you know the layout of your array? What drives are missing? Why?

dmesg is your friend here. If a drive has failed, it should contain plenty of messages explaining why. If the system failed at boot, it will show why the array would not assemble.
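A quick way to pull the relevant kernel messages out of the noise. This is just a sketch - the grep patterns and device names are examples, so adjust them to match your hardware:

```shell
# Show kernel messages about disks, ata links and md (raid) activity.
# Widen or narrow the patterns to suit your system.
dmesg | grep -iE 'md[0-9]|raid|ata[0-9]|sd[a-z]|error|fail'
```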

This is the point at which you post all the information that you have collected to the mailing list and wait for the experts to help. In a recent thread, a poster had one drive left from a 3-drive raid 5. It turned out that something had stomped on the disk GPT and there was nothing wrong with the raid itself. This had a very happy ending.

Check the drive event count

For all the partitions in the array, check the event count from mdadm --examine. For an N-drive raid-5, you need N-1 drives. All being well, you will have N-2 drives with the same event count, and one with a slightly lower event count. The last drive probably failed ages ago and will have a much lower event count. The bigger the discrepancy in the event count, the bigger the data loss will be.
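You can compare the event counts across all the members in one pass. /dev/sd[abc]1 below is a placeholder for your actual member partitions:

```shell
# Print the event count (and last update time) for each array member.
# Replace /dev/sd[abc]1 with your real member partitions.
for dev in /dev/sd[abc]1; do
    echo "== $dev =="
    mdadm --examine "$dev" | grep -E 'Event|Update Time'
done
```

The Update Time lines also tell you which drive dropped out first.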

Check the smartctl stats for all the drives. Are they dying? Are they desktop drives? Have you fixed the timeout problem?

If you haven't fixed the timeout mismatch, do it now!
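The usual fix, in outline (sdX is a placeholder; run it for each drive in the array):

```shell
# On drives that support SCT ERC, cap the drive's internal error
# recovery at 7 seconds (the value is in tenths of a second):
smartctl -l scterc,70,70 /dev/sdX

# On desktop drives that don't support SCT ERC, raise the kernel's
# command timeout instead so the drive gets time to recover:
echo 180 > /sys/block/sdX/device/timeout
```

Note that neither setting survives a reboot, so this needs to go in a boot script if you keep the drives.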

If only one or two drives are dying, do a brute force copy with ddrescue as per the previous page. If the copies are not successful, then it is a judgement call whether to repair the array or jump to the next section and copy the array. The problem with replacing a dying drive with an incomplete ddrescue copy is that the raid has no way of knowing which blocks failed to copy, and no way of reconstructing them even if it did. In other words, random blocks will return garbage (probably in the form of a block of nulls) in response to a read request.
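For reference, the ddrescue invocation looks something like this (the previous page has the full detail; both device names and the map file path are placeholders):

```shell
# Clone the dying drive onto a fresh one. The map file records progress
# so an interrupted copy can be resumed, and logs which sectors failed.
ddrescue -f /dev/sdX /dev/sdY /root/sdX.map
```

The map file tells you which sectors could not be read, but mdadm cannot use that information - those blocks are simply garbage on the copy.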

Either way, now forcibly assemble the array using the drives with the highest event count, and the drive that failed most recently, to bring the array up in degraded mode.

mdadm --assemble --force /dev/mdN /dev/sd[XYZ]1

If you are lucky, the missing writes are unimportant. If you are happy with the health of your drives, now add a new drive to restore redundancy.

mdadm /dev/mdN --add /dev/sdW1

and run a filesystem check (fsck) to find the inevitable corruption.
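It is worth doing the check in two passes, read-only first, so you can see the scale of the damage before letting fsck write anything. /dev/mdN is a placeholder, and the filesystem must be unmounted:

```shell
# Dry run: report problems without changing anything.
fsck -n /dev/mdN

# If the damage looks manageable, repair for real.
fsck /dev/mdN
```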

My drives are all faulty

Rebuilding an array is stressful. If all your drives are on the edge, there is no point asking for trouble and attempting a rebuild. Get new drives, create a new array, and copy the data from the old array to the new. Unless your drives are very full, it is probably less stressful to do a data copy than a ddrescue on all the drives.
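Creating the new array is the normal mdadm create; as a sketch only, with the level, device count and partitions as placeholders to match your setup:

```shell
# Example: build a new 3-drive raid-5 from the new disks,
# then put a filesystem on it.
mdadm --create /dev/md/newarray --level=5 --raid-devices=3 /dev/sd[def]1
mkfs.ext4 /dev/md/newarray
```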

Once you have the two arrays up

mount /dev/md/oldarray /mnt/old
mount /dev/md/newarray /mnt/new
rsync --archive /mnt/old/ /mnt/new/

Note the trailing slash on the source: it copies the contents of /mnt/old, including dot-files, which /mnt/old/* would miss.

All being well, you will know which files are corrupt because they will fail to copy, but you will still need to do an integrity check on your data. The new array itself will, of course, be fine.
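One way to check that the copy at least matches the source - this only proves the two trees agree, so genuine integrity checking still needs known-good checksums or backups. The mount points are the ones used above:

```shell
# Checksum every file on both sides and compare the lists.
(cd /mnt/old && find . -type f -exec sha256sum {} + | sort) > /tmp/old.sums
(cd /mnt/new && find . -type f -exec sha256sum {} + | sort) > /tmp/new.sums
diff /tmp/old.sums /tmp/new.sums
```

diff exits silently if the trees match, and lists the differing files otherwise.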
