Assemble Run

Back to Replacing a failed drive Forward to Recovering a damaged RAID

But I don't think there's anything wrong?

A common reason an array won't run is that arrays are normally assembled at boot, and that assembly can fail. If the array is left partially assembled, most attempts to recover it are likely to fail with an error like "device X is in use". This is especially likely if the computer had a problem while you were doing some array management. You need to stop the array before trying anything else.

mdadm --stop /dev/mdN
mdadm --assemble /dev/mdN /dev/sd[XYZ]1
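
To see whether a half-assembled array is what is blocking you, look for arrays the kernel reports as inactive in /proc/mdstat. The following is a minimal sketch, assuming the usual "mdN : inactive ..." line format; the function name inactive_arrays is made up for this example.

```shell
# Print the names of md arrays that the kernel lists as inactive.
# Reads /proc/mdstat by default; another file can be passed as $1 for testing.
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }' "${1:-/proc/mdstat}"
}

# Example use (as root): stop every inactive array before retrying assembly.
#   for md in $(inactive_arrays); do mdadm --stop "/dev/$md"; done
```

Anything this prints is a candidate for mdadm --stop before you try to assemble again.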

This is an important point to remember: if you're having difficulty getting an array to run, always stop it between attempts so that you start with a clean slate every time. Provided you don't use the "--force" option, mdadm will refuse to do anything dangerous, so you can keep trying different variations for as long as they seem to improve matters.

Has the raid really failed?

You are now looking at the serious possibility of data loss. But don't lose hope yet. Do you know the layout of your array? What drives are missing? Why?

dmesg is your friend here. If a drive has failed, the kernel log should contain a load of messages saying why. If the system failed at boot, it will contain a load of messages saying why the array would not assemble.
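
The kernel log is noisy, so it can help to cut it down to the lines that mention md, raid, or a disk before posting it. A rough sketch follows; raid_messages is a hypothetical helper and the pattern is only a guess at what is relevant, so check the full log if it comes up empty.

```shell
# Keep only log lines that mention md/raid, an sdX disk or partition, or an ataN port.
# Reads stdin, so it works on live dmesg output or on a saved log file.
raid_messages() {
    grep -i -E '(^|[^a-z])(md[0-9]*|raid[0-9]*|sd[a-z]+[0-9]*|ata[0-9]+)([^a-z]|$)'
}

# Example use:
#   dmesg | raid_messages > raid-dmesg.txt
```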

This is the point at which you post all the information that you have collected to the mailing list and wait for the experts to help. In a recent thread, a poster had one drive left from a 3-drive raid 5. It turned out that something had stomped on the disk GPT and there was nothing wrong with the raid itself. This had a very happy ending.

Check the drive event count

For every partition in the array, check the event count reported by mdadm --examine. An N-drive raid-5 needs N-1 drives to run. All being well, you will have N-2 drives with the same event count, and one with a slightly lower event count. The last drive probably failed ages ago and will have a much lower event count. The bigger the discrepancy in the event counts, the bigger the data loss will be.
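
Collecting the counts by hand gets tedious with more than a few drives. The sketch below assumes the "Events : <n>" line that mdadm --examine prints for each member; event_counts is a name invented for this example.

```shell
# Print "<device> <event count>" for each member device given as an argument.
# Prints "unknown" when no Events line is found (e.g. no superblock on the device).
event_counts() {
    for dev in "$@"; do
        events=$(mdadm --examine "$dev" 2>/dev/null |
                 awk '$1 == "Events" && $2 == ":" { print $3; exit }')
        printf '%s %s\n' "$dev" "${events:-unknown}"
    done
}

# Example use (as root):
#   event_counts /dev/sd[XYZ]1
```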

Check the smartctl stats for all the drives. Are they dying? Are they desktop drives? Did you fix the timeout mismatch problem?

If you haven't fixed the timeout mismatch, do it now!
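
As a reminder of what that fix usually looks like: either the drive is told to give up on a bad sector quickly, or the kernel is told to wait longer than the drive does. The sketch below assumes the commonly used values of 7 seconds for the drive and 180 seconds for the kernel. set_kernel_timeout and SYSFS_ROOT are hypothetical names used only here, and neither setting survives a reboot.

```shell
# Raise the kernel's command timeout for one drive (e.g. "sdX") to $2 seconds.
# SYSFS_ROOT is a made-up override so the function can be tried on a scratch
# directory; it defaults to the real /sys.
set_kernel_timeout() {
    echo "$2" > "${SYSFS_ROOT:-/sys}/block/$1/device/timeout"
}

# Example use (as root), for a drive without error recovery control:
#   set_kernel_timeout sdX 180
# For a drive that does support it, limit its own recovery time to 7.0 seconds
# instead (the values are in tenths of a second):
#   smartctl -l scterc,70,70 /dev/sdX
```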

If only one or two drives are dying, do a brute-force copy with ddrescue as per the previous page. If the copies are not successful, then it is a judgement call whether to repair the array or jump to the next section and copy the array. The problem with replacing a dying drive with an incomplete ddrescue copy is that the raid has no way of knowing which blocks failed to copy, and no way of reconstructing them even if it did. In other words, random blocks will return garbage (probably in the form of a block of nulls) in response to a read request.

Either way, now forcibly assemble the array using the drives with the highest event count, and the drive that failed most recently, to bring the array up in degraded mode.

mdadm --force --assemble /dev/mdN /dev/sd[XYZ]1
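
Picking which members to hand to the forced assemble comes down to sorting them by event count and dropping the stalest. A small sketch, assuming you have already written down "device events" pairs from mdadm --examine; freshest_members is a hypothetical helper, not part of mdadm.

```shell
# Read "<device> <event count>" lines on stdin and print the devices with the
# highest counts, one per line. $1 is how many devices to keep (N-1 for raid-5).
freshest_members() {
    sort -k2,2nr | head -n "$1" | awk '{ print $1 }'
}

# Example use, for a 4-drive raid-5 with the pairs saved in events.txt:
#   freshest_members 3 < events.txt
```

Look at the counts yourself before trusting the result: if the device that just made the cut is far behind the others, expect more damage.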

If you are lucky, the missing writes are unimportant. If you are happy with the health of your drives, now add a new drive to restore redundancy.

mdadm /dev/mdN --add /dev/sdW1

Then run a filesystem check with fsck to try to find the inevitable corruption.

My drives are all faulty

Rebuilding an array is stressful. If all your drives are on the edge, there is no point asking for trouble and attempting a rebuild. Get new drives, create a new array, and copy the data from the old array to the new. Unless your drives are very full, it is probably less stressful to do a data copy than a ddrescue on all the drives.

Once you have both arrays up, mount them and copy the data across:

mount /dev/md/oldarray /mnt/old
mount /dev/md/newarray /mnt/new
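rsync --archive /mnt/old/ /mnt/new/

The trailing slash on /mnt/old/ copies the contents of the directory, including top-level hidden files that the shell glob /mnt/old/* would skip. Plain --archive does not preserve hard links, ACLs or extended attributes; add --hard-links, --acls and --xattrs if you need them.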

All being well, you will know which files are corrupt because they will fail to copy, but you will need to do an integrity check on your data. The new array will, of course, be fine.
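
One rough way to list what didn't make it across is to walk the old filesystem and compare each file with its copy. This is only a sketch: list_uncopied is a hypothetical helper, it assumes file names contain no newlines, and it only shows that the copy matches the old array, not that the old data was good in the first place.

```shell
# List regular files under $1 that are missing from, or differ in, $2.
# A read error on either side is reported as "differs".
list_uncopied() {
    old=$1
    new=$2
    (cd "$old" && find . -type f) | sort | while IFS= read -r f; do
        if [ ! -e "$new/$f" ]; then
            printf 'missing %s\n' "$f"
        elif ! cmp -s "$old/$f" "$new/$f"; then
            printf 'differs %s\n' "$f"
        fi
    done
}

# Example use:
#   list_uncopied /mnt/old /mnt/new
```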
