Replacing a failed drive

From Linux Raid Wiki

Can you add a new drive?

One thing that repeatedly crops up is people trying to rebuild an array when they have no spare SATA port for the new drive. Get an add-in card that provides extra SATA ports, or a USB disk cradle - preferably USB3, but USB2 will do. You can get a USB case instead, but a cradle that leaves the drive exposed is a lot easier if you have to switch several drives. You can't use USB drives as part of your array in normal use (USB interacts badly with raid), but for salvaging data off a drive you are replacing it is fine.

If you're forced to rebuild an array, you want as much of the original array in place as possible. If a drive has failed completely, then there's no problem just taking it out and sticking a new drive in, but if the array is in trouble, then you want to --replace a drive if possible, and you can't do that if you don't have a spare slot to add the drive.

Is your array still redundant?

You are running a three-disk mirror, or RAID6, I trust. And have a spare drive configured to take over when one fails?

This is the ideal scenario. When one drive fails, the RAID will seamlessly replace it with a spare, and life will carry on without the user noticing anything is wrong. Most of us don't have the hardware to support all that. Or, as has been told to the author on several occasions, the admins/operators were not monitoring the array and did not realise anything had gone wrong until (almost) too late.

If you are running an array you need to monitor it. Failed drives must be removed and replaced as soon as possible. If your array is still redundant, then just remove the failed device and replace it.

mdadm /dev/mdN [--fail /dev/sdx1] --remove /dev/sdx1 --add /dev/sdy1
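
To find out which arrays have actually lost a member, look at the status field in /proc/mdstat: a healthy array shows something like [UU], and each missing device shows up as an underscore, as in [U_]. The following is only a minimal sketch of that check; degraded_arrays is a made-up helper name, and it assumes the usual mdstat layout where the status field follows the line naming the array.

```shell
# List md arrays whose status field (e.g. [UU_]) shows a missing member.
# Argument: path to an mdstat-format file (defaults to /proc/mdstat).
# degraded_arrays is a hypothetical helper, not part of mdadm.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }          # remember which array is being described
        /\[[U_]+\]/ && name != "" {
            # an underscore in the status field means a missing device
            if (match($0, /\[[U_]*_[U_]*\]/)) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Run with no argument it reads /proc/mdstat and prints one array name per line; no output means no array is degraded. It is a quick manual check, not a substitute for proper monitoring.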

So you have no redundancy!

Remember, RAID is not a backup! If you lose redundancy, you need to take a backup! The very act of trying to recover an array is often enough to tip another drive over the edge and cause the entire array to fail. This is a very common scenario with desktop drives, so make sure you have read and understood the section on timeout mismatch!

You should ALWAYS attempt to restore redundancy before replacing a drive. If the array is redundant, a replace will just copy the dodgy drive, only stressing the old drives if it can't read the dodgy drive. If the array is not redundant, then adding a new drive will stress the entire array, as the new drive has to be recreated from the remaining old drives. This is true even for a mirror.

If you have set up a bitmap on your array, then even if you plan to replace the failed drive it is worth doing a re-add. With a bitmap, the raid will know which sectors need writing, and will recover the array for you. You can then consider doing a replace of the faulty drive.

mdadm /dev/mdN --re-add /dev/sdX1
mdadm /dev/mdN --add /dev/sdY1 --replace /dev/sdX1 --with /dev/sdY1

If possible, do not mount the partitions on the raid (or do it read-only) while the rebuild takes place, to avoid undue stress on the array.

If you do not have a bitmap then don't re-add the old drive unless it was a (now fixed!) timeout problem and you don't intend replacing the drive. Without a bitmap, the raid will treat it like a new drive and do a total rebuild.
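
If you are not sure whether an array has a bitmap, /proc/mdstat normally shows a "bitmap:" line underneath arrays that have one. A rough sketch of a check, assuming that layout; has_bitmap is a hypothetical helper name and not an mdadm feature:

```shell
# Succeed if the named array (e.g. md0) has a "bitmap:" line in mdstat.
# $1 = array name as shown in mdstat, $2 = mdstat-format file (default /proc/mdstat).
# has_bitmap is a hypothetical helper, not part of mdadm.
has_bitmap() {
    awk -v want="$1" '
        /^md[^ ]* :/ { current = $1 }               # track which array we are inside
        current == want && /bitmap:/ { found = 1 }  # bitmap line belongs to that array
        END { exit !found }                         # exit status 0 only if one was seen
    ' "${2:-/proc/mdstat}"
}
```

For example, `has_bitmap md0 && echo "re-add should only resync the changed areas"`.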

It is quite likely that the array will crash with another drive having problems - hence the advice not to allow user writes to the array while it's rebuilding. If this happens, then stop and reassemble the array.

mdadm --stop /dev/mdN
mdadm --assemble --force /dev/mdN /dev/sd[XYZ]1

All being well, the rebuild will continue.
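
You can confirm that the rebuild really has picked up again by looking for the recovery (or resync) line in /proc/mdstat. A small sketch that pulls out just the percentage, assuming the usual "recovery = NN.N%" progress line; rebuild_progress is a made-up name for illustration:

```shell
# Print "<array> <percent>" for every array that is currently rebuilding.
# Argument: mdstat-format file (defaults to /proc/mdstat).
# rebuild_progress is a hypothetical helper, not part of mdadm.
rebuild_progress() {
    awk '
        /^md[^ ]* :/ { name = $1 }             # array the following lines belong to
        /(recovery|resync) *=/ {
            # the percentage is the only field on the progress line ending in %
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) print name, $i
        }
    ' "${1:-/proc/mdstat}"
}
```

No output means no array is rebuilding, which is either good news (it finished) or bad news (it stopped), so check the array state as well.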

ddrescue or rebuild?

If you can afford downtime, are worried about the health of your other drives, and intend to replace several of them, it is probably easier to do a brute-force disk copy. mdadm doesn't care where a partition is, it just cares that it has the partition and the superblock to enable it to assemble the array. You do have a spare slot you can plug a drive in while you're replacing it?

For each drive that you're planning to replace, put the new drive into the system on a SATA port. If you don't have any spare SATA ports, swap it out with the old drive and put the old drive in your USB cradle. If the new drive is the same size as the old one or bigger, you can just copy everything if you want to:

ddrescue -f /dev/sdX /dev/sdY sdX.map

MAKE SURE you get the drive designations right! Double check, then double check again!

Or you can partition the new drive with your favourite program, be it fdisk, gdisk, parted or whatever, and then copy the drive partition by partition.

ddrescue -f /dev/sdXn /dev/sdYn sdXn.map

So long as the output device is at least as large as the input device, everything will be fine. If you have surplus space, you can then use the appropriate tools to bring that into use.
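
Both of the warnings above (get the designations right, and make sure the target is big enough) can be checked by a script before anything is overwritten. This is only a sketch: size_of and safe_to_copy are made-up helper names, and it relies on `blockdev --getsize64` from util-linux for real devices (which normally needs root), falling back to the file size so it can be tried out on ordinary files first.

```shell
# Print the size in bytes of a block device or, for trying things out, a regular file.
# size_of and safe_to_copy are hypothetical helpers, not part of ddrescue or mdadm.
size_of() {
    if [ -b "$1" ]; then
        blockdev --getsize64 "$1"       # util-linux; normally needs root
    else
        wc -c < "$1" | tr -d ' '
    fi
}

# Succeed only if source and target differ and the target is at least as large.
safe_to_copy() {
    src=$1
    dst=$2
    if [ "$src" = "$dst" ]; then
        echo "source and target are the same: $src" >&2
        return 1
    fi
    if [ "$(size_of "$dst")" -lt "$(size_of "$src")" ]; then
        echo "$dst is smaller than $src" >&2
        return 1
    fi
    return 0
}
```

Something like `safe_to_copy /dev/sdX /dev/sdY && echo "sizes look sane"` before starting the copy costs nothing. It cannot tell you that sdX really is the old drive, so you still have to double check that yourself.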

IFF you have successfully copied all the drives, WITHOUT ERRORS, and installed the new drives into the system, your system should boot back up to the state it was in before. If that was a degraded state, you should be able to add back the copy of the faulty drive, knowing that the array consists of new healthy hardware and should recover without error.

This is why you use ddrescue, and not dd. ddrescue, while looking just like dd at a casual glance, is designed to recover data from failing hardware and has a big repertoire of tricks up its sleeve for working round errors. It also records which areas it failed to read in a log (the mapfile) which, while mdadm currently cannot make use of it, could be used by a utility (yet to be written :-( ) to recover the failed blocks from other disks in the array.
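
That log is also the easiest way to tell whether a copy really was WITHOUT ERRORS. The sketch below adds up every area of a GNU ddrescue mapfile that is not marked as finished ("+"). It assumes the usual mapfile layout (comment lines starting with #, one current-status line, then "pos size status" lines with hexadecimal values); unrescued_bytes is a made-up name, not something ddrescue provides.

```shell
# Print the number of bytes in a ddrescue mapfile that were NOT rescued.
# $1 = path to the mapfile. Assumes lines of the form "pos size status"
# with hexadecimal pos/size, after one current-status line.
# unrescued_bytes is a hypothetical helper, not part of ddrescue.
unrescued_bytes() {
    total=0
    seen_status=0
    while read -r pos size status _; do
        case $pos in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        if [ "$seen_status" -eq 0 ]; then
            seen_status=1                 # first data line is the current-status line
            continue
        fi
        # anything other than "+" (finished) counts as not rescued
        [ "$status" = "+" ] || total=$((total + $size))
    done < "$1"
    echo "$total"
}
```

A result of 0 means every area was read successfully; anything else is the amount of data you would be relying on the rest of the array to supply.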

If you cannot copy your drives without errors, you are now almost certainly into the realm of lost data, unless you can manage to do a successful rebuild instead.
