Replacing a failed drive

From Linux Raid Wiki
Revision as of 12:25, 24 September 2016

Back to Timeout Mismatch Forward to Reconstruction

Can you add a new drive?

One thing that repeatedly crops up is people trying to rebuild an array when they have no spare SATA slots for the new drive. Get an add-in card that adds extra SATA slots, or a USB disk cradle, preferably USB3, though USB2 will do. You can get a USB case instead, but a cradle that leaves the drive exposed is going to be a lot easier if you have to switch several drives.

If you're forced to rebuild an array, you want as much of the original array in place as possible. If a drive has failed completely, then there's no problem just taking it out and sticking a new drive in, but if the array is in trouble, then you want to --replace a drive if possible, and you can't do that if you don't have a spare slot to add the drive.

Is your array still redundant?

You are running a three-disk mirror, or RAID6, I trust. And have a spare drive configured to take over when one fails?

This is the ideal scenario. When one drive fails, the RAID will seamlessly replace it with a spare, and life will carry on without the user noticing anything is wrong. Most of us don't have the hardware to support all that. Or, as has been told to the author on several occasions, the admins/operators were not monitoring the array and did not realise anything had gone wrong until (almost) too late.

If you are running an array you need to monitor it. Failed drives must be removed and replaced as soon as possible. If your array is still redundant, then just remove the failed device and replace it.
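
As a minimal sketch of what monitoring could look like, the function below scans text in /proc/mdstat format and prints the name of every array whose status field contains an underscore (a missing member). The function name degraded_arrays is hypothetical, and the awk patterns assume the usual mdstat layout of an "mdN : ..." header line followed by a status such as [UU_]. mdadm also has its own monitor mode, which is the better choice for real alerting.

```shell
# Print the names of degraded arrays found in mdstat-formatted text on stdin.
# Assumption: each array starts with a line like "md0 : active raid1 ..."
# and its status line contains a field like [UU] or [U_].
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }            # remember the current array
        /\[[U_]*_[U_]*\]/ {                   # status field with a missing member
            if (name != "") { print name; name = "" }
        }
    '
}
```

On a live system it would be called as: degraded_arrays < /proc/mdstat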

mdadm /dev/mdX [--fail /dev/sdx1] --remove /dev/sdx1 --add /dev/sdy1

So you have no redundancy!

Remember, RAID is not a backup! If you lose redundancy, you need to take a backup! The act of trying to recover an array is often enough to tip another drive over the edge and cause the entire array to fail. This is a very common scenario with desktop drives, so make sure you have read and understood the section on timeout mismatch!

If you have set up a bitmap on your array, then even if you plan to replace the failed drive it is worth doing a re-add. With a bitmap, the raid will know which sectors need writing, and will recover the array for you. You can then consider doing a replace of the faulty drive.

mdadm /dev/mdN --re-add /dev/sdX1
mdadm /dev/mdN --add /dev/sdY1 --replace /dev/sdX1 --with /dev/sdY1

If possible, do not mount the partitions on the raid (or do it read-only) while the rebuild takes place, to avoid undue stress on the array.

If you do not have a bitmap, then don't re-add the old drive unless the failure was a (now fixed!) timeout problem and you don't intend to replace the drive. Without a bitmap, the raid will treat it like a new drive and do a total rebuild.
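
The bitmap decision above can be sketched as a small check. This is an illustration only: the function names has_bitmap and suggest_recovery are hypothetical, the check assumes that an array with a bitmap shows an indented "bitmap:" line under its entry in /proc/mdstat, and the second function only prints a command to consider rather than running anything. It does not cover the timeout case, where a re-add without a bitmap may still be reasonable.

```shell
# Succeed if the named array (e.g. md0) has a "bitmap:" line in the
# mdstat-formatted text supplied on stdin.
has_bitmap() {
    awk -v want="$1" '
        /^md[^ ]* :/ { current = $1 }                 # entering a new array entry
        current == want && /^ *bitmap:/ { found = 1 } # bitmap line under that array
        END { exit !found }
    '
}

# Print (do not run) the command suggested by the text above.
# $1 = array name, $2 = old device, $3 = new device; mdstat text on stdin.
suggest_recovery() {
    if has_bitmap "$1"; then
        echo "mdadm /dev/$1 --re-add $2"   # bitmap: only changed sectors are rewritten
    else
        echo "mdadm /dev/$1 --add $3"      # no bitmap: a full rebuild either way
    fi
}
```

For example: suggest_recovery md0 /dev/sdX1 /dev/sdY1 < /proc/mdstat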

ddrescue or rebuild
