From Linux Raid Wiki
Back to Why RAID? Forward to Hardware issues


Software RAID devices are so-called "block" devices, like ordinary disks or disk partitions. A RAID device is "built" from a number of other block devices - for example, a RAID-1 could be built from two ordinary disks, or from two disk partitions (on separate disks - please see the description of RAID-1 for details on this).

(Building a RAID array directly on a whole, unpartitioned disk is not recommended. RAID itself has no problem with this, but some disk utilities assume that a drive without a GPT or MBR is blank and will happily stomp all over it.)

There are no other special requirements for the devices from which you build your RAID devices, which gives you a lot of freedom in designing your RAID solution. For example, you can build a RAID from a mix of SATA, network and other RAID devices (this is useful for RAID-1+0, where you construct two RAID-1 devices from ordinary disks and then construct a RAID-0 device from those two RAID-1 devices). It is not advisable to use USB devices, however, as these go to sleep and interact badly with the RAID code.
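As a rough illustration of the nesting described above (two mirrors combined into one stripe), the following sketch only builds the mdadm command lines and prints them; it runs nothing. The device names (/dev/sda1 to /dev/sdd1, /dev/md0 to /dev/md2) and the helper names are hypothetical, chosen for the example.

```python
def mdadm_create(md_device, level, members):
    """Return the argv for creating one array; nothing is executed."""
    return [
        "mdadm", "--create", md_device,
        f"--level={level}",
        f"--raid-devices={len(members)}",
        *members,
    ]


def nested_raid_commands():
    # Hypothetical partitions on four separate disks.
    mirror_a = mdadm_create("/dev/md0", 1, ["/dev/sda1", "/dev/sdb1"])
    mirror_b = mdadm_create("/dev/md1", 1, ["/dev/sdc1", "/dev/sdd1"])
    # The two mirrors are block devices themselves, so they can be
    # used as the members of a third (striped) array.
    stripe = mdadm_create("/dev/md2", 0, ["/dev/md0", "/dev/md1"])
    return [mirror_a, mirror_b, stripe]


if __name__ == "__main__":
    for argv in nested_raid_commands():
        print(" ".join(argv))
```

The point of the sketch is that the last command treats /dev/md0 and /dev/md1 exactly like any other block device.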

Therefore, in the following text, we will use the word "device" as meaning "disk", "partition", or even "RAID device". A "device" in the following text simply refers to a "Linux block device". It could be anything from a SATA disk to a network block device. We will commonly refer to these "devices" simply as "disks", because that is what they will be in the common case.

However, there are several roles that devices can play in your arrays. A device could be a "spare disk", it could have failed and thus be a "faulty disk", or it could be a normally working and fully functional device actively used by the array.

The following sections describe two special types of device, namely "spare disks" and "faulty disks".

Do not confuse faulty disks with the FAULTY RAID level, which is a special debugging level of RAID that uses a normal device and simulates faults.

Spare disks

Spare disks (often called hot spares) are disks that do not take part in the RAID set until one of the active disks fails. When a device failure is detected, that device is marked as "faulty" and reconstruction is immediately started on the first spare disk available.

Spare disks thus add extra safety, especially to RAID-5 systems that are physically hard to get to. The system can be left running for some time with a faulty device, since the spare disk takes the place of the faulty device and, once reconstruction completes, all redundancy is restored.
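A minimal sketch of how an array with a hot spare might be requested from mdadm is shown here. It only assembles the command line; the device names and the function name are hypothetical. It relies on mdadm's --spare-devices option, with the spares listed after the active members.

```python
def create_with_spares(md_device, level, active, spares):
    """Build (but do not run) an mdadm command for an array with hot spares."""
    return [
        "mdadm", "--create", md_device,
        f"--level={level}",
        f"--raid-devices={len(active)}",
        f"--spare-devices={len(spares)}",
        # Active members first, then the spares.
        *active,
        *spares,
    ]


if __name__ == "__main__":
    # Hypothetical RAID-5 over three partitions plus one spare.
    argv = create_with_spares(
        "/dev/md0", 5,
        active=["/dev/sda1", "/dev/sdb1", "/dev/sdc1"],
        spares=["/dev/sdd1"],
    )
    print(" ".join(argv))
```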

It is also possible to have spare disks spin down to save energy; the spin-up time for these warm spares is insignificant compared to the resync time.

You cannot be sure that your system will keep running after a disk crash, though. The RAID layer should handle device failures just fine, but SCSI drivers could have broken error handling, or the IDE chipset could lock up, or a lot of other things could happen.

Also, once reconstruction onto a hot spare begins, the RAID layer starts reading from all the other disks to re-create the redundant information. If several disks have built up bad blocks over time, the reconstruction itself can trigger a failure on one of the "good" disks. This can lead to a complete RAID failure, and is the major reason for using RAID-6 in preference to RAID-5 with a hot spare. Indeed, with the wrong sort of disk it commonly leads to a complete RAID failure. (It is usually possible to recover from this situation, however.)
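To get a feel for why a rebuild is risky, here is a deliberately crude back-of-the-envelope model. It is not a description of how real disks fail. It assumes every bit read during the rebuild fails independently with a fixed probability. The default rate of one error per 10^14 bits and the disk sizes are assumptions made for the example, and the function name is hypothetical.

```python
import math


def rebuild_read_error_probability(surviving_disks, disk_bytes,
                                   bit_error_rate=1e-14):
    """Chance of at least one read error while reading every surviving disk.

    Assumes independent errors at a fixed per-bit rate (an assumption,
    not a measured figure).
    """
    bits_read = surviving_disks * disk_bytes * 8
    # 1 - (1 - p)**bits, written with log1p/expm1 so that a tiny p
    # does not get lost to floating-point rounding.
    return -math.expm1(bits_read * math.log1p(-bit_error_rate))


if __name__ == "__main__":
    # Hypothetical 4-disk RAID-5 of 4 TB disks: a rebuild reads the 3 survivors.
    print(rebuild_read_error_probability(3, 4 * 10**12))
```

Under these assumptions the result grows quickly with disk size. With a second parity block (RAID-6), a single read error during the rebuild can still be corrected.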

If you do frequent backups of the entire filesystem on the RAID array, or scrub the array regularly, then it is highly unlikely that you will ever get into this situation, because both read the disks regularly and so expose bad blocks while the array still has the redundancy to repair them. This is another very good reason for taking frequent backups. Remember, RAID is not a substitute for backups.
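One way a scrub can be requested is by writing "check" to the array's sync_action file in sysfs, as in the sketch below. The array name md0 is hypothetical, and the sysfs_root parameter exists only so that the function can be tried against a scratch directory. Writing to the real file requires root.

```python
from pathlib import Path


def request_scrub(md_name, sysfs_root="/sys/block"):
    """Ask the md driver to read-check the whole array (a scrub)."""
    action_file = Path(sysfs_root) / md_name / "md" / "sync_action"
    # "check" asks md to read all members and verify their consistency.
    action_file.write_text("check\n")
    return action_file


if __name__ == "__main__":
    # Hypothetical array name; run as root on a machine that has /dev/md0.
    print("scrub requested via", request_scrub("md0"))
```

Many distributions already schedule such a check periodically, so it is worth looking for an existing job before adding your own.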

Faulty disks

When the RAID layer handles a device failure correctly, the crashed disk is marked as faulty and reconstruction is immediately started on the first spare disk available. If no spare is available, the array runs in "degraded" mode.
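Degraded mode shows up in /proc/mdstat as a status such as [3/2] [UU_], meaning three devices are expected but only two are working. The following is a small sketch of a check for that. The sample line is made up for illustration and the function name is hypothetical.

```python
import re

# Matches the "[expected/working]" counter on an mdstat status line.
STATUS = re.compile(r"\[(\d+)/(\d+)\]")


def is_degraded(status_line):
    """Return True if fewer devices are working than the array expects."""
    match = STATUS.search(status_line)
    if match is None:
        raise ValueError("no [n/m] device counter found in line")
    expected, working = int(match.group(1)), int(match.group(2))
    return working < expected


if __name__ == "__main__":
    # Illustrative line, not taken from a real system.
    sample = "976630272 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]"
    print(is_degraded(sample))
```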

Faulty disks still appear and behave as members of the array. The RAID layer just avoids reading from or writing to them.
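Because faulty disks remain listed as members, the roles can be read from the array's line in /proc/mdstat, where (F) marks a faulty device and (S) a spare. The sketch below classifies the members of one such line. It is a simplification that treats every device without one of those two flags as active. The sample line and the function name are made up for the example.

```python
import re

# A member looks like "sdb1[1]" optionally followed by flags such as "(F)".
MEMBER = re.compile(
    r"(?P<name>[^\s\[]+)\[(?P<slot>\d+)\](?P<flags>(?:\([A-Z]\))*)"
)


def member_roles(mdstat_line):
    """Map each member device on an mdstat array line to a role."""
    roles = {}
    for match in MEMBER.finditer(mdstat_line):
        flags = match.group("flags")
        if "(F)" in flags:
            role = "faulty"
        elif "(S)" in flags:
            role = "spare"
        else:
            # Simplification: any other device is treated as active.
            role = "active"
        roles[match.group("name")] = role
    return roles


if __name__ == "__main__":
    # Illustrative line, not taken from a real system.
    line = "md0 : active raid5 sdd1[3](S) sdc1[2] sdb1[1](F) sda1[0]"
    print(member_roles(line))
```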

If a device needs to be removed from an array for any reason (e.g. proactive replacement because of SMART reports), it must be marked as faulty before it can be removed.
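The order matters, so a sketch of the two steps may help: first mark the member as faulty, then remove it. The code below only builds the mdadm command lines, using the --fail and --remove options. The device names and the function name are hypothetical.

```python
def retire_member_commands(md_device, member):
    """Return the two mdadm commands, in order, to take a member out."""
    return [
        # Step 1: mark the member as faulty so the array stops using it.
        ["mdadm", "--manage", md_device, "--fail", member],
        # Step 2: only now can the member be removed from the array.
        ["mdadm", "--manage", md_device, "--remove", member],
    ]


if __name__ == "__main__":
    for argv in retire_member_commands("/dev/md0", "/dev/sdb1"):
        print(" ".join(argv))
```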

The section on Detecting, querying and testing provides more information.
