Software RAID devices are so-called "block" devices, like ordinary disks or disk partitions. A RAID device is "built" from a number of other block devices - for example, a RAID-1 could be built from two ordinary disks, or from two disk partitions (on separate disks - please see the description of RAID-1 for details on this).
There are no other special requirements to the devices from which you build your RAID devices - this gives you a lot of freedom in designing your RAID solution. For example, you can build a RAID from a mix of IDE and SCSI devices, and you can even build a RAID from other RAID devices (this is useful for RAID-0+1, where you simply construct two RAID-1 devices from ordinary disks, and finally construct a RAID-0 device from those two RAID-1 devices).
Therefore, in the following text, we will use the word "device" as meaning "disk", "partition", or even "RAID device". A "device" in the following text simply refers to a "Linux block device". It could be anything from a SCSI disk to a network block device. We will commonly refer to these "devices" simply as "disks", because that is what they will be in the common case.
However, there are several roles that devices can play in your arrays. A device could be a "spare disk", it could have failed and thus be a "faulty disk", or it could be a normally working and fully functional device actively used by the array.
In the following we describe two special types of devices; namely the "spare disks" and the "faulty disks".
It is worth mentioning the existence of the FAULTY RAID level - don't get confused - this is a special debugging level of RAID that uses a normal device and simulates faults.
Spare disks (often called hot spares) are disks that do not take part in the RAID set until one of the active disks fail. When a device failure is detected, that device is marked as "faulty" and reconstruction is immediately started on the first spare disk available.
Thus, spare disks add a nice extra safety to especially RAID-5 systems that perhaps are hard to get to (physically). One can allow the system to run for some time, with a faulty device, since the spare disk takes the place of the faulty device and all redundancy is restored.
It is also possible to have spare disks spin-down to save energy; obviously the spin-up time for these warm spares is insignificant compared to the resync time.
You cannot be sure that your system will keep running after a disk crash though. The RAID layer should handle device failures just fine, but SCSI drivers could be broken on error handling, or the IDE chipset could lock up, or a lot of other things could happen.
Also, once reconstruction to a hot-spare begins, the RAID layer will start reading from all the other disks to re-create the redundant information. If multiple disks have built up bad blocks over time, the reconstruction itself can actually trigger a failure on one of the "good" disks. This can lead to a complete RAID failure and is the major reason for using RAID-6 in preference to RAID-5 and a hot spare.
If you do frequent backups of the entire filesystem on the RAID array, then it is highly unlikely that you would ever get in this situation - this is another very good reason for taking frequent backups. Remember, RAID is not a substitute for backups.
When the RAID layer handles device failures just fine, crashed disks are marked as faulty, and reconstruction is immediately started on the first spare-disk available. If no spare is available then the array runs in 'degraded' mode.
Faulty disks still appear and behave as members of the array. The RAID layer just avoids reading/writing them.
If a device needs to be removed from an array for any reason (eg pro-active replacement due to SMART reports) then it must be marked as faulty before it can be removed.
The section on Detecting, querying and testing provides more information.