Devices

Software RAID devices are so-called "block" devices, like ordinary disks or disk partitions. A RAID device is "built" from a number of other block devices - for example, a RAID-1 could be built from two ordinary disks, or from two disk partitions (on separate disks - please see the description of RAID-1 for details on this).
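
A minimal sketch of the simplest case: building a RAID-1 device from two partitions with mdadm (the device names /dev/md0, /dev/sda1 and /dev/sdb1 are placeholders for whatever your system actually uses):

  # create a two-way mirror from two partitions on separate disks
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

  # watch the initial resync and inspect the result
  cat /proc/mdstat
  mdadm --detail /dev/md0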

There are no other special requirements for the devices from which you build your RAID devices - this gives you a lot of freedom in designing your RAID solution. For example, you can build a RAID from a mix of IDE and SCSI devices, and you can even build a RAID from other RAID devices (this is useful for RAID-0+1, where you simply construct two RAID-1 devices from ordinary disks, and finally construct a RAID-0 device from those two RAID-1 devices).
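
Continuing the sketch above, the RAID-0+1 case is just two create commands followed by a third that uses the resulting RAID-1 devices as members (all device names are again placeholders):

  # two RAID-1 mirrors built from ordinary partitions
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

  # a RAID-0 stripe built from the two RAID-1 devices
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/md1 /dev/md2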

Therefore, in the following text, we will use the word "device" to mean "disk", "partition", or even "RAID device". A "device" in the following text simply refers to a "Linux block device". It could be anything from a SCSI disk to a network block device. We will commonly refer to these "devices" simply as "disks", because that is what they will be in the common case.

However, there are several roles that devices can play in your arrays. A device could be a "spare disk", it could have failed and thus be a "faulty disk", or it could be a normally working and fully functional device actively used by the array.

In the following we describe two special types of devices, namely "spare disks" and "faulty disks".

It is worth mentioning the existence of the FAULTY RAID level - don't get confused - this is a special debugging level of RAID that uses a normal device and simulates faults.
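
If your kernel and mdadm include the faulty personality, such a device is created like any other array but with a single member; the available fault-injection modes are described in the md man page (this is only a sketch, and /dev/md9 and /dev/sdb1 are placeholders):

  # create a faulty-personality device over one ordinary partition
  mdadm --create /dev/md9 --level=faulty --raid-devices=1 /dev/sdb1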

Spare disks

Spare disks (often called hot spares) are disks that do not take part in the RAID set until one of the active disks fails. When a device failure is detected, that device is marked as "faulty" and reconstruction is immediately started on the first spare disk available.
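
As a sketch, a spare is simply declared at creation time with --spare-devices (placeholder device names); it sits idle until a failure occurs and reconstruction onto it starts:

  # three active devices plus one hot spare in a RAID-5 array
  mdadm --create /dev/md0 --level=5 --raid-devices=3 --spare-devices=1 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

  # /proc/mdstat normally shows the idle spare with an (S) marker
  cat /proc/mdstat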

Thus, spare disks add welcome extra safety, especially to RAID-5 systems that may be hard to get to physically. One can allow the system to run for some time with a faulty device, since the spare disk takes the place of the faulty device and all redundancy is restored.

It is also possible to have spare disks spin down to save energy; obviously the spin-up time for these warm spares is insignificant compared to the resync time.

You cannot be sure that your system will keep running after a disk crash, though. The RAID layer should handle device failures just fine, but the SCSI driver could have broken error handling, the IDE chipset could lock up, or any number of other things could go wrong.

Also, once reconstruction to a hot spare begins, the RAID layer will start reading from all the other disks to re-create the redundant information. If multiple disks have built up bad blocks over time, the reconstruction itself can actually trigger a failure on one of the "good" disks. This can lead to a complete RAID failure and is the major reason for using RAID-6 in preference to RAID-5 and a hot spare.
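
Following that reasoning, the same four disks could carry a RAID-6 array instead of RAID-5 plus a hot spare, so that any two devices may fail without losing the array (a sketch with placeholder names):

  # four active devices with double parity - survives two failures
  mdadm --create /dev/md0 --level=6 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1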

If you do frequent backups of the entire filesystem on the RAID array, then it is highly unlikely that you would ever get into this situation - this is another very good reason for taking frequent backups. Remember, RAID is not a substitute for backups.


Faulty disks

When the RAID layer handles a device failure cleanly, the crashed disk is marked as faulty, and reconstruction is immediately started on the first available spare disk. If no spare is available, the array runs in 'degraded' mode.
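
A failure can also be provoked by hand to observe this behaviour; in a degraded array, /proc/mdstat shows a '_' in place of a 'U' for the missing member (the output below is illustrative only, and the device names are placeholders):

  # mark a member of the array as faulty by hand
  mdadm /dev/md0 --fail /dev/sdb1

  # a degraded two-disk mirror then looks roughly like this:
  #   md0 : active raid1 sdb1[1](F) sda1[0]
  #         1048512 blocks [2/1] [U_]
  cat /proc/mdstat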

Faulty disks still appear and behave as members of the array. The RAID layer simply avoids reading from and writing to them.

If a device needs to be removed from an array for any reason (e.g. proactive replacement prompted by SMART reports), then it must be marked as faulty before it can be removed.
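
The usual replacement sequence is therefore fail, remove, add (again only a sketch; the device names are placeholders):

  mdadm /dev/md0 --fail   /dev/sdb1   # mark the member as faulty
  mdadm /dev/md0 --remove /dev/sdb1   # detach it from the array
  mdadm /dev/md0 --add    /dev/sdc1   # add the replacement; resync begins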

The section on Detecting, querying and testing provides more information.
