Protecting against a failing disk

From Linux Raid Wiki

The following describes how to prepare a system to survive the failure of one disk. This can be important for a server that is intended to run at all times. The description is mostly aimed at small servers, but it can also be used for workstations, to protect them against losing data and to keep them running even if a disk fails. Some recommendations on larger server setups are given at the end of the howto.

This requires some extra hardware, especially disks, and the description also touches on how to make the most of the disks, be it in terms of available disk space or input/output speed.

The text reflects work with 2.6.12 and 2.6.24 kernels, but may apply to kernels before and after these versions.


Creating partitions

We recommend creating partitions for /boot, root, swap and other file systems. This can be done by fdisk, parted or maybe a graphical interface like the Mandriva/PClinuxos harddrake2. It is recommended to use drives with equal sizes and performance characteristics. You can start with just making the layout of partitions of the first disk, set the type of each partition etc for this disk, and then copy the final layout to another disk with a single sfdisk command.

If we are using the 2 drives sda and sdb and we assume you have created partitions 1,2,3 and 5, then sfdisk may be used to make all the partitions into raid partitions (type fd):

  sfdisk -c /dev/sda 1 fd
  sfdisk -c /dev/sda 2 fd
  sfdisk -c /dev/sda 3 fd
  sfdisk -c /dev/sda 5 fd

Copy the final layout of sda to sdb:

  sfdisk -d /dev/sda | sfdisk /dev/sdb


Then list the partition tables of both disks to check the result:

  fdisk -l /dev/sda /dev/sdb

The partition layout could then look like this:

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1          37      297171   fd  Linux raid autodetect
/dev/sda2              38        1132     8795587+  fd  Linux raid autodetect
/dev/sda3            1133        1619     3911827+  fd  Linux raid autodetect
/dev/sda4            1620      121601   963755415    5  Extended
/dev/sda5            1620      121601   963755383+  fd  Linux raid autodetect
Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1          37      297171   fd  Linux raid autodetect
/dev/sdb2              38        1132     8795587+  fd  Linux raid autodetect
/dev/sdb3            1133        1619     3911827+  fd  Linux raid autodetect
/dev/sdb4            1620      121601   963755415    5  Extended
/dev/sdb5            1620      121601   963755383+  fd  Linux raid autodetect
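
The copy made with sfdisk above can also be checked mechanically rather than by eye. The following is a minimal sketch, assuming the two layouts have been dumped to files with "sfdisk -d"; the function names normalize_dump and same_layout are made up for this example and are not part of sfdisk:

```shell
# Print a partition dump with the device name replaced by a placeholder,
# so that dumps taken from two different disks can be compared directly.
# $1: device name as it appears in the dump (e.g. /dev/sda); dump on stdin.
normalize_dump() {
    sed -e "s|$1|DISK|g"
}

# Return 0 if the two dump files describe the same partition layout.
# $1, $2: device names; $3, $4: files holding the output of "sfdisk -d".
same_layout() {
    a=$(normalize_dump "$1" < "$3")
    b=$(normalize_dump "$2" < "$4")
    [ "$a" = "$b" ]
}
```

A possible use, with the dump file names chosen freely:

  sfdisk -d /dev/sda > /tmp/sda.dump
  sfdisk -d /dev/sdb > /tmp/sdb.dump
  same_layout /dev/sda /dev/sdb /tmp/sda.dump /tmp/sdb.dump && echo "layouts match"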

Prepare for boot

The system should be set up to boot from multiple devices, so that if one disk fails, the system can boot from another disk.

NOTE: if the first disk fails, some BIOSes will consider the 2nd as hdc, while others will use the physical location. SATA drives may also be "moved", and udev may apply interesting and unintuitive names to the devices in these cases. Use of the "UUID" notation to identify raid array members is therefore desirable.

Some BIOSes will not allow booting from more than one disk. If this is the case, you may physically move the second disk to the cable of the first disk.

The BIOS may need to be set up to boot from multiple drives, including the case where one of the disks is not present. It is recommended that you test your setup once it is completed, and see if it still works when you disconnect power from each of the disks involved in turn. Only do this with power to the whole computer shut off.

On Intel-compatible hardware, there are two common boot loaders, grub and lilo. Both grub and lilo can only boot off a raid1 device; they cannot boot off any other software raid device type. The reason they can boot off a raid1 is that they see the raid1 as a normal disk and then use only one of the disks when booting. The boot stage only involves loading the kernel with an initrd image as its root file system, so not much data is needed for this. The kernel, the initrd and other boot files can be put in a small /boot partition. We recommend something like 200 MB on an ext3 raid1.

Make the raid1 and ext3 filesystem:

  mdadm --create /dev/md1 --chunk=256 -R -l 1 -n 2 /dev/sda1 /dev/sdb1
  mke2fs -j /dev/md1


Remember to have the root entry in /etc/lilo.conf point to the array that holds the root file system (root=/dev/md2 in the layout used here), and to have /dev/md1 mounted on /boot and /dev/md2 mounted on / in /etc/fstab.

Make each of the disks bootable by lilo:

  lilo -b /dev/sda
  lilo -b /dev/sdb

Do this with /dev/md1 mounted on /boot, and with the kernel, initrd etc. moved to the /boot file system. This method also allows us to use a raid type for the root file system that is not necessarily raid1.


Make each of the disks bootable by grub

Grub has a naming scheme where it calls the first disk hd0, also for disks that are known in the file system as /dev/sda and the like. hd0 is always the disk that grub boots from, therefore it is necessary to remap each of the alternate boot devices to hd0 via the 'device' command, as in the following script.

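Here is a script to install grub on the Master Boot Record (MBR) of 2 disks, sda and sdb. Grub counts partitions from 0, so (hd0,0) is the first partition, which holds /boot in the layout above:

  grub --no-floppy <<EOF
  root (hd0,0)
  setup (hd0)
  device (hd0) /dev/sdb
  root (hd0,0)
  setup (hd0)
  EOF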
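
With more than two disks, the same sequence of grub commands repeats for every disk. The following is a minimal sketch that only prints the commands, so they can be reviewed before they are fed to grub. The function name grub_batch is made up for this example, and the partition index of /boot is passed as a parameter because it depends on your layout:

```shell
# Print grub (legacy) commands that install the boot loader on every
# disk given. Each disk is mapped to hd0 in turn, because hd0 is the
# disk that grub boots from.
# $1: grub partition index of /boot (grub counts partitions from 0)
# remaining arguments: the disks, e.g. /dev/sda /dev/sdb
grub_batch() {
    part=$1
    shift
    for disk in "$@"; do
        printf 'device (hd0) %s\n' "$disk"
        printf 'root (hd0,%s)\n' "$part"
        printf 'setup (hd0)\n'
    done
}
```

Once the printed commands look right, they could be piped into grub, for example:

  grub_batch 0 /dev/sda /dev/sdb | grub --no-floppy --batch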

The root file system

Note that booting the root file system requires that the initrd file system has the required drivers for the SATA disk controllers and the raid types, and this is not standard for all initrd files, even for some of the biggest distributions. You may need to build your own initrd.

Here is an initrd 'init' script modified from a file generated with mkinitrd. sata_nv and sata_sil are the drivers for two SATA controllers:

 echo "Loading linear.ko module"
 insmod /lib/linear.ko
 echo "Loading multipath.ko module"
 insmod /lib/multipath.ko
 echo "Loading raid10.ko module"
 insmod /lib/raid10.ko
 echo "Loading xor.ko module"
 insmod /lib/xor.ko
 echo "Loading raid5.ko module"
 insmod /lib/raid5.ko
 echo "Loading raid6.ko module"
 insmod /lib/raid6.ko
 echo "Loading jbd.ko module"
 insmod /lib/jbd.ko
 echo "Loading ext3.ko module"
 insmod /lib/ext3.ko
 echo "Loading scsi_mod.ko module"
 insmod /lib/scsi_mod.ko
 echo "Loading sd_mod.ko module"
 insmod /lib/sd_mod.ko
 echo "Loading libata.ko module"
 insmod /lib/libata.ko
 echo "Loading sata_nv.ko module"
 insmod /lib/sata_nv.ko
 echo "Loading sata_sil.ko module"
 insmod /lib/sata_sil.ko
 echo Mounting /proc filesystem
 mount -t proc /proc /proc
 echo Mounting sysfs
 mount -t sysfs none /sys
 echo Creating device files
 mountdev size=32M,mode=0755
 echo -n /sbin/hotplug > /proc/sys/kernel/hotplug
 mkdir /dev/.udevdb
 mkdevices /dev
 echo Activating md devices
 mknod /dev/md2 b 9 2
 mknod /dev/md/2 b 9 2
 mknod /dev/md1 b 9 1
 mknod /dev/md/1 b 9 1
 mknod /dev/md3 b 9 3
 mknod /dev/md/3 b 9 3
 mknod /dev/md4 b 9 4
 mknod /dev/md/4 b 9 4
 echo Creating root device
 mkrootdev /dev/root
 echo Mounting root filesystem /dev/root with flags defaults,noatime
 mount -o defaults,noatime --ro -t ext3 /dev/root /sysroot
 echo Switching to new root
 switchroot --movedev /sysroot
 echo Initrd finished
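
Since a missing driver in the initrd leaves the system unable to mount its root file system, it can be worth checking the initrd contents before rebooting. The following is a minimal sketch, assuming you can produce a list of the files in your initrd (how to do that depends on how your distribution packs it) and that module files end in .ko. The function name missing_modules is made up for this example:

```shell
# Report which of the wanted kernel modules are absent from a file list.
# stdin: one path per line, e.g. a listing of the initrd contents
# arguments: module names without the .ko suffix
missing_modules() {
    list=$(cat)
    for mod in "$@"; do
        # Match the module file name at the end of a path, so that
        # raid1 does not match raid10.ko.
        if ! printf '%s\n' "$list" | grep -qE "(^|/)${mod}\.ko\$"; then
            printf '%s\n' "$mod"
        fi
    done
}
```

For an initrd that is a gzip-compressed cpio archive (an assumption, as is the file name used here), a check could look like this:

  zcat /boot/initrd.img | cpio -t | missing_modules raid1 raid10 ext3 sata_nv sata_sil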

The root file system can be on another raid than the /boot partition. We recommend a raid10,f2, as the root file system will mostly see reads, and the raid10,f2 raid type is the fastest for reads while also being sufficiently fast for writes. Other relevant raid types would be raid10,o2 or raid1.

It is recommended to use the udev file system, as this runs in RAM, and you can thus avoid a number of reads and writes to disk.

It is recommended that all file systems are mounted with the noatime option. This avoids writing to the file system inodes every time a file has been read.
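
To find file systems that are still mounted without noatime, the options column of /etc/fstab can be scanned. The following is a minimal sketch, assuming the usual fstab layout with the mount point in the second column, the type in the third and the options in the fourth. The function name missing_noatime and the list of skipped file system types are choices made for this example:

```shell
# List mount points in an fstab-format file that lack the noatime option.
# Comment lines, swap and kernel pseudo file systems are skipped.
# $1: path to the file (normally /etc/fstab)
missing_noatime() {
    awk '
        /^[[:space:]]*#/ || NF < 4 { next }
        $3 ~ /^(swap|proc|sysfs|tmpfs|devpts)$/ { next }
        # Wrap the option list in commas so noatime matches as a whole word.
        ("," $4 ",") !~ /,noatime,/ { print $2 }
    ' "$1"
}
```

Running "missing_noatime /etc/fstab" then prints one mount point per line for every entry that would still update access times.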

Make the raid10,f2 and ext3 filesystem:

  mdadm --create /dev/md2 --chunk=256 -R -l 10 -n 2 -p f2 /dev/sda2 /dev/sdb2
  mkfs -t ext3 /dev/md2

The swap file system

If a disk that processes are swapped to fails, all of these processes fail. These may be vital processes for the system, or vital jobs on the system. You can prevent the processes from failing by having the swap partitions on a raid. The swap area needed is normally relatively small compared to the overall disk space available, so we recommend the faster raid types over the more space-economic ones. The raid10,f2 type seems to be the fastest here; other relevant raid types could be raid10,o2 or raid1.

To create a raid array, and make the swap partition directly on it:

  mdadm --create /dev/md3 --chunk=256 -R -l 10 -n 2 -p f2 /dev/sda3 /dev/sdb3
  mkswap /dev/md3

In /etc/fstab you then need the following line:

  /dev/md3  swap swap defaults 0 0
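
After a reboot it is easy to overlook a leftover swap partition that is still used directly and not through a raid. The following is a minimal sketch that reads the kernel's list of active swap areas. The function name swap_not_on_md is made up for this example, and the check assumes that raid devices are named /dev/md..., so swap on another layer on top of a raid would be reported as well:

```shell
# Print active swap areas that are not on an md device.
# $1: file in /proc/swaps format (defaults to /proc/swaps)
swap_not_on_md() {
    # Skip the header line; column 1 is the device or file name.
    awk 'NR > 1 && $1 !~ /^\/dev\/md/ { print $1 }' "${1:-/proc/swaps}"
}
```

If the function prints nothing, all active swap areas are on md devices.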

WARNING: some "recovery" CDs will not use raid10 as swap. This may be a problem on small memory systems, and the swap may need to be started and enabled manually.

Note for further writing: Maybe something on /var and /tmp could go here.

The rest of the file systems

Other file systems can also be protected against one failing disk. Which technique to recommend depends on what you use the disk space for. You may mix the different raid types if you have different types of use on the same server, for example a database and serving of large files from the same server. (This is one of the advantages of software raid over hardware raid: with software raid you may have different types of raid on one disk, whereas a hardware raid may only take one type for the whole disk.)

If disk capacity is the main priority and you have more than 2 drives, then raid5 is recommended. Raid5 only uses the space of 1 drive for securing the data, while raid1 and raid10 use at least half the capacity for duplicating the data. For example, with 4 drives raid5 provides 75 % of the total disk space as usable, while raid1 and raid10 give at most 50 % (depending on the number of copies). This becomes even better for raid5 with more disks: with 10 disks you only use 10 % for redundancy.
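
The percentages above follow from simple arithmetic, which the following sketch works out for a given number of drives. The function name usable_percent is made up for this example, the results are rounded down to whole percent, and all drives are assumed to be of equal size:

```shell
# Usable share of the raw disk space, in whole percent (rounded down).
# $1: raid level (1, 5, 6 or 10)
# $2: number of drives
# $3: number of copies (raid10 only; defaults to 2)
usable_percent() {
    level=$1 drives=$2 copies=${3:-2}
    case $level in
        1)  echo $(( 100 / drives )) ;;                 # every drive holds a full copy
        5)  echo $(( 100 * (drives - 1) / drives )) ;;  # one drive's worth of parity
        6)  echo $(( 100 * (drives - 2) / drives )) ;;  # two drives' worth of parity
        10) echo $(( 100 / copies )) ;;                 # each block is stored "copies" times
        *)  echo "unsupported raid level: $level" >&2; return 1 ;;
    esac
}
```

For example, "usable_percent 5 4" gives 75 and "usable_percent 5 10" gives 90, matching the figures in the text, while "usable_percent 10 4" gives 50.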

If speed is your main priority, then raid10,f2, raid10,o2 or raid1 would give you the most speed during normal operation. This even works if you only have 2 drives.

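If speed with a failed disk is a concern, then raid10,o2 could be the choice, as raid10,f2 is somewhat slower in operation when a disk has failed.

Create the array with the one command below that matches your choice (raid10,f2, raid10,o2 or raid5, in that order):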

  mdadm --create /dev/md4 --chunk=256 -R -l 10 -n 2 -p f2 /dev/sd[ab]5
  mdadm --create /dev/md4 --chunk=256 -R -l 10 -n 2 -p o2 /dev/sda5 /dev/sdb5
  mdadm --create /dev/md4 --chunk=256 -R -l  5 -n 4       /dev/sd[abcd]5


mdadm has an easy way to generate the /etc/mdadm.conf file:

   mdadm --detail --scan  > /etc/mdadm.conf

You should have mdadm report any errors that happen. This can be done by adding a MAILADDR line to /etc/mdadm.conf:

   echo "MAILADDR root" >> /etc/mdadm.conf

Or you could use an email address for the notification instead of root.
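
Appending with ">>" adds another MAILADDR line every time the command is repeated. The following is a minimal sketch of a variant that only adds the line when none is present yet. The function name ensure_mailaddr is made up for this example, and the check only recognises the keyword when it is written out in full, in capitals, at the start of a line:

```shell
# Add a MAILADDR line to an mdadm configuration file unless one exists.
# $1: configuration file
# $2: mail address (defaults to root)
ensure_mailaddr() {
    conf=$1 addr=${2:-root}
    if grep -q '^[[:space:]]*MAILADDR[[:space:]]' "$conf" 2>/dev/null; then
        return 0    # already configured; leave the file alone
    fi
    printf 'MAILADDR %s\n' "$addr" >> "$conf"
}
```

A possible use is "ensure_mailaddr /etc/mdadm.conf root", which can be run repeatedly without changing the file after the first time.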

Start monitoring the raids, e.g. by:

   mdadm --monitor --scan --daemonise

Test that email notification works by having mdadm send a test message for each array and then exit:

   mdadm --monitor --scan --oneshot --test
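
In addition to the mail notification, the state of the arrays can be read from /proc/mdstat, for example after a test where one disk was disconnected. The following is a minimal sketch that lists arrays with a missing member. The function name degraded_arrays is made up for this example, and the parsing relies on the status field that looks like [UU], where "_" marks a missing member:

```shell
# Print the names of md arrays that have at least one missing member.
# $1: file in /proc/mdstat format (defaults to /proc/mdstat)
degraded_arrays() {
    awk '
        # Remember the array name from lines like "md1 : active raid1 ...".
        /^md[^ ]* :/ { name = $1 }
        # A status field such as [U_] contains "_" for each missing member.
        /\[[U_]*_[U_]*\]/ { print name }
    ' "${1:-/proc/mdstat}"
}
```

If the function prints nothing, no array is reported as degraded.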

Recommendations for the setup of larger servers

Given a larger server setup with more disks, it is possible to survive more than one disk failure. The raid6 array type can be used to survive 2 disk failures, at the expense of the space of 2 disks. The /boot, root and swap partitions can be set up with more disks, for example a /boot partition made up from a raid1 of 3 disks, and root and swap partitions made up from raid10,f3 arrays. Given that raid6 cannot survive the failure of more than 2 disks, the system disks need not be prepared for more than 2 disk failures either, and you can use the rest of the disk IO capacity to speed up the system.
