Protecting against a failing disk
The following describes how to prepare a system to survive the failure of one disk. This can be important for a server that is intended to run all the time. The description is mostly aimed at small servers, but it can also be used for workstations, to protect them against losing data and to keep them running even if a disk fails. Some recommendations on larger server setups are given at the end of the howto.
This requires some extra hardware, especially disks, and the description will also touch on how to make the most of the disks, be it in terms of available disk space or input/output speed.
The text reflects work with 2.6.12 and 2.6.24 kernels, but may apply to kernels before and after these versions.
We recommend creating partitions for /boot, root, swap and other file systems. This can be done with fdisk, parted or a graphical interface such as the Mandriva/PCLinuxOS harddrake2. It is recommended to use drives with equal sizes and performance characteristics. You can start by laying out the partitions of the first disk only, set the type of each partition and so on for this disk, and then copy the final layout to another disk with a single sfdisk command.
If we are using the 2 drives sda and sdb, and we assume you have created partitions 1, 2, 3 and 5, then sfdisk may be used to make all the partitions into raid partitions (type fd):
sfdisk -c /dev/sda 1 fd
sfdisk -c /dev/sda 2 fd
sfdisk -c /dev/sda 3 fd
sfdisk -c /dev/sda 5 fd
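With more partitions or more disks, the repeated commands above can be generated by a small loop. The following is only a sketch: raid_type_cmds is a hypothetical helper name, and it prints the commands for review instead of running them, so that nothing is changed on disk by accident.

```shell
#!/bin/sh
# Hypothetical helper (not part of sfdisk): print, but do not run, the
# sfdisk commands that set the given partitions of one disk to type fd
# (Linux raid autodetect).
raid_type_cmds() {
    disk=$1
    shift
    for part in "$@"; do
        echo "sfdisk -c $disk $part fd"
    done
}

# Print the four commands used above for /dev/sda.
raid_type_cmds /dev/sda 1 2 3 5
```

Once the printed commands look right, they can be run as root, for example by piping the output to sh.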
Copy the final layout of sda to sdb:
sfdisk -d /dev/sda | sfdisk /dev/sdb
Listing both disks with fdisk shows the result. The partition layout could then look like this:
fdisk -l /dev/sda /dev/sdb
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1          37      297171   fd  Linux raid autodetect
/dev/sda2              38        1132     8795587+  fd  Linux raid autodetect
/dev/sda3            1133        1619     3911827+  fd  Linux raid autodetect
/dev/sda4            1620      121601   963755415    5  Extended
/dev/sda5            1620      121601   963755383+  fd  Linux raid autodetect

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1          37      297171   fd  Linux raid autodetect
/dev/sdb2              38        1132     8795587+  fd  Linux raid autodetect
/dev/sdb3            1133        1619     3911827+  fd  Linux raid autodetect
/dev/sdb4            1620      121601   963755415    5  Extended
/dev/sdb5            1620      121601   963755383+  fd  Linux raid autodetect
Prepare for boot
The system should be set up to boot from multiple devices, so that if one disk fails, the system can boot from another disk.
NOTE: if the first disk fails, some BIOSes will consider the 2nd disk as hdc, while others will use the physical location. SATA drives may also be "moved", and udev may apply interesting and unintuitive names for the devices in these cases. Use of the "UUID" notation to identify raid array members is therefore desirable.
Some BIOSes will not allow booting from more than one disk. If this is the case, you may physically move the second disk to the cable of the first disk.
It is possible that the BIOS needs to be set up to be able to boot from multiple drives, including the case where one of the disks is not present. It is recommended that you test your setup once it is completed, and see if it still works when you disconnect power from each of the disks involved in turn. Only do this with power to the whole computer shut off.
On Intel-compatible hardware, there are two common boot loaders, grub and lilo. Both grub and lilo can only boot off a raid1 device. They cannot boot off any other software raid device type. The reason they can boot off a raid1 is that they see the raid1 as a normal disk, and then only use one of the disks when booting. The boot stage only involves loading the kernel with an initrd image as its root file system, so not much data is needed for this. The kernel, the initrd and other boot files can be put in a small /boot partition. We recommend something like 200 MB on a raid1 with ext3.
Make the raid1 and ext3 filesystem:
mdadm --create /dev/md1 --chunk=256 -R -l 1 -n 2 /dev/sda1 /dev/sdb1
mke2fs -j /dev/md1
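Remember to have the root entry in /etc/lilo.conf and the root file system in /etc/fstab set to the array that holds the root file system: root=/dev/md1 if root shares the array with /boot, or root=/dev/md2 with the separate root array created later in this howto.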
Make each of the disks bootable by lilo:
lilo -b /dev/sda
lilo -b /dev/sdb
Do this with /dev/md1 mounted on /boot, and with the kernel, initrd etc moved to the /boot file system. This method also allows us to use a raid type for the root file system which is not necessarily raid1.
Make each of the disks bootable by grub
Grub has its own naming scheme, in which the first disk is called hd0, even for disks that are known in the file system as /dev/sda and the like. hd0 is always the disk that grub boots from, so it is necessary to rename each of the alternate boot devices via the 'device' command, as in the following script.
Here is a script to install grub on the Master Boot Record (MBR) of 2 disks: sda, sdb:
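grub --no-floppy <<EOF
root (hd0,0)
setup (hd0)
device (hd0) /dev/sdb
root (hd0,0)
setup (hd0)
EOF
Note that grub counts partitions from 0, so the /boot partition on sda1 and sdb1 is (hd0,0). If your /boot is on another partition, adjust the number accordingly.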
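For more than two disks, the grub commands can be generated rather than typed by hand. The sketch below assumes grub legacy, as used in this howto. grub_batch is a hypothetical helper name, and its first argument is the /boot partition number in grub's own numbering, which starts at 0. It only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper: print the grub (legacy) commands that install the
# boot loader on the MBR of every listed disk.
# Usage: grub_batch PART DISK...
#   PART is the /boot partition in grub numbering (first partition = 0).
grub_batch() {
    part=$1
    shift
    for disk in "$@"; do
        # Map each disk to hd0 in turn, since grub boots from hd0.
        echo "device (hd0) $disk"
        echo "root (hd0,$part)"
        echo "setup (hd0)"
    done
}

# Print the commands for two disks with /boot on the first partition.
grub_batch 0 /dev/sda /dev/sdb
```

After checking the output, it could be fed to grub in the same way as the script above, for example by piping it into grub --no-floppy.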
The root file system
Note that booting the root file system requires that the initrd file system has the required drivers for the SATA disk controllers and the raid types, and this is not standard for all initrd files, even for some of the biggest distributions. You may need to build your own initrd.
Here is an initrd 'init' script modified from a file generated with mkinitrd. sata_nv and sata_sil are the drivers for two SATA controllers:
#!/bin/nash
echo "Loading linear.ko module"
insmod /lib/linear.ko
echo "Loading multipath.ko module"
insmod /lib/multipath.ko
echo "Loading raid10.ko module"
insmod /lib/raid10.ko
echo "Loading xor.ko module"
insmod /lib/xor.ko
echo "Loading raid5.ko module"
insmod /lib/raid5.ko
echo "Loading raid6.ko module"
insmod /lib/raid6.ko
echo "Loading jbd.ko module"
insmod /lib/jbd.ko
echo "Loading ext3.ko module"
insmod /lib/ext3.ko
echo "Loading scsi_mod.ko module"
insmod /lib/scsi_mod.ko
echo "Loading sd_mod.ko module"
insmod /lib/sd_mod.ko
echo "Loading libata.ko module"
insmod /lib/libata.ko
echo "Loading sata_nv.ko module"
insmod /lib/sata_nv.ko
echo "Loading sata_sil.ko module"
insmod /lib/sata_sil.ko
echo Mounting /proc filesystem
mount -t proc /proc /proc
echo Mounting sysfs
mount -t sysfs none /sys
echo Creating device files
mountdev size=32M,mode=0755
echo -n /sbin/hotplug > /proc/sys/kernel/hotplug
mkdir /dev/.udevdb
mkdevices /dev
echo Activating md devices
mknod /dev/md2 b 9 2
mknod /dev/md/2 b 9 2
mknod /dev/md1 b 9 1
mknod /dev/md/1 b 9 1
mknod /dev/md3 b 9 3
mknod /dev/md/3 b 9 3
mknod /dev/md4 b 9 4
mknod /dev/md/4 b 9 4
mdassemble
echo Creating root device
mkrootdev /dev/root
resume
echo Mounting root filesystem /dev/root with flags defaults,noatime
mount -o defaults,noatime --ro -t ext3 /dev/root /sysroot
echo Switching to new root
switchroot --movedev /sysroot
showlabels
echo Initrd finished
The root file system can be on another raid than the /boot partition. We recommend a raid10,f2, as the root file system will mostly see reads, and the raid10,f2 raid type is the fastest for reads, while also sufficiently fast for writes. Other relevant raid types would be raid10,o2 or raid1.
It is recommended to use the udev file system, as this runs in RAM, and you can thus avoid a number of reads and writes to disk.
It is recommended that all file systems are mounted with the noatime option. This avoids a write to the file's inode, to update its access time, every time a file is read.
Make the raid10,f2 and ext3 filesystem:
mdadm --create /dev/md2 --chunk=256 -R -l 10 -n 2 -p f2 /dev/sda2 /dev/sdb2
mkfs -t ext3 /dev/md2
The swap file system
If a disk fails that processes are swapped to, then all these processes fail. These may be vital processes for the system, or vital jobs on the system. You can prevent the processes from failing by having the swap partitions on a raid. The swap area needed is normally relatively small compared to the overall disk space available, so we recommend the faster raid types over the more space-economic ones. The raid10,f2 type seems to be the fastest here; other relevant raid types could be raid10,o2 or raid1.
To create a raid array, and make the swap partition directly on it:
mdadm --create /dev/md3 --chunk=256 -R -l 10 -n 2 -p f2 /dev/sda3 /dev/sdb3
mkswap /dev/md3
In /etc/fstab you then need the following line
/dev/md3 swap swap defaults 0 0
WARNING: some "recovery" CDs will not use raid10 as swap. This may be a problem on small memory systems, and the swap may need to be started and enabled manually.
Note for further writing: Maybe something on /var and /tmp could go here.
The rest of the file systems
Other file systems can also be protected against one failing disk. Which technique to recommend depends on what you use the disk space for. You may mix the different raid types if you have different types of use on the same server, for example a database and serving of large files from the same server. (This is one of the advantages of software raid over hardware raid: with software raid you may have different types of raid on one disk, where a hardware raid may only take one type for the whole disk.)
If disk capacity is the main priority, and you have more than 2 drives, then raid5 is recommended. Raid5 only uses the space of 1 drive for securing the data, while raid1 and raid10 use at least half the capacity for duplicating the data. For example with 4 drives, raid5 provides 75 % of the total disk space as usable, while raid1 and raid10 give at most 50 % (dependent on the number of copies). This becomes even better for raid5 with more disks: with 10 disks you only use 10 % for redundancy.
If speed is your main priority, then raid10,f2, raid10,o2 or raid1 would give you the most speed during normal operation. This even works if you only have 2 drives.
If speed with a failed disk is a concern, then raid10,o2 could be the choice, as raid10,f2 is somewhat slower in operation when a disk has failed.
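The capacity figures above follow from simple arithmetic, sketched here as a shell function. usable_percent is a hypothetical helper name, and the sketch assumes equally sized drives: raid5 spends one drive on parity, raid6 two, raid1 keeps one copy per drive, and raid10 keeps the given number of copies (2 if not stated). Results are rounded down to whole percent.

```shell
#!/bin/sh
# Hypothetical helper: print the usable share of the raw disk space, in
# whole percent (rounded down), assuming equally sized drives.
# Usage: usable_percent LEVEL NDISKS [COPIES]
usable_percent() {
    level=$1
    n=$2
    copies=${3:-2}   # number of copies for raid10, default 2
    case $level in
        raid1)  echo $(( 100 / n )) ;;            # every drive holds a full copy
        raid5)  echo $(( 100 * (n - 1) / n )) ;;  # one drive's worth of parity
        raid6)  echo $(( 100 * (n - 2) / n )) ;;  # two drives' worth of parity
        raid10) echo $(( 100 / copies )) ;;       # each block stored 'copies' times
        *)      echo "unknown level: $level" >&2; return 1 ;;
    esac
}

usable_percent raid5 4      # prints 75
usable_percent raid10 4 2   # prints 50
usable_percent raid5 10     # prints 90
```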
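Create the array with one of the following commands, for raid10,f2 on 2 drives, raid10,o2 on 2 drives, or raid5 on 4 drives respectively:
mdadm --create /dev/md4 --chunk=256 -R -l 10 -n 2 -p f2 /dev/sd[ab]5
mdadm --create /dev/md4 --chunk=256 -R -l 10 -n 2 -p o2 /dev/sda5 /dev/sdb5
mdadm --create /dev/md4 --chunk=256 -R -l 5 -n 4 /dev/sd[abcd]5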
mdadm has an easy way to generate the /etc/mdadm.conf file. Note that this overwrites any existing file:
mdadm --detail --scan > /etc/mdadm.conf
You should have mdadm report if any errors happen. This can be done by adding a MAILADDR line in /etc/mdadm.conf
echo "MAILADDR root" >> /etc/mdadm.conf
Or you could use an email address for the notification instead of root.
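Running the echo command above more than once leaves several MAILADDR lines in the file. A small guard avoids that. This is only a sketch: add_mailaddr is a hypothetical helper name, and it only looks for a line starting with the exact keyword MAILADDR in upper case.

```shell
#!/bin/sh
# Hypothetical helper: append a MAILADDR line to an mdadm config file,
# but only if the file does not already contain one.
# Usage: add_mailaddr CONFFILE ADDRESS
add_mailaddr() {
    conf=$1
    addr=$2
    if grep -q '^MAILADDR' "$conf" 2>/dev/null; then
        echo "MAILADDR already set in $conf" >&2
    else
        echo "MAILADDR $addr" >> "$conf"
    fi
}

# Example, to be run as root:
#   add_mailaddr /etc/mdadm.conf root
```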
Start monitoring the raids, e.g. by:
mdadm --monitor --scan --daemonise
Test that email notification works with:
mdadm --monitor --scan --test
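Besides the mail notification, the state of the arrays can be checked by hand in /proc/mdstat, where each array has a status field such as [UU], and an underscore marks a missing member, as in [U_]. The following sketch lists the arrays with a missing member. degraded_arrays is a hypothetical helper name, and it reads the mdstat text from standard input.

```shell
#!/bin/sh
# Hypothetical helper: read text in /proc/mdstat format on standard input
# and print the name of every array whose status field, e.g. [U_],
# contains an underscore (a missing or failed member).
degraded_arrays() {
    awk '
        /^md/ { name = $1 }                   # remember the current array name
        /\[[U_]*_[U_]*\]/ { print name }      # status field with a missing member
    '
}

# Example use on a running system:
#   degraded_arrays < /proc/mdstat
```

No output means that no array reported a missing member.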
Recommendations for the setup of larger servers
Given a larger server setup, with more disks, it is possible to survive more than one disk failure. The raid6 array type can be used to survive 2 disk failures, at the expense of the space of 2 disks. The /boot, root and swap partitions can be set up with more disks, for example a /boot partition made up from a raid1 of 3 disks, and root and swap partitions made up from raid10,f3 arrays. Given that raid6 cannot survive the failure of more than 2 disks, the system disks need not be prepared for more than 2 disk failures either, and you can use the rest of the disk IO capacity to speed up the system.