RAID Boot

From Linux Raid Wiki

Latest revision as of 09:07, 9 July 2008

Historically, when the kernel booted, it used a mechanism called 'autodetect' to identify partitions which are used in RAID arrays: it assumed that all partitions of type 0xfd are so used. It then attempted to automatically assemble and start these arrays.

This approach can cause problems in several situations (imagine moving part of an old array onto another machine before wiping and repurposing it: reboot and watch in horror as the piece of dead array gets assembled as part of the running RAID array, ruining it); kernel autodetect is correspondingly deprecated.

The recommended approach now is to use the initramfs system.

This system is documented in more detail than you're likely to care about in the file Documentation/filesystems/ramfs-rootfs-initramfs.txt in the kernel source tree, but in brief it allows you to store in the kernel image a nonswappable in-memory filesystem (the 'rootfs') which is uncompressed as the root filesystem as the kernel boots; the kernel runs /init and leaves it to find the root filesystem, chroot into it, and execute the real /sbin/init. It's sort of like the old initrd system, only your image never gets out of sync with the kernel, it's much easier to build the image (the kernel build system can put it together for you), the kernel can always find it, and there's no overcomplicated scheme for switching to the real root filesystem: you can just chroot.

This approach provides a great deal of flexibility: you can get your root filesystem from LVM layered over a RAID array stored on a dozen network block devices on machines in Guatemala, San Diego, and Tokyo if you really need to (although that particular combination might be a bit slow without careful use of 'write-mostly'). I've even heard that some people have a C compiler on there, and recompile third-party modules for the running kernel on the fly from the source code!

But this flexibility comes at a price, and getting the thing working in the first place is a bit tricky. If init isn't PID 1, you're in trouble; if you've left anything in the rootfs before chrooting, you've lost the memory it was in forever; if you mess up population, you've got a useless kernel image; and populating it is sort of like working on an embedded system, because unless you want a 20MB kernel image you'd better use small tools, like busybox, and preferably a small libc, like uClibc.

But it's useful if you want to mount RAID arrays before booting: e.g., examining the partitions and assembling arrays with a defined UUID, while leaving you with enough emergency repair facilities to figure out what's wrong if assembly fails.
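
As a rough sketch of what "assembling arrays with a defined UUID" looks like on the command line: the `run` wrapper, the `DRY_RUN` variable and the two function names below are illustrative assumptions, not part of mdadm or of the init script on this page. The mdadm options used (`--examine --scan`, `--assemble`, `--uuid=`) are standard ones.

```shell
#!/bin/sh
# Sketch only: run, DRY_RUN, list_arrays and assemble_by_uuid are
# hypothetical helpers for illustration.

# Print the command instead of executing it when DRY_RUN is set.
run() {
    if [ -n "$DRY_RUN" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Report the arrays (with their UUIDs) described by the RAID superblocks
# on the devices mdadm can see.
list_arrays() {
    run /sbin/mdadm --examine --scan
}

# Assemble one array from the candidate devices that carry the given UUID.
# mdadm excludes any listed device whose superblock has a different UUID,
# so a stray member of some other array is left alone.
# usage: assemble_by_uuid /dev/mdN UUID candidate-device...
assemble_by_uuid() {
    md=$1
    uuid=$2
    shift 2
    run /sbin/mdadm --assemble "$md" "--uuid=$uuid" "$@"
}
```

With `DRY_RUN=1` set, `assemble_by_uuid /dev/md0 some:long:uuid:here /dev/sda1 /dev/sdb1` only prints the mdadm command it would have run, which is a safe way to check it before trying it for real.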

mdadm comes with such a script, but choice is good, so here's another. It has a number of improvements over the mdadm variation:

  • It handles LVM2 as well as md (obviously if you boot off RAID you still have to boot off RAID1, but /boot can be a RAID1 filesystem of its own now, with / in LVM, on RAID, or both at once; you don't even need md on the machine anymore)
  • You can leave lvm or mdadm off the image if you don't need them, so you can use the same initramfs for many machines, only some of which use RAID or LVMed root filesystems
  • It fscks / before mounting it
  • If anything goes wrong, it drops you into an emergency shell in the rootfs, where you have all the power of ash with hardly any builtin commands, lvm and mdadm to diagnose your problem!
  • It supports a number of arguments: 'rescue', to drop into /bin/ash instead of init after mounting the real root filesystem, 'emergency', to drop into a shell on the initramfs before doing anything, and 'trace', to turn on shell tracing early in the init script execution, so if something's failing with a bizarre error message, you can tell what it is. It also supports numeric arguments 1 to 5, 'single' and '-b', which just get passed down to init.
  • It supports root= and init= arguments, although for arcane reasons to do with LILO suckage you need to pass the root argument as 'root=LABEL=/dev/some/device', or LILO will helpfully transform it into a device number, which is rarely useful if the device name is, say, /dev/emergency-volume-group/root. It also supports root-type= and root-options= arguments, so you can mount root with noatime or force a particular filesystem type should you need to.

The default VG name, root device name, mount options and filesystem type are all derived from the entry for / in /etc/fstab. This will generally do the right thing, unless your root filesystem isn't on LVM and you use a name like /dev/disk/by-label/root: the initramfs doesn't have udev on it and so won't understand such labels. You could fix this by passing the root= argument or by having a second fstab used just for the initramfs: both will work.

  • It doesn't waste memory. initramfs isn't like initrd: if you just chroot into the new root filesystem, the data in the initramfs stays around, in nonswappable kernel memory. And it's not gzipped by that point, either!
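
The fstab-derived defaults described above can be sketched in plain shell. This is an illustration under assumptions, not the script's own code: `root_fstab_entry` and `vg_of` are hypothetical names, and the sketch relies on `read` splitting fields on runs of spaces and tabs instead of collapsing them with sed.

```shell
#!/bin/sh
# Sketch only: root_fstab_entry and vg_of are hypothetical helper names.

# Print "device type options" for the fstab entry whose mount point is "/".
# usage: root_fstab_entry /path/to/fstab
root_fstab_entry() {
    while read -r dev mnt type opts rest; do
        case "$dev" in
            ''|'#'*) continue ;;      # skip blank lines and comments
        esac
        if [ "$mnt" = "/" ]; then
            echo "$dev $type $opts"
            return 0
        fi
    done < "$1"
    return 1
}

# Print the volume group name if the device looks like /dev/VG/LV, and
# nothing otherwise (roughly the heuristic the init script applies).
vg_of() {
    case "$1" in
        /*/*/*/*) ;;                  # deeper paths: assume no VG
        /*/*/*)   vg=${1#/*/}         # strip the leading "/dev/"
                  echo "${vg%%/*}" ;; # keep the middle component
    esac
}
```

Given a root device of /dev/raid/root, `vg_of` prints `raid`; for /dev/sda1 or /dev/disk/by-label/root it prints nothing, which is the case where you would pass root= by hand.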

The downsides:

  • It needs busybox 1.2 or later, and a 2.6.12+ kernel with sysfs and hotplug support; this is because it populates /dev with the `mdev' mini-udev tool inside busybox, and switches root filesystems with the `switch_root' tool, which chroots only after erasing the entire contents of the initramfs (taking great care not to recurse off that filesystem!)
  • If you link against uClibc you'll need mdadm 2.5.2 or later: earlier versions will crash.
  • If you link against uClibc (recommended), you need a CVS uClibc too (i.e., one newer than 0.9.27).
  • It doesn't try to e.g. set up the network: changing the script to do that isn't likely to be terribly difficult.
  • You need an /etc/mdadm.conf (if using md) and an /etc/lvm/lvm.conf, both taken by default from the system you built the kernel on: personally I'd recommend a really simple mdadm.conf with no devices= entries on its ARRAY lines, like
DEVICE partitions
ARRAY /dev/md0 UUID=some:long:uuid:here
ARRAY /dev/md1 UUID=another:long:uuid:here
ARRAY /dev/md2 UUID=yetanother:long:uuid:here
...

(I might change this to use --homehost in future, whereupon you'd only need to provide a file giving the hostname; but --homehost isn't widely-enough available yet, and I haven't tried it myself.)
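
One way to produce a file of that shape is to filter the ARRAY lines that `mdadm --examine --scan` prints, keeping only the device name and the UUID. The following is a minimal sketch; `minimal_mdadm_conf` is a hypothetical helper name, and it assumes each ARRAY line arrives on a single line with a `UUID=` word somewhere in it.

```shell
#!/bin/sh
# Sketch only: minimal_mdadm_conf is a hypothetical helper name.

# Read ARRAY lines on stdin; write a minimal mdadm.conf on stdout that
# identifies each array by UUID alone.
minimal_mdadm_conf() {
    echo "DEVICE partitions"
    while read -r keyword dev rest; do
        [ "$keyword" = "ARRAY" ] || continue
        uuid=
        for word in $rest; do
            case "$word" in
                UUID=*) uuid=$word ;;
            esac
        done
        if [ -n "$uuid" ]; then
            echo "ARRAY $dev $uuid"
        fi
    done
}
```

Something like `mdadm --examine --scan | minimal_mdadm_conf > mdadm.conf.initramfs` would then give you a candidate file to review before pointing the initramfs source script at it (the output filename is just an example).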

Here's the init script:

#!/bin/ash
#
# init --- locate and mount root filesystem
#          By Nix <nix@esperi.org.uk>.
#
#          Placed in the public domain.
#

export PATH=/sbin:/bin

/bin/mount -t proc proc /proc
/bin/mount -t sysfs sysfs /sys
CMDLINE=`cat /proc/cmdline`

# Populate /dev from /sys

/bin/mount -t tmpfs tmpfs /dev
/sbin/mdev -s

# Locate the root filesystem's fstab entry; collapse spaces and tabs in it:
# extract its significant components. (There are three raw tabs in the next
# line, each next to a single space.)

FSENT=`sed -n '/[ 	]\/[ 	]/ { s,[ 	][ 	]*, ,g; p; }' < /etc/fstab`
ROOT="`echo $FSENT | tr ' ' '\n' | sed -n '1p'`"
TYPE="`echo $FSENT | tr ' ' '\n' | sed -n '3p'`"
OPTS="`echo $FSENT | tr ' ' '\n' | sed -n '4p'`"

# Parse arguments, engaging trace mode or dropping to rescue or emergency shells
# as needed. If there is a forced init program, root filesystem, root fs type or
# root fs options, accept the forcing.

INIT_ARGS=

for param in $CMDLINE; do
    case "$param" in
        init=*) eval "$param";;
        -b|single|s|S|[1-5]) INIT_ARGS="$INIT_ARGS $param";;
        trace) echo "Tracing init script.";
               set -x;;
        rescue) echo "Rescue boot mode: invoking ash.";
                init=/bin/ash;
                INIT_ARGS="-";;
        emergency) echo "Emergency boot mode. Dropping to a minimal shell.";
                   echo "Reboot with Ctrl-Alt-Delete.";
                   exec /bin/sh;;
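        # Use $param (the loop variable), not $1: $1 is init's own first
        # argument, not the kernel command-line word being examined.
        root=LABEL=*) ROOT=$(echo "$param" | cut -d= -f3-);;
        root-type=*) TYPE=$(echo "$param" | cut -d= -f2-);;
        root-options=*) OPTS=$(echo "$param" | cut -d= -f2-);;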
    esac
done

# Assemble the RAID arrays. We enable all that we can find, because we can't
# be sure which of them are needed to assemble the VG on which the root
# filesystem is located (if any). If you have RAID arrays which span devices
# which are not yet accessible, you'll probably want to add --no-degraded here,
# or build the initramfs with an mdadm.conf that does not mention the arrays
# you don't want assembled at this point.
#
# Perhaps we want to avoid starting degraded arrays no matter what, but I'd
# prefer my system to boot even if a drive fails.

if [ -x /sbin/mdadm ]; then
    /sbin/mdadm --assemble --scan --auto=md
fi

# If there are two slashes in the root filesystem location after the
# leading slash (e.g. /dev/raid/root), we assume that the middle
# component is the name of the volume group. Otherwise, we assume that
# no VG is involved.

VGNAME=
if [ "`echo $ROOT | sed 's,^/,,' | tr '/' '\n' | wc -l`" -eq 3 ]; then
    VGNAME="`echo $ROOT | sed 's,^/,,' | tr '/' '\n' | sed -n '2p'`"
fi

FAILED=

# Scan for volume groups. We activate only the group on which the
# root filesystem is stored; the other groups may span devices which
# are not yet accessible.

if [ -x /sbin/lvm ] && [ -n "$VGNAME" ]; then
    /sbin/lvm vgscan --ignorelockingfailure --mknodes && /sbin/lvm vgchange -ay --ignorelockingfailure $VGNAME
fi

# Check the filesystem.

fsck -t $TYPE -a $ROOT

if [ $? -eq 4 ]; then
    echo "Filesystem errors left uncorrected."
    echo
    echo "Dropping to a minimal shell.  Reboot with Ctrl-Alt-Delete."

    exec /bin/sh
fi

if [ -n "$ROOT" ]; then
    if [ -n "$OPTS" ]; then
        /bin/mount -o $OPTS -t $TYPE $ROOT /new-root
    else
        /bin/mount -t $TYPE $ROOT /new-root
    fi
fi

if /bin/mountpoint /new-root >/dev/null; then :; else
    echo "No root filesystem given to the kernel or found on the root RAID array."
    echo "Append the correct 'root=', 'root-type=', and/or 'root-options='"
    echo "boot options."
    echo
    echo "Dropping to a minimal shell.  Reboot with Ctrl-Alt-Delete."

    exec /bin/sh
fi

if [ -z "$init" ]; then
    init=/sbin/init
fi

# Unmount everything and switch root filesystems for good:
# exec the real init and begin the real boot process.
/bin/umount -l /proc
/bin/umount -l /sys
/bin/umount -l /dev

echo "Switching to /new-root and running '$init'"
exec switch_root /new-root $init $INIT_ARGS
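
The argument-parsing loop can be tried out on an ordinary system before it goes anywhere near an initramfs. The sketch below is an assumption-laden rewrite for testing, not the script itself: `parse_cmdline` is a hypothetical wrapper, it takes the command line as a string instead of reading /proc/cmdline, it leaves out the trace, rescue and emergency cases, and it uses POSIX `${var#prefix}` expansion where the script uses cut.

```shell
#!/bin/sh
# Sketch only: parse_cmdline is a hypothetical wrapper for testing.

# Parse a kernel command line passed as one string. ROOT, TYPE and OPTS
# keep whatever defaults the caller set unless the command line overrides
# them; init and INIT_ARGS start out empty.
parse_cmdline() {
    init=
    INIT_ARGS=
    for param in $1; do
        case "$param" in
            init=*)              init=${param#init=} ;;
            -b|single|s|S|[1-5]) INIT_ARGS="$INIT_ARGS $param" ;;
            root=LABEL=*)        ROOT=${param#root=LABEL=} ;;
            root-type=*)         TYPE=${param#root-type=} ;;
            root-options=*)      OPTS=${param#root-options=} ;;
        esac
    done
}
```

For example, after setting `ROOT=/dev/raid/root`, calling `parse_cmdline "ro root=LABEL=/dev/emergency-volume-group/root single"` leaves ROOT pointing at the emergency volume group and passes `single` on to init, while unrecognised words such as `ro` are ignored.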

Here's the initramfs source script, by default named usr/initramfs; you'll need to adjust it to pick up tools from the right place (they have to be linked statically; those in the default location are probably linked dynamically). You can omit 'mdadm' if you like, so you can use the same init script on machines with root on LVM and with root on LVM on RAID, changing only the initramfs source script. You can omit 'lvm' similarly.

#
# Files needed for early userspace.
# Placed in the public domain.
#

dir /bin 0755 0 0
file /bin/busybox /usr/i686-pc-linux-uclibc/bin/busybox 0755 0 0
slink /bin/sh /bin/busybox 0755 0 0
slink /bin/ash /bin/busybox 0755 0 0
slink /bin/[ /bin/busybox 0755 0 0
slink /bin/[[ /bin/busybox 0755 0 0
slink /bin/test /bin/busybox 0755 0 0
slink /bin/mount /bin/busybox 0755 0 0
slink /bin/umount /bin/busybox 0755 0 0
slink /bin/cat /bin/busybox 0755 0 0
slink /bin/echo /bin/busybox 0755 0 0
slink /bin/false /bin/busybox 0755 0 0
slink /bin/ls /bin/busybox 0755 0 0
slink /bin/mountpoint /bin/busybox 0755 0 0
slink /bin/mkdir /bin/busybox 0755 0 0
slink /bin/sed /bin/busybox 0755 0 0
slink /bin/true /bin/busybox 0755 0 0
slink /bin/tr /bin/busybox 0755 0 0
slink /bin/wc /bin/busybox 0755 0 0
dir /sbin 0755 0 0
slink /sbin/mdev /bin/busybox 0755 0 0
slink /sbin/fsck /bin/busybox 0755 0 0
slink /sbin/e2fsck /bin/busybox 0755 0 0
slink /sbin/fsck.ext2 /bin/busybox 0755 0 0
slink /sbin/fsck.ext3 /bin/busybox 0755 0 0
slink /sbin/switch_root /bin/busybox 0755 0 0
# (the next two are optional)
file /sbin/mdadm /usr/i686-pc-linux-uclibc/sbin/mdadm 0755 0 0
file /sbin/lvm /usr/i686-pc-linux-uclibc/sbin/lvm 0755 0 0
file /init usr/init 0755 0 0

# supporting directories
dir /proc 0755 0 0
dir /sys 0755 0 0
dir /new-root 0755 0 0
dir /etc 0755 0 0
file /etc/fstab /etc/fstab 0644 0 0
dir /etc/lvm 0755 0 0
# (the next two are optional)
file /etc/lvm/lvm.conf /etc/lvm/lvm.conf 0644 0 0
file /etc/mdadm.conf /etc/mdadm.conf 0644 0 0

# initial device files required (mdev creates the rest)
dir /dev 0755 0 0
nod /dev/console 0600 0 0 c 5 1
nod /dev/null 0666 0 0 c 1 3
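
Since a mistake in this list leaves you with a useless kernel image, it can be worth checking that every `file` entry points at something that exists before you build. This is a minimal sketch; `missing_initramfs_sources` is a hypothetical helper name, and it checks only for existence, not for static linking.

```shell
#!/bin/sh
# Sketch only: missing_initramfs_sources is a hypothetical helper name.

# Print the source path of every "file" entry whose source does not exist,
# and return non-zero if there were any. Relative sources (such as
# usr/init) are resolved against the current directory, so run this from
# the top of the kernel tree.
# usage: missing_initramfs_sources usr/initramfs
missing_initramfs_sources() {
    status=0
    while read -r kind target source rest; do
        [ "$kind" = "file" ] || continue   # dir, slink, nod, comments
        if [ ! -e "$source" ]; then
            echo "$source"
            status=1
        fi
    done < "$1"
    return $status
}
```

Run from the kernel tree as `missing_initramfs_sources usr/initramfs`, it prints nothing when all the sources are in place.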

Here's the busybox config file I used for all this. You will need to change the CROSS_COMPILER_PREFIX and the EXTRA_CFLAGS_OPTIONS, and you might want to build in more tools as well for use when things go really wrong, in emergency mode. This config file changed in critical ways on 2006-08-25: if you used an earlier version, you'll have to rebuild busybox:

HAVE_DOT_CONFIG=y
CONFIG_FEATURE_BUFFERS_USE_MALLOC=y
CONFIG_FEATURE_DEVPTS=y
CONFIG_STATIC=y
CONFIG_LFS=y
USING_CROSS_COMPILER=y
CROSS_COMPILER_PREFIX="/usr/bin/i686-pc-linux-uclibc-"
EXTRA_CFLAGS_OPTIONS="-march=pentium3 -fomit-frame-pointer"
CONFIG_INSTALL_NO_USR=y
CONFIG_INSTALL_APPLET_SYMLINKS=y
PREFIX="./_install"
CONFIG_MD5_SIZE_VS_SPEED=2
CONFIG_CAT=y
CONFIG_CUT=y
CONFIG_ECHO=y
CONFIG_FALSE=y
CONFIG_LS=y
CONFIG_MKDIR=y
CONFIG_TEST=y
CONFIG_TR=y
CONFIG_TRUE=y
CONFIG_WC=y
CONFIG_FEATURE_AUTOWIDTH=y
CONFIG_SED=y
CONFIG_E2FSCK=y
CONFIG_FSCK=y
FDISK_SUPPORT_LARGE_DISKS=y
CONFIG_MDEV=y
CONFIG_MOUNT=y
CONFIG_SWITCH_ROOT=y
CONFIG_UMOUNT=y
CONFIG_MOUNTPOINT=y
CONFIG_FEATURE_SH_IS_ASH=y
CONFIG_ASH=y
CONFIG_ASH_OPTIMIZE_FOR_SIZE=y
CONFIG_FEATURE_SH_EXTRA_QUIET=y
CONFIG_FEATURE_SH_STANDALONE_SHELL=y
CONFIG_FEATURE_COMMAND_EDITING=y
CONFIG_FEATURE_COMMAND_EDITING_VI=y
CONFIG_FEATURE_COMMAND_HISTORY=15
CONFIG_FEATURE_COMMAND_TAB_COMPLETION=y
CONFIG_FEATURE_IPC_SYSLOG_BUFFER_SIZE=0
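
If you trim or extend this configuration, a quick check that the applets the init script relies on are still enabled can save a rebuild. The helper below is a sketch under assumptions: `missing_applets` is a hypothetical name, and it simply looks for `CONFIG_<NAME>=y` lines in the config file you give it.

```shell
#!/bin/sh
# Sketch only: missing_applets is a hypothetical helper name.

# Print each applet (given by its CONFIG_ suffix) that the busybox config
# file does not enable, and return non-zero if any are missing.
# usage: missing_applets .config CUT SED TR WC ...
missing_applets() {
    config=$1
    shift
    status=0
    for applet in "$@"; do
        if ! grep -q "^CONFIG_${applet}=y\$" "$config"; then
            echo "$applet"
            status=1
        fi
    done
    return $status
}
```

For the script on this page, something like `missing_applets .config ASH CAT CUT SED TR WC MOUNT UMOUNT MOUNTPOINT MDEV FSCK SWITCH_ROOT` covers the tools it calls.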