RAID Recovery

From Linux Raid Wiki

Notice: The pages "RAID Recovery" and "Recovering a failed software_RAID" both cover this topic. Make sure to read both before attempting anything.

When Things Go Wrong

There are two kinds of failure with RAID systems: failures that reduce the resilience and failures that prevent the raid device from operating.

Normally a single disk failure will degrade the raid device but it will continue operating (that is the point of RAID after all).

However there will come a point when enough component devices fail that the raid device stops working.

If this happens then first of all: don't panic. Seriously. Don't rush into anything; don't issue any commands that will write to the disks (such as mdadm -C, fsck, or even mount).

The first thing to do is to start to preserve information. You'll need data from /var/log/messages, dmesg etc.
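As a rough illustration of collecting that information in one place, here is a minimal sketch. The function name preserve_logs and the output file names are hypothetical, and including /proc/mdstat and the mdadm version is an assumption about what is worth keeping; /var/log/messages does not exist on every distribution, so it is copied only if readable. Point the destination at a disk that is not part of the array.

```shell
# Hypothetical helper: copy kernel and RAID state into a directory of your choice.
# It only reads from the system; nothing is written to the array members.
preserve_logs() {
    dest=$1
    mkdir -p "$dest" || return 1
    # Kernel version, useful when asking for help later
    uname -a > "$dest/uname.txt"
    # dmesg may need root on some systems; keep going if it fails
    dmesg > "$dest/dmesg.txt" 2>&1 || true
    # mdadm version (may be absent when this is run on another machine)
    mdadm --version > "$dest/mdadm-version.txt" 2>&1 || true
    # The syslog location is distribution-dependent; copy it only if present
    if [ -r /var/log/messages ]; then
        cp /var/log/messages "$dest/messages.txt"
    fi
    # Assumption: the current md state is also worth saving
    if [ -r /proc/mdstat ]; then
        cat /proc/mdstat > "$dest/mdstat.txt"
    fi
}
```

For example, preserve_logs /root/raid-evidence (the path is only an example).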

If there's valuable data on your array (and you don't have backups, which is a terrible sin no matter the RAID level you run and no matter how good your component devices are supposed to be, but nevertheless commonly observed), it's probably best to tap into the community's knowledge of how to deal with a broken array first. Make sure to contact someone who's experienced in dealing with md RAID problems and failures: either someone you know in person or, alternatively, the friendly and helpful folk on the linux-raid mailing list. Don't try to fix things by trial and error; have someone review the measures you're about to take to get your data back before putting them into practice.

There are bugs in older versions of mdadm, and a lot of "stable" operating system releases ship with really old mdadm versions. Recent versions are 3.2.x and 3.3.x. If you run into problems assembling your raid, it's advisable to upgrade to the latest git version of mdadm. If you can get the raid to assemble and recover successfully with the git version, then it's fine to use your old mdadm version again for normal system operation. Newer mdadm doesn't make any changes to the array that aren't backwards compatible.

Preserving RAID superblock information

One of the most useful things to do first, when trying to recover a broken RAID array, is to preserve the information reported in the RAID superblocks on each device at the time the array went down (and before you start trying to recreate the array). Something like

mdadm --examine /dev/sd[bcdefghijklmn]1 >> raid.status

(adjust this to suit your drives) creates a file, raid.status, which is a sequential listing of the mdadm --examine output for all the RAID devices on the system, in order. The file should also still be there five minutes later when we start messing with mdadm --create, which is the point.

Trying to assemble using --force

If your array won't assemble automatically, the first thing to do is check the reason for this (look at the kernel log with "dmesg" or check the log files). A frequent failure scenario is that the event counts of the devices do not match, which means mdadm won't assemble the array automatically. The event count is increased when writes are done to an array, so if the event counts differ by less than 50, then the information on the drive is probably still ok. The larger the difference, the more writes have been done to the filesystem since the lagging drive was last in the array, and the greater the risk that your data is in jeopardy.

Use mdadm --examine on all devices and check their event count:

mdadm --examine /dev/sd[a-z] | grep -E 'Event|/dev/sd'

If the event counts match closely but not exactly, use "mdadm --assemble --force /dev/mdX <list of devices>" to force mdadm to assemble the array anyway using the devices with the closest possible event counts. If the event count of a drive is way off, this probably means that drive has been out of the array for a long time and shouldn't be included in the assembly. Re-add it after the assembly so that it is synced up using information from the drives with the closest event counts.
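The event-count comparison can also be run against the saved raid.status file rather than the live devices. The following is a minimal sketch, assuming the dump contains the usual "Events : N" lines with plain integer counts; event_spread is a hypothetical helper name, not part of mdadm.

```shell
# Hypothetical helper: print the difference between the highest and lowest
# event counts found in a saved `mdadm --examine` dump.
event_spread() {
    awk -F: '
        # Lines of interest look like "         Events : 12345"
        /^[ \t]*Events[ \t]*:/ {
            n = $2 + 0
            if (!seen || n < min) min = n
            if (!seen || n > max) max = n
            seen = 1
        }
        END { if (seen) print max - min }
    ' "$1"
}
```

Running event_spread raid.status prints a single number. A small value is the case where a forced assembly is likely to be reasonable; a large one means at least one device has been out of the array for a while, and the per-device counts should be looked at individually.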

Recreating an array

When an array is created, superblocks are written to the drives and, according to the defaults of mdadm, a certain area of each drive is then considered the "data area". The data areas (which might or might not be correct) are not written to, *provided* the array is created in degraded mode; that is, with a 'missing' device. If the wrong superblock version, data offset (an internal default value which has changed over time in mdadm) or chunk size (also a value that has changed over time) is chosen, then the data area will not match what was previously on the drives. The md superblock might have overwritten part of your data. Use with caution!

So if you somehow screw up your array and can't remember how it was originally created, you can re-run the create command using various permutations until the data is readable.

This perl script is an un-tested prototype : permute_array.pl

Restore array by recreating (after multiple device failure)

Recreating should be considered a *last* resort, only to be used when everything else fails. Getting this wrong is one of the primary reasons people lose data. It is very commonly used way too early in the fault-finding process. You have been warned! It's better to send an email to the linux-raid mailing list with detailed information (mdadm --examine from all component drives plus log entries from when the failure happened, including mdadm and kernel version) and ask for advice than to try --create --assume-clean and get it wrong.

It's important to know that mdadm's --create defaults for chunk size, data offset, superblock version and other parameters have changed over time. Recreating with an mdadm whose defaults differ from the one originally used, without giving those parameters explicitly, will leave the array contents completely inaccessible.

This section applies to a RAID5 that has temporarily lost more than one device, or a RAID6 that has temporarily lost more than two devices, and that cannot be assembled even with --force because the devices are out of sync. It assumes that the devices themselves are available, the data is on them, and that our "failure" is e.g. the loss of a controller taking out four drives.

The author recently dealt with precisely this scenario during a reshape, and found a lot of information in the mailing list archives that he will aim to reproduce here, with examples. For the sake of example, assume we have a ten-disk RAID6 that's already lost two drives and is 80% of the way through a reshape, when we suddenly lose four drives at a stroke. Our broken array now looks like this:

Array State : A......AAA ('A' == active, '.' == missing),

with the four drives that went south looking like this:

Array State : AAAAA..AAA ('A' == active, '.' == missing)

and also having a lower event count. An attempt at assembling the array tells us that we have four drives out of ten, not enough to start the array. At this point, we have no option but to recreate the array, which involves telling mdadm --create which devices to use in what slots in order to put the array back together the way it was before. Assuming you made a dump of mdadm --examine as described above, you can do something like:

grep Role raid.status

and you will get output such as:

  Device Role : Active device 0
  Device Role : Active device 1
  Device Role : Active device 2
  Device Role : Active device 3
  Device Role : Active device 4
  Device Role : spare
  Device Role : spare
  Device Role : spare
  Device Role : Active device 9
  Device Role : Active device 8
  Device Role : Active device 7
  Device Role : spare
  Device Role : spare

(The example output comes from the 10-drive RAID6 described above.) Knowing that this list starts with /dev/sdb1 and works its way sequentially through to /dev/sdn1, we can work out that slots 0 to 4 are filled by /dev/sd[bcdef]1, slots 5 and 6 are missing, and slots 7, 8 and 9 are filled by /dev/sdl1, /dev/sdk1 and /dev/sdj1 respectively. This is what mdadm --create needs to know to put the array back together in the order it was created.
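Counting lines by hand is easy to get wrong, so the device names can be paired with their roles directly. This is a minimal sketch, assuming each device's section in raid.status starts with a "/dev/...:" header line, as mdadm --examine prints it, and contains a "Device Role" line; slot_map is a hypothetical helper name.

```shell
# Hypothetical helper: print "<device><TAB><role>" for every device
# recorded in a saved `mdadm --examine` dump.
slot_map() {
    awk '
        # Section header, e.g. "/dev/sdb1:"
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }
        # Role line, e.g. "   Device Role : Active device 0"
        /Device Role[ \t]*:/ {
            role = $0
            sub(/^.*:[ \t]*/, "", role)
            print dev "\t" role
        }
    ' "$1"
}
```

With the example array, slot_map raid.status would list /dev/sdl1 next to "Active device 7", which is the pairing worked out by hand above.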

One other problem you may run into is that different versions of mdadm apparently use different RAID device sizes; while creating a larger array than your filesystem won't hurt anything, creating a smaller one will definitely not work. To get the device size, use:

grep Used raid.status

which should give you output lines like the following for each device:

 Used Dev Size : 3907026688 (1863.02 GiB 2000.40 GB)

mdadm wants the device size in kibibytes, while the above is apparently in 512-byte sectors; dividing by 2 thus gives the device size: 3907026688 / 2 = 1953513344 KiB.
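The same division can be done from the saved dump. This is a minimal sketch, assuming the "Used Dev Size" lines have the layout shown above with the sector count as the fifth whitespace-separated field; used_size_kib is a hypothetical helper name.

```shell
# Hypothetical helper: convert every "Used Dev Size" in the dump from
# 512-byte sectors to KiB and print the distinct results.
used_size_kib() {
    grep 'Used Dev Size' "$1" |
    while read -r w1 w2 w3 colon sectors rest; do
        # Two 512-byte sectors make one KiB
        echo "$((sectors / 2))"
    done | sort -u
}
```

If used_size_kib raid.status prints more than one line, the devices do not agree on the size and that needs to be understood before a value is passed to --size.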

So, in our example case, the command to recreate the array was:

mdadm --create --assume-clean --level=6 --raid-devices=10 --size=1953513344 /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 missing missing /dev/sdl1 /dev/sdk1 /dev/sdj1

This duly told us what it found on the disks, and asked for confirmation. Please make sure you've preserved that mdadm --examine output before you give that confirmation, just in case you screw up. Once you're sure, go on and create the array. With any luck, your array will be created and assembled in degraded mode; we were then able to mount the ext4 filesystem and verify that we'd got it right before adding back in other devices as spares, triggering a conventional RAID recovery.

Also, check the chunk size from the raid.status file and if it is non-standard, use --chunk XXX when recreating the array.
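A minimal sketch of that check, assuming raid.status contains "Chunk Size : 512K"-style lines (the helper name chunk_sizes is hypothetical):

```shell
# Hypothetical helper: print the distinct chunk sizes recorded in the dump.
chunk_sizes() {
    awk -F': *' '/Chunk Size/ { print $2 }' "$1" | sort -u
}
```

A single line of output is what you would hope to see from chunk_sizes raid.status. The number (mdadm takes --chunk in kibibytes) is the value to pass explicitly if it differs from the default of the mdadm you are running.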

If upon running the above with the --size parameter you get, as one of the authors of this page did, an error such as: "mdadm: /dev/sdb1 is smaller than given size. xxxK < yyyK + metadata", you may have stumbled upon a problem where the array was initially created with an earlier version of mdadm that reserved less device space. The solution seems to be to find an earlier version of mdadm to run with the creation command above (in this author's case, mdadm from Debian "squeeze" worked while mdadm from Debian "wheezy" refused to recreate the array of the required size).
