The Badblocks controversy

From Linux Raid Wiki

The Controversy

The Bad Blocks feature was added to MD raid because various things can leave a disk with unreadable data. Top of the list is a --replace operation that cannot fully read a failing drive. Tracking those blocks is a sensible way of handling the problem, except that it doesn't work, because the implementation is almost certainly buggy.

So much so, that various list members think that the feature should not be enabled by default, and indeed should be removed from MD raid altogether.

This is almost certainly an over-reaction because, properly debugged, this feature would make a failing array much more robust than it is at present. However, as it stands, the implementation causes a fair amount of grief because the Bad Blocks list seems to get copied where it shouldn't be, and hangs around when it should be cleared.

What is a Bad Block List?

In the "good old days" of dumb drives, and not very intelligent computers, the computer had to manage the disk all by itself. All drives ship with some manufacturing defects, so the computer had to keep a list of all the blocks that were unreliable, and when saving files, it had to avoid those blocks. This list itself was stored in a standard location on the drive, so of course if the standard location itself was unreadable, the drive was scrap.

Move forward to today, and drives are intelligent - quite often they are actually running Minix, Linux, or some other OS in their firmware, dedicated solely to managing the disk. They manage the bad block list, they manage the CHS or LBA references, and as far as the computer using the drive is concerned, it just sees a contiguous run of drive sectors that don't have any errors. Behind the scenes, however, if there is a physical problem with the disk, the firmware reshuffles where the data is stored and hides this from the drive's user. The user may see a read error and have to deal with the loss of the data on that sector, but when they come to write to that sector again, the drive will have remapped it to a different physical location, so the sector appears perfect again.

Note however that sectors can become unreadable for several reasons, not just because the recording medium has failed. Writing a neighbouring sector can disturb the magnetism, or the magnetism may simply fade with the passage of time, so that the sector itself is still perfectly okay - only the data on it has been trashed. This means that badblock lists should come in two forms: a temporary list of sectors that could not be read, and a permanent list of sectors where a "write then verify" fails, indicating that the sector cannot be successfully written. This obviously causes problems should a fault in the magnetic layer mean the data decays over a period of hours or days, but unfortunately the only way to handle that scenario is manual intervention.

The MD badblocks list

Because the raid layer has no way of knowing about the filesystem layer above it, it needs some way of passing an error up the chain. The usual cause of MD bad blocks is when a management action at the MD level fails because of problems in the layer below.

A typical action would be a --replace. If it can, it makes sense for --replace to just stream data from the old drive to the new. If it hits a read error, it reads the entire stripe and tries to recreate the missing data. If that fails, then the stripe is corrupt. The Bad Blocks feature provides a way of telling the layer above that the data is corrupt - the block is entered into the list and when the layer above tries to read it, the MD layer knows to return an error. At this point, the user knows they have a corrupt file.
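The reconstruction step described above can be sketched as follows. This is a simplified illustration of how a RAID-5 stripe's missing chunk is recreated by XOR - it is not MD's actual code, and the function names are hypothetical:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def recover_chunk(surviving_chunks):
    """Recreate the one missing chunk in a RAID-5 stripe.

    The parity chunk is the XOR of all data chunks, so XORing every
    surviving chunk (data and parity alike) yields the missing one.
    """
    return xor_blocks(surviving_chunks)

# Example: a stripe of 3 data chunks plus 1 parity chunk.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
parity = xor_blocks(data)

# Lose data[1] to a read error; recover it from the other chunks.
recovered = recover_chunk([data[0], data[2], parity])
assert recovered == data[1]
```

If a second chunk in the same stripe is also unreadable, this XOR has nothing to work with - which is exactly the situation where the block gets entered into the Bad Blocks list.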

Seeing as MD is a virtual layer, all Bad Blocks should be temporary. Any attempt to rewrite a corrupt stripe should succeed, leading to the Bad Block being removed from the list, so even if it is enabled you would expect the Bad Block List to be empty most of the time.

This feature would also be incredibly useful when rescuing a disk with an external tool like ddrescue, which produces a list of the blocks it could not recover. If there were a utility that could convert the ddrescue list into an MD Bad Blocks list, it would help recover many arrays. This would be especially useful when several drives have suffered failures and can be ddrescued: provided no two (or three for raid-6) drives have suffered failures in the same stripe, a scrub would then suffer a partial read failure and be able to recreate the missing data.
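No such utility exists, but the conversion it would perform can be sketched. The following assumes the GNU ddrescue mapfile format (data lines of "0xPOS 0xSIZE STATUS", with '-', '*' and '/' marking unrecovered regions) and emits (sector, length) pairs in 512-byte sectors, the units MD's sysfs bad_blocks files use; the function name and the sample mapfile are invented for illustration:

```python
def ddrescue_bad_extents(mapfile_text, sector_size=512):
    """Parse a ddrescue mapfile and return (sector, length) pairs
    covering every region that was not successfully recovered."""
    lines = [l for l in mapfile_text.splitlines()
             if l.strip() and not l.startswith('#')]
    extents = []
    for line in lines[1:]:  # the first non-comment line is the status line
        pos, size, status = line.split()[:3]
        if status in ('-', '*', '/'):
            start = int(pos, 0) // sector_size
            # Round the end of the region up to a whole sector.
            end = -(-(int(pos, 0) + int(size, 0)) // sector_size)
            extents.append((start, end - start))
    return extents

mapfile = """# Mapfile. Created by GNU ddrescue
0x00000000     ?     1
0x00000000  0x1000  +
0x00001000  0x0200  -
0x00001200  0x2000  +
"""
print(ddrescue_bad_extents(mapfile))   # [(8, 1)]
```

A real utility would then have to translate these whole-device offsets into per-member data offsets before feeding them to MD, which is the fiddly part this sketch leaves out.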

How do I fix a Bad Blocks problem?

The first thing is to check the integrity of your file system. A command like "tar cf - / > /dev/null" will read the entire file system and tell you if any files are unreadable. It should also clear any Bad Blocks that cover file data but are recoverable. However, there is a known bug: that doesn't always happen.

But the Bad Blocks may be in an unallocated portion of the file system. If you wish to clear those, try a command like "cat /dev/zero > /tempfile; rm /tempfile". This will fill all your spare disk space with zeroes, then delete the file it used to do so.

Once both these things have been done, your Bad Blocks list should be empty. However, both these commands are very disk-heavy and will take a very long time on a modern array. And since the code is strongly suspected to be buggy, these commands may well not work.

If you are satisfied that everything is okay, and you don't want the Bad Blocks functionality, the easy way to get rid of it (if you have no Bad Blocks list to clear) is "mdadm ... --assemble --update=no-bbl".

If, however, you do have an active Bad Blocks list with sectors in it, this command won't work. You can use the command "mdadm ... --assemble --update=force-no-bbl" to delete the list, but this means that reads which previously failed with an error will probably now return garbage. If you're satisfied that your file system is intact, though, this won't matter to you.

Known Bad Blocks bugs

As mentioned above, bad blocks are often not cleared from the temporary list after a successful write. MD also has a permanent list, which should ALWAYS be empty: seeing as MD is a virtual layer, anything in that list is indicative of a deeper problem.

The list is meant to be drive-specific, but all too often it seems that identical lists appear on several drives. Something is copying the list from one drive to another, and this is not a sensible action.
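Checking for this symptom can be done by hand. The sketch below assumes the standard MD sysfs layout (/sys/block/mdX/md/dev-*/bad_blocks, one "sector length" pair per line); the function names are invented, and the canned sample data at the bottom just demonstrates the comparison without needing a real array:

```python
import glob

def read_bad_blocks_lists(md="md0"):
    """Read each member device's bad_blocks list from sysfs.

    Returns {path: list_text}. Needs root and an assembled array.
    """
    lists = {}
    for path in glob.glob(f"/sys/block/{md}/md/dev-*/bad_blocks"):
        with open(path) as f:
            lists[path] = f.read().strip()
    return lists

def find_duplicate_lists(lists):
    """Group devices whose non-empty bad_blocks lists are identical -
    the symptom of the copying bug described above."""
    by_content = {}
    for dev, text in lists.items():
        if text:
            by_content.setdefault(text, []).append(dev)
    return [devs for devs in by_content.values() if len(devs) > 1]

# Canned example: sda1 and sdb1 carry an identical list - suspicious,
# since genuine media errors on two drives would not line up like this.
sample = {"dev-sda1": "2048 8\n4096 16",
          "dev-sdb1": "2048 8\n4096 16",
          "dev-sdc1": ""}
print(find_duplicate_lists(sample))   # [['dev-sda1', 'dev-sdb1']]
```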
