Timeout Mismatch

From Linux Raid Wiki
Revision as of 23:43, 14 December 2017 by Anthony Youngman (Talk | contribs)

Back to Asking for help Forward to Easy Fixes

Most cheap modern desktop drives do not support any form of managed error recovery. This seems to have started when the typical drive size hit 1TB, so for drives over 1TB you should buy drives that are explicitly rated as suitable for RAID. Most drives of 1TB or less are okay, but you should check first. RAID-rated drives aren't that much more expensive. (For some strange reason, every 2 1/2" laptop drive I've come across does support it!)

As a quick summary of the problem: when the OS tries to read from the disk, it sends the command and waits. What should happen is that the drive returns the data successfully.

The proper sequence of events when something goes wrong is that the drive can't read the data and returns an error to the OS. The RAID code then calculates what the data should be and writes it back to the disk. Glitches like this are normal and, provided the disk isn't failing, the rewrite will correct the problem.

Unfortunately, desktop drives can take over two minutes to give up, while the Linux kernel will give up after 30 seconds. At that point, the RAID code recomputes the block and tries to write it back to the disk. The disk is still trying to read the data and fails to respond, so the RAID code assumes the drive is dead and kicks it from the array. This is how a single error with these drives can easily kill an array.
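To see what the kernel side of this mismatch is currently set to, you can read the per-device timeout from sysfs. The following is a minimal sketch, not part of the original page: the function name show_timeouts and its optional directory argument are made up for illustration, and the path is the same /sys/block/.../device/timeout file that the script further down this page writes to.

```shell
#!/bin/bash
# Print the kernel's command timeout (in seconds) for each sd? disk.
# The optional argument is the sysfs block directory. It defaults to
# /sys/block and exists only so the function can be pointed at a test tree.
show_timeouts() {
    local sysblock="${1:-/sys/block}" f dev
    for f in "$sysblock"/sd?/device/timeout; do
        [ -r "$f" ] || continue      # glob matched nothing, or file unreadable
        dev="${f#"$sysblock"/}"      # strip the leading directory
        dev="${dev%%/*}"             # keep only the device name, e.g. sda
        echo "$dev: $(cat "$f") seconds"
    done
}
```

Calling show_timeouts with no argument lists every /dev/sd? disk. A drive that has not been adjusted would normally show the 30 second kernel default described above.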

To check whether this is the case, look at the output you got from smartctl and see whether SCT Error Recovery Control is supported. If it isn't, this is your problem. On WD disks it may be called TLER. To look at just this parameter, you can use the following command (replace sdx with your drive):

smartctl -l scterc /dev/sdx

The following script was posted to the mailing list by Brad Campbell. Make sure it runs on every boot, because the cheaper drives especially forget any settings you make when the system is shut down. It increases the kernel timeout for all non-ERC drives. It also sets the ERC timeout on drives that do support it, as many older desktop drives with ERC have inappropriate default settings.

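#!/bin/bash
# Needs root: it changes drive settings and writes to sysfs.
for i in /dev/sd? ; do
    # Try to set the drive's ERC read and write timeouts to 7.0 seconds
    # (scterc values are in tenths of a second).
    if smartctl -l scterc,70,70 "$i" > /dev/null ; then
        echo -n "$i is good "
    else
        # ERC could not be set: raise the kernel's command timeout to 180 seconds.
        echo 180 > "/sys/block/${i#/dev/}/device/timeout"
        echo -n "$i is  bad "
    fi
    # Print the drive model so each result can be matched to a disk.
    smartctl -i "$i" | grep -E "(Device Model|Product:)"
    # Set read-ahead to 1024 sectors.
    blockdev --setra 1024 "$i"
done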

WARNING: This script does not work for all drives, although it seems to be older, 2010-era ones that fail. The smartctl command attempts to set the ERC timeout to 7 seconds (the value 70 is in tenths of a second). This should either succeed and return 0, or fail and return an error code. Unfortunately, for drives that do not support SCT at all, the attempt to set ERC fails but still returns 0, fooling the script into reporting the drive as good. Whenever you get a new drive, you should make sure it behaves as expected.
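One way to check a new drive is to read the setting back and look at what smartctl prints, rather than trusting the exit code. The sketch below is only an illustration under stated assumptions: the function name classify_erc is made up, and the strings it matches ("Read:" followed by a number, or by "Disabled") are assumed from smartctl output I have seen. Compare them with what your own smartctl version prints before relying on it.

```shell
#!/bin/bash
# Classify the text printed by "smartctl -l scterc DEVICE", read from stdin.
# Prints one of: enabled, disabled, unsupported.
# ASSUMPTION: the patterns below match lines such as
#   "Read:     70 (7.0 seconds)"  and  "Read: Disabled".
# Anything else, including a "not supported" message, counts as unsupported.
classify_erc() {
    local out
    out="$(cat)"
    if printf '%s\n' "$out" | grep -Eq 'Read:[[:space:]]+[0-9]+ \('; then
        echo enabled
    elif printf '%s\n' "$out" | grep -Eq 'Read:[[:space:]]+Disabled'; then
        echo disabled
    else
        echo unsupported
    fi
}
```

After attempting to set the timeout with smartctl -l scterc,70,70 /dev/sdx, run smartctl -l scterc /dev/sdx | classify_erc. Any result other than "enabled" suggests the drive still needs the 180 second kernel timeout.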

[TODO: Discuss assembling a broken array with --force]

The following links-to-email have been collected by Phil Turmel as background reading to the problem. Read the entire threads if you have time.

http://marc.info/?l=linux-raid&m=139050322510249&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=132477199207506
http://marc.info/?l=linux-raid&m=133665797115876&w=2
http://marc.info/?l=linux-raid&m=142487508806844&w=3
http://marc.info/?l=linux-raid&m=144535576302583&w=2

[TODO: link to gmane if/when it comes back]
