Timeout Mismatch

From Linux Raid Wiki
Revision as of 23:00, 22 September 2016 by Anthony Youngman (Talk | contribs)

Jump to: navigation, search
Back to Asking for help Forward to Reconstruction

Most cheap modern drives do not support some form of managed error recovery. This seems to have started when the typical drive size hit 1TB. So for drives over 1TB, you should buy drives that are explicitly suitable for RAID. Most drives of 1TB or less are okay, but you should check first. RAID-rated drives aren't that much more expensive.

For a quick summary of the problem, when the OS tries to read from the disk, it sends the command and waits. What should happen is that drive returns the data successfully.

The proper sequence of events when something goes wrong is that the drive can't read the data, and it returns an error to the OS. The raid code then calculates what the data should be, and writes it back to the disk. Glitches like this are normal and, provided the disk isn't failing, this will correct the problem.

Unfortunately, with desktop drives, they can take over two minutes to give up, while the linux kernel will give up after 30 seconds. At which point, the RAID code recomputes the block and tries to write it back to the disk. The disk is still trying to read the data and fails to respond, so the raid code assumes the drive is dead and kicks it from the array. This is how a single error with these drives can easily kill an array.

To check whether this is the case, look at the output you got from smartctl, and see whether SCT Error Recovery Control is supported. If it isn't, this is your problem. On WD disks it may be called TLER. To just look at this parameter, you can use the command

smartctl -l scterc /dev/sdx

The following script was posted to the mailing list by Brad Campbell. Make sure it runs on every boot - the cheaper drives especially forget any settings you may make when the system is shut down. It will increase the timeout for all non-ERC drives.

  1. !/bin/bash

for i in /dev/sd? ; do

   if smartctl -l scterc,70,70 $i > /dev/null ; then
       echo -n $i " is good "
   else
       echo 180 > /sys/block/${i/\/dev\/}/device/timeout
       echo -n $i " is  bad "
   fi;
   smartctl -i $i | egrep "(Device Model|Product:)"
   blockdev --setra 1024 $i

done

[TODO: Discuss assembling a broken array with --force]

The following links-to-email have been collected by Phil Turmel as background reading to the problem. Read the entire threads if you have time.

http://marc.info/?l=linux-raid&m=139050322510249&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=132477199207506
http://marc.info/?l=linux-raid&m=133665797115876&w=2
http://marc.info/?l=linux-raid&m=142487508806844&w=3
http://marc.info/?l=linux-raid&m=144535576302583&w=2

[TODO: link to gmane if/when it comes back]

Back to Asking for help Forward to Reconstruction
Personal tools