Scrubbing the drives

From Linux Raid Wiki

Revision as of 18:36, 7 January 2017

Back to A guide to mdadm Forward to Monitoring your system

Any array should be "scrubbed" at regular intervals. Scrubbing reads the entire array, so that any latent problem with a drive triggers a read error and auto-correction, and any inconsistency in the data is picked up. It's controlled by writing to the "sync_action" file in /sys, where mdX stands for your array:

echo check > /sys/block/mdX/md/sync_action

This causes the system to read the entire array looking for read errors. If a read error is encountered, the data for the block in error is reconstructed from the other drives and written back. If the array is a mirror, there is nothing to calculate, so it will take the data from another (available) drive and write it back to the dodgy drive. Drives are designed to handle this - if necessary the disk sector will be reallocated to a spare. A check also compares the data against its redundancy (the other copies, or the parity), but mismatches are only counted in "mismatch_cnt", not corrected. SMART should be used to track how many sectors have been reallocated, as a growing count can be a sign of a failing disk that will need replacing.
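As a minimal sketch of the SMART tracking mentioned above, the snippet below pulls the raw reallocated-sector count out of the attribute table printed by "smartctl -A" (from smartmontools). The function name reallocated_sectors is made up for illustration, /dev/sdX is a placeholder, and it assumes the drive reports the common ATA attribute Reallocated_Sector_Ct with its raw value in the tenth column; not every drive does.

```shell
#!/bin/sh
# Print the raw Reallocated_Sector_Ct value found in `smartctl -A` output
# read from standard input. Prints nothing if the attribute is not reported.
# (Hypothetical helper; the column layout is an assumption about the drive.)
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Usage (needs root and smartmontools; /dev/sdX is a placeholder):
#   smartctl -A /dev/sdX | reallocated_sectors
```

Recording this number after each scrub makes it easy to spot a drive whose count keeps climbing.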

echo repair > /sys/block/mdX/md/sync_action

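This does everything a check does, and in addition corrects any mismatch it finds between the data and its redundancy rather than just counting it. It is not limited to parity raids; md(4) describes it for mirrors too. It will terminate immediately without doing anything if the array is degraded to the point where no redundancy is left, as there is then nothing to compare the data against.

According to md(4), a repair handles a mismatch the same way a resync does. For raid-5 and raid-6, the data blocks are assumed to be correct and new parity blocks are written. For raid-1 and raid-10, all copies but one are overwritten with the content of that one. Note that for raid-6 the repair does not try to work out which drive is wrong; it simply rewrites the parity.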
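Before asking for a repair it is worth looking at what the last check found. The md driver keeps a "mismatch_cnt" file next to "sync_action", holding the number of sectors that the last check found inconsistent (or that the last repair rewrote). The following is only a sketch: the function name report_mismatches and its exit codes are invented for illustration, and the directory argument stands for /sys/block/mdX/md.

```shell
#!/bin/sh
# report_mismatches MD_DIR
# MD_DIR is the array's md directory, e.g. /sys/block/mdX/md (placeholder).
# Exit status: 0 = no mismatches, 1 = mismatches found, 2 = cannot tell.
# (Hypothetical helper; the exit-code convention is an assumption.)
report_mismatches() {
    md_dir=$1
    action=$(cat "$md_dir/sync_action") || return 2
    if [ "$action" != "idle" ]; then
        # The counter is still changing while a check or repair is running.
        echo "array is busy ($action); mismatch_cnt is not final yet" >&2
        return 2
    fi
    count=$(cat "$md_dir/mismatch_cnt") || return 2
    echo "mismatch_cnt=$count"
    # Succeeds only when the last scan found nothing to fix.
    [ "$count" -eq 0 ]
}

# Usage, once a check has finished:
#   report_mismatches /sys/block/mdX/md || echo "consider a repair"
```

A non-zero count is not automatically a disaster, but it is the signal to investigate before (or instead of) blindly writing "repair".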

echo idle > /sys/block/mdX/md/sync_action

This will stop any check or repair that is currently in progress. It is okay to do so: both of them normally only read the data, and write only when they find a problem, so interrupting them won't leave the array in an inconsistent state.
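A slightly more careful way to stop a scrub is to look at what the array is doing first, so that a resync or rebuild is not interrupted by accident. The sketch below reads "sync_action" and the neighbouring "sync_completed" file (sectors done / total) before writing "idle". The function name stop_scrub is invented for illustration, and the directory argument stands for /sys/block/mdX/md.

```shell
#!/bin/sh
# stop_scrub MD_DIR
# MD_DIR is the array's md directory, e.g. /sys/block/mdX/md (placeholder).
# Stops a running check or repair; leaves any other activity alone.
# (Hypothetical helper, not part of mdadm.)
stop_scrub() {
    md_dir=$1
    action=$(cat "$md_dir/sync_action") || return 1
    case $action in
        check|repair)
            # Report how far the scan got before stopping it.
            echo "stopping $action at $(cat "$md_dir/sync_completed")"
            echo idle > "$md_dir/sync_action"
            ;;
        *)
            # idle, resync, recover, reshape...: nothing for us to stop.
            echo "no check or repair running (sync_action=$action)"
            ;;
    esac
}

# Usage (needs root):
#   stop_scrub /sys/block/mdX/md
```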

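NOTE: Desktop drives are commonly specified at one unrecoverable read error per 10^14 bits read, which is about 12.5TB. So on a system with an array of 10TB or more, the manufacturers' specifications state that a desktop drive is still healthy if you get, on average, nearly one read error per check. If you're using desktop drives and you haven't fixed the timeout problem (the drive spending longer on error recovery than the kernel is prepared to wait), a check can get drives kicked out and bring down your array despite all the drives being nominally healthy!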
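The arithmetic behind this note can be sketched as follows. It assumes a drive rated at one unrecoverable read error per 10^N bits read; N = 14 is a figure commonly quoted for desktop drives and N = 15 for enterprise drives, but check the data sheet for your own drives. The function name tb_per_ure is made up for illustration.

```shell
#!/bin/sh
# tb_per_ure EXPONENT
# Terabytes (10^12 bytes) read, on average, per unrecoverable read error
# for a drive rated at one error per 10^EXPONENT bits.
# (Hypothetical helper; the ratings used below are assumptions.)
tb_per_ure() {
    awk -v e="$1" 'BEGIN { printf "%.1f\n", 10^e / 8 / 1e12 }'
}

tb_per_ure 14   # assumed desktop-class rating
tb_per_ure 15   # assumed enterprise-class rating
```

If the amount of data read in one check is close to the first figure, a read error during a scrub is to be expected rather than a sign of a bad drive.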

If the system is rebooted or otherwise interrupted while a scan is in progress, the scan will restart from the beginning when the array comes back up. Improving on this is a Programming project.
