Scrubbing the drives

Latest revision as of 17:34, 10 January 2017

Any array should be "scrubbed" at regular intervals. Scrubbing reads the entire array, so that any problem with a drive triggers a read error and auto-correction, and any problem with the data is picked up. It is controlled by writing to the "sync_action" parameter in /sys:

echo check > /sys/block/mdX/md/sync_action

This causes the system to read the entire array looking for read errors. If a read error is encountered, the contents of the failed block are recalculated from the rest of the stripe and written back. If the array is a mirror, there is nothing to calculate from, so the data is taken from the first available drive and written back to the dodgy drive. Drives are designed to handle this - if necessary the disk sector will be re-allocated. SMART should be used to track how many sectors have been re-allocated, as a growing count can be a sign of a failing disk that will need replacing.
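As a rough illustration of the above, here is a minimal sketch that starts a check only when the array is idle, and reads back the mismatch counter afterwards. The function names and the SYSFS_ROOT variable are made up for this example (SYSFS_ROOT only exists so the functions can be tried against a scratch directory). The "mismatch_cnt" file is not described on this page; it lives next to "sync_action" and reports how many sectors the last check found to be inconsistent.

```shell
#!/bin/sh
# Sketch only: start a scrub on one md array and report the result.
# SYSFS_ROOT is a hypothetical override for experimenting safely;
# on a real system leave it unset so that /sys is used.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the md control directory for an array name such as md0.
md_dir() {
    printf '%s/block/%s/md\n' "$SYSFS_ROOT" "$1"
}

# Request a check, but only if nothing else is running on the array.
start_check() {
    dir=$(md_dir "$1")
    state=$(cat "$dir/sync_action") || return 1
    if [ "$state" != "idle" ]; then
        echo "$1: busy ($state), not starting a check" >&2
        return 1
    fi
    echo check > "$dir/sync_action"
}

# Print the mismatch count recorded by the most recent check.
show_mismatches() {
    cat "$(md_dir "$1")/mismatch_cnt"
}
```

Typical use would be "start_check md0" as root, watching progress in /proc/mdstat, and then "show_mismatches md0" once the array has gone back to idle.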

echo repair > /sys/block/mdX/md/sync_action

Only valid for parity raids - this will also check the integrity of the data as it reads it, and rewrite a corrupt stripe. It will terminate immediately without doing anything if the array is degraded, as it cannot recalculate the faulty data.

DO NOT run this on raid-6 without making sure that it is the correct thing to do.

With a raid-5 array the only thing that can be done when there is an error is to correct the parity. This is also the most likely error - the scenario where the data has been flushed and the parity not updated is the expected cause of problems like this.

With raid-6, it is possible to detect which block is corrupt - certainly on a single-bit error. However, "repair" will not correct this sort of error. Be very careful using it - it may just rewrite both parities and leave any corruption in place. The reason for this behaviour is that there is no easily detectable cause for data error and the correct repair strategy needs user intervention. There is a utility "raid6check" that you should use if "check" flags data errors on a raid-6.
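The warnings above can be turned into a small guard around the "repair" command. This is a minimal sketch, not a recommended procedure: the function name, the SYSFS_ROOT variable and the "--i-checked-raid6" flag are all invented for this example. It assumes the "level" and "degraded" files that sit alongside "sync_action" in the array's md directory.

```shell
#!/bin/sh
# Sketch only: refuse to start a repair in the cases this page warns about.
# SYSFS_ROOT is a hypothetical override for experimenting safely;
# on a real system leave it unset so that /sys is used.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Usage: guarded_repair mdX [--i-checked-raid6]
# The flag is a made-up acknowledgement, not an mdadm or kernel option.
guarded_repair() {
    dir="$SYSFS_ROOT/block/$1/md"
    level=$(cat "$dir/level") || return 1
    degraded=$(cat "$dir/degraded") || return 1

    # A degraded array has nothing to recalculate the faulty data from.
    if [ "$degraded" != "0" ]; then
        echo "$1: array is degraded, not repairing" >&2
        return 1
    fi

    # On raid-6, repair may rewrite both parities over corrupt data.
    if [ "$level" = "raid6" ] && [ "${2:-}" != "--i-checked-raid6" ]; then
        echo "$1: raid-6, look at raid6check first" >&2
        return 1
    fi

    echo repair > "$dir/sync_action"
}
```

The point of the explicit flag is simply to force a pause on raid-6; it does not make the repair any safer by itself.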

echo idle > /sys/block/mdX/md/sync_action

This will stop any check or repair that is currently in progress. It is okay to do so, as both of them normally only ever read the data, so interrupting them won't leave the system in an inconsistent state.
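Since the statement above only covers check and repair, a cautious script might look at what is running before writing "idle". The following is a minimal sketch under that assumption; the function name and the SYSFS_ROOT variable are made up for this example.

```shell
#!/bin/sh
# Sketch only: stop a scrub, but leave any other operation alone.
# SYSFS_ROOT is a hypothetical override for experimenting safely;
# on a real system leave it unset so that /sys is used.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

stop_scrub() {
    file="$SYSFS_ROOT/block/$1/md/sync_action"
    state=$(cat "$file") || return 1
    case "$state" in
        check|repair)
            # These only read the data in normal operation, so stopping is safe.
            echo idle > "$file"
            ;;
        idle)
            # Nothing is running, so there is nothing to stop.
            ;;
        *)
            # Anything else (a resync, for example) is not a scrub.
            echo "$1: '$state' in progress, leaving it alone" >&2
            return 1
            ;;
    esac
}
```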

NOTE: On a system with an array of 10TB or more, the manufacturers' specifications state that a desktop drive is healthy if you get, on average, fewer than one read error per check. So if you're using desktop drives and you haven't fixed the timeout error, a check can crash your array despite all the drives being nominally healthy!
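The arithmetic behind this note can be sketched as follows. It assumes an unrecoverable read error rate of one per 10^14 bits read, a value often quoted on desktop drive datasheets but not stated on this page, so check the datasheet for your own drives. The function name is made up for this example.

```shell
#!/bin/sh
# Sketch only: expected number of read errors when reading a whole array.
# $1 = amount of data read, in TB (10^12 bytes)
# $2 = bits read per unrecoverable error, from the drive datasheet
#      (1e14 is an assumed, commonly quoted desktop figure)
expected_read_errors() {
    awk -v tb="$1" -v bits="$2" \
        'BEGIN { printf "%.2f\n", tb * 1e12 * 8 / bits }'
}
```

With those assumptions, "expected_read_errors 10 1e14" prints 0.80, which is where "fewer than one read error per check" comes from for an array of about this size.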

If the system is rebooted or otherwise interrupted while a scan is in progress, the scan will resume from the beginning when the array comes back up. Improving on this is a Programming project.
