Monitoring your system

From Linux Raid Wiki

Revision as of 18:08, 7 January 2017


Many of the horror stories that reach the linux-raid mailing list come down to a simple lack of monitoring. Granted, it's not unknown for several disks to fail simultaneously, and if your array was built from a bunch of drives all bought at the same time, the odds of that happening are painfully high: disks from the same batch tend to have similar lifetimes.

But all too often, an array has been running in a degraded state for months, and then a disk fails and tips the array over the edge. The author's brother told him of a raid array, bought and placed in a colo facility, where a technician just happened to walk past and spot two red lights! The raid 6 array had two failed drives! This should never have happened, and of course there was a mad panic while they tried to safely replace the dead drives.


Monitoring Tools

/proc/mdstat

Get to know /proc/mdstat, and look at it often. It tells you the state of your arrays and, most importantly, whether any drives have failed and whether any arrays are degraded. Check, and check regularly!
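As a rough sketch of what "checking" can look like, the function below lists arrays that mdstat reports as degraded. It assumes the usual mdstat layout, where each array starts with a line like "md0 : active ..." and its member status appears as a bracketed string such as [UU] or [_UU], with an underscore marking a missing member. The function name degraded_arrays is made up for this example, and the optional file argument exists only so that a saved copy of mdstat can be checked too.

```shell
# Print the name of every array whose member status string contains "_",
# e.g. [_UU]. Reads /proc/mdstat unless another file is given as $1.
# (degraded_arrays is a hypothetical helper name, not part of mdadm.)
degraded_arrays() {
    awk '
        # Remember the array name from lines such as "md0 : active raid1 ..."
        /^md[^ ]* :/ { name = $1 }
        # A status string made only of U and _ with at least one _ means degraded
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays with no argument on a machine with md arrays prints nothing when all is well. Any output is a reason to look at /proc/mdstat by hand straight away.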

xosview

xosview is a venerable utility, and one of the author's favourites. It can display the state of raid arrays, but unfortunately that code is currently broken: it reads mdstat and doesn't understand the current output format. As of 2016 it is being updated to read the status directly from /sys, and should hopefully soon be able to display raid status correctly. The author leaves xosview running permanently on his desktop to provide an overview of system performance.
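For the curious, here is a minimal sketch of reading the same information from /sys by hand. It relies on the md sysfs attribute "degraded" (found at /sys/block/mdX/md/degraded for redundant arrays), which holds the number of devices by which the array is degraded. This is not how xosview does it, just an illustration. The function name sysfs_degraded and its optional root argument are inventions for this example.

```shell
# Print "<array> <count>" for every md array that sysfs reports as degraded.
# $1 is the sysfs mount point (default: /sys).
# (sysfs_degraded is a hypothetical helper name.)
sysfs_degraded() {
    root=${1:-/sys}
    for f in "$root"/block/md*/md/degraded; do
        [ -r "$f" ] || continue            # glob matched nothing, or array has no such attribute
        count=$(cat "$f")
        if [ "$count" -gt 0 ] 2>/dev/null; then
            array=${f#"$root"/block/}      # strip the leading path
            array=${array%%/*}             # keep only the mdX component
            echo "$array $count"
        fi
    done
}
```

Arrays without redundancy (raid0, linear) do not expose this attribute, so the sketch simply skips them.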

mdadm

mdadm --monitor --scan --mail a@b.co.uk

This will fire up mdadm to keep an eye on your arrays, sending an email to the specified address if it detects a problem such as a disk failure. Add --daemonise if you want it to detach and run in the background. This is good for remote monitoring, BUT it won't tell you if anything goes wrong with the monitoring itself! Even if you put this in your boot-up sequence, as you should, you cannot assume that you will be notified about important events. It's not unknown for the daemon to fail.

Don't rely on this! Check regularly on a manual basis!
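One thing a manual check can include is confirming that the monitor is actually still alive. The sketch below looks through the process list for an mdadm command line containing --monitor (or its aliases --follow and -F). The function names is_monitor_cmdline and check_monitor are invented for this example, and the pattern is an assumption about how mdadm was started on your system, so adjust it if your distribution wraps mdadm differently.

```shell
# Succeed if any command line on stdin looks like an mdadm monitor process.
# (is_monitor_cmdline and check_monitor are hypothetical helper names.)
is_monitor_cmdline() {
    grep -Eq '(^|/)mdadm( .*)? (--monitor|--follow|-F)( |$)'
}

# Report whether a monitor is currently running on this machine.
check_monitor() {
    if ps -eo args= | is_monitor_cmdline; then
        echo "mdadm monitor is running"
    else
        echo "WARNING: no mdadm monitor process found" >&2
        return 1
    fi
}
```

A running daemon still doesn't prove that mail gets through. Running mdadm --monitor --scan --oneshot --test generates a test alert for each array and then exits, which exercises the whole notification path. Neither check replaces looking at the arrays yourself.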

Analysing a Disk Failure
