Initial Array Creation

Initial Creation

When mdadm asks the kernel to create a raid array, the most noticeable activity is what's called the "initial resync".

Raid level 0 doesn't have any redundancy so there is no initial resync.

For raid levels 1, 4, 6 and 10, mdadm creates the array and starts a resync. The raid algorithm then reads the data blocks and writes the appropriate parity or mirror (P+Q) blocks across all the relevant disks. There is some sample output in the "Show me..." section below.
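
As a rough sketch (the device names here are made up, adjust to suit), creating a two-disk raid1 array and then looking at /proc/mdstat shows the initial resync in progress:

  # create a two-disk raid1 array - /dev/sdb1 and /dev/sdc1 are example devices
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1

  # watch the initial resync progress
  cat /proc/mdstat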

For raid5 there is an optimisation: mdadm takes one of the disks, marks it as 'spare', and creates the array in degraded mode. The kernel marks the spare disk as 'rebuilding', reads from the 'good' disks, calculates what should be on the spare disk (data or parity), and just writes it out.

Once all this is done the array is clean and all disks are active.

This can take quite a while, and the array is not fully resilient while this is happening (it is, however, fully usable).

--assume-clean

Some people have noticed the --assume-clean option in mdadm and speculated that this can be used to skip the initial resync. Which it does. But this is a bad idea in some cases - and a *very* bad idea in others.

raid5

For raid5 especially it is NOT safe to skip the initial sync. The raid5 implementation optimises use of the component disks and it is possible for all updates to be "read-modify-write" updates which assume the parity is correct. If it is wrong, it stays wrong. Then when you lose a drive, the parity blocks are wrong so the data you recover using them is wrong. In other words - you will get data corruption.

For raid5 on an array with more than 3 drives, if you attempt to write a single block, the kernel will:

  • read the current value of the block, and the parity block.
  • "subtract" the old value of the block from the parity, and "add" the new value.
  • write out the new data and the new parity.

If the parity was wrong before, it will still be wrong. If you then lose a drive, you lose your data.
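
To see why the error persists, here is a minimal sketch of the "subtract"/"add" step on a single byte (the values are made up; real arrays work on whole blocks, but the XOR arithmetic is the same):

  # read-modify-write parity update, illustrated on one byte
  old_data=$(( 0x5a ))     # current contents of the data block
  new_data=$(( 0x3c ))     # value we want to write
  old_parity=$(( 0x77 ))   # whatever is currently in the parity block

  # "subtract" the old data and "add" the new data - both are XOR
  new_parity=$(( old_parity ^ old_data ^ new_data ))
  printf 'new parity: 0x%02x\n' "$new_parity"

  # if old_parity was not really the XOR of the data blocks, new_parity
  # is still wrong - the error is carried forward, never corrected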

Other raid levels

These raid levels do not strictly need an initial sync.

linear and raid0 have no redundancy.

raid1 always writes all data to all devices.

raid10 always writes each block to as many devices as the number of copies the array holds. For example, on a 6-disk raid10,f2 or raid10,o2 array, the data will only be written 2 times.
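
For reference, the layout is chosen at creation time; something like this (example devices again) would give the 6-disk, 2-copy 'far' layout mentioned above:

  # 6-disk raid10 with 2 copies in the 'far' layout (raid10,f2)
  mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=6 /dev/loop[1-6]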

For raid6 it is also safe to not sync first. Raid6 always updates parity by reading all blocks in the stripe that aren't already known and calculating P and Q, so the first write to a stripe will make P and Q correct for that stripe. This is current behaviour; there is no guarantee it will never change (so in theory you may one day upgrade your kernel and suffer data corruption on an old raid6 array if you did not do an initial resync).

Probably the most noticeable effect for the other raid levels is that if you don't sync first, then every check will find lots of errors. (Of course you could 'repair' instead of 'check'. Or do that once. Or something. Or just do an initial resync which is why it was written like that in the first place...)
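
If you did skip the initial sync, a later check or repair can be kicked off through sysfs; something along these lines (md0 is just an example name):

  # count parity/mirror mismatches without changing anything
  echo check > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt

  # or rewrite the parity/mirror blocks so they become consistent
  echo repair > /sys/block/md0/md/sync_action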

Summary

In summary, it is 'safe' to use --assume-clean on raid levels other than raid5, though a "repair" is recommended before too long.

Potential 'Solutions'

There have been 'solutions' suggested, including the use of bitmaps to efficiently store 'not yet synced' information about the array. It would be possible to have a 'this is not initialised' flag on the array and, while it is set, always do a reconstruct-write rather than a read-modify-write. But the first time you have an unclean shutdown you are going to resync all the parity anyway (unless you have a bitmap....) so you may as well resync at the start. So essentially, at the moment, there is no interest in implementing this, since the added complexity is not justified.
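
For what it's worth, a write-intent bitmap of the kind mentioned above can already be added to an existing array (it helps with resyncs after unclean shutdowns, not with the initial sync):

  # add an internal write-intent bitmap to an existing array
  mdadm --grow /dev/md0 --bitmap=internal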

What's the problem anyway?

First of all RAID is all about being safe with your data.

And why is it such a big deal anyway? The initial resync doesn't stop you from using the array. If you wanted to put an array into production instantly and couldn't afford any slowdown due to resync, then you might want to skip the initial resync.... but is that really likely?

Also, the resync backs off whenever real IO operations are undertaken or are in the queue, so resyncing should have only a minimal effect on operations.
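
The throttling is governed by the kernel's raid speed limits; resync speed drops towards the minimum whenever there is other IO. You can inspect them like this:

  # limits are in KiB/sec per device
  cat /proc/sys/dev/raid/speed_limit_min
  cat /proc/sys/dev/raid/speed_limit_max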

So what is --assume-clean for then?

Disaster recovery. If you want to build an array from components that used to be part of a raid array, this stops the kernel from scribbling on them. As the man page says:

"Use this ony if you really know what you are doing."

It is also safe to use --assume-clean when benchmarking different raid configurations. Just be sure to rebuild your array without --assume-clean when you decide on your final configuration.
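
So a benchmarking run might look something like this (device names and geometry are only an example); mdadm will ask for confirmation on the second create because the devices already carry a superblock:

  # quick-and-dirty array for performance testing only
  mdadm --create /dev/md0 --level=5 --raid-devices=6 --assume-clean /dev/loop[1-6]

  # ... run the benchmarks ...

  # for the real array, drop --assume-clean and let the initial resync run
  mdadm --stop /dev/md0
  mdadm --create /dev/md0 --level=5 --raid-devices=6 /dev/loop[1-6]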

Show me...

OK, so here is the output from mdadm --detail /dev/md0 after a create for raid5:

/dev/md0:
        Version : 00.90.03
  Creation Time : Thu Jun 21 09:08:42 2007
     Raid Level : raid5
     Array Size : 511680 (499.77 MiB 523.96 MB)
  Used Dev Size : 102336 (99.95 MiB 104.79 MB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Jun 21 09:08:42 2007
          State : clean, degraded, recovering
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 1% complete

           UUID : ae81f5c6:b3447350:ad85bc46:aeece1cb
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       7        1        0      active sync   /dev/loop1
       1       7        2        1      active sync   /dev/loop2
       2       7        3        2      active sync   /dev/loop3
       3       7        4        3      active sync   /dev/loop4
       4       7        5        4      active sync   /dev/loop5
       6       7        6        5      spare rebuilding   /dev/loop6

/proc/mdstat shows:

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 loop6[6] loop5[4] loop4[3] loop3[2] loop2[1] loop1[0]
      511680 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUUU_]
      [>....................]  recovery =  2.0% (2176/102336) finish=0.7min speed=2176K/sec

unused devices: <none>

For raid6:

/dev/md0:
        Version : 00.90.03
  Creation Time : Thu Jun 21 09:11:31 2007
     Raid Level : raid6
     Array Size : 409344 (399.82 MiB 419.17 MB)
  Used Dev Size : 102336 (99.95 MiB 104.79 MB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Jun 21 09:11:31 2007
          State : clean, resyncing
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

 Rebuild Status : 2% complete

           UUID : 68e6e3af:8a267353:599e31b0:5dfd6c58
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       7        1        0      active sync   /dev/loop1
       1       7        2        1      active sync   /dev/loop2
       2       7        3        2      active sync   /dev/loop3
       3       7        4        3      active sync   /dev/loop4
       4       7        5        4      active sync   /dev/loop5
       5       7        6        5      active sync   /dev/loop6

and /proc/mdstat:

Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 loop6[5] loop5[4] loop4[3] loop3[2] loop2[1] loop1[0]
      409344 blocks level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      [>....................]  resync =  2.0% (2176/102336) finish=0.7min speed=2176K/sec

unused devices: <none>