RAID superblock formats


Currently, the Linux kernel RAID subsystem recognizes two superblock formats, known as the "version-0.90" and "version-1" superblock formats.

A Note about kernel autodetection of different superblock formats

Current Linux kernels (as of 2.6.28) can only autodetect (based on partition type being set to FD) arrays with superblock version 0.90.

The LILO boot loader can likewise boot only from arrays that use the version-0.90 superblock. Alternative boot loaders, GRUB specifically, probably don't have this particular limitation.

As a workaround for the kernel-autodetection issue, several distributions (including Ubuntu and Fedora, circa early 2009) include init scripts that start any arrays not started by autodetection, which can include arrays using the newer 1.x superblocks.

Using Fedora 9 as the example, the initscript file that does this is named /etc/rc.d/rc.sysinit. The command used is:

   # Start any MD RAID arrays that haven't been started yet
   [ -f /etc/mdadm.conf -a -x /sbin/mdadm ] && /sbin/mdadm -As --auto=yes --run

mdadm v3.0 -- Adding the Concept of User-Space Managed External Metadata Formats

The development packages for mdadm v3.0 add a new concept alongside the traditional version-0.90 and version-1 superblocks: metadata formats that are managed externally, from user space.

  Currently two such metadata formats are supported:
    - DDF  - The SNIA standard format
    - Intel Matrix - The metadata used by recent Intel ICH controllers.
  
  Externally managed metadata introduces the concept of a 'container'.
  A container is a collection of (normally) physical devices which have
  a common set of metadata.  A container is assembled as an md array, but
  is left 'inactive'.
  A container can contain one or more data arrays.  These are composed from
  slices (partitions?) of various devices in the container.
  For example, a 5-device DDF set can contain a RAID1 using the first
  half of two devices, a RAID0 using the first half of the remaining 3 devices,
  and a RAID5 over the second half of all 5 devices.


The version-0.90 Superblock Format

Although it was the default superblock format for array creation on most distributions until 2009, the older version-0.90 format has several limitations that make it unsuitable for large arrays or arrays with many component devices.

The version-0.90 superblock limits the number of component devices within an array to 28, and limits each component device to a maximum size of 2TB on kernel versions before 3.1 and 4TB on kernel versions 3.1 and later. The on-disk data is endian-dependent, which prevents disk devices from being moved between big-endian and little-endian machines. There is no room for new metadata such as hostnames.

The superblock is 4K long and is written into a 64K-aligned block that starts at least 64K and less than 128K from the end of the device (i.e. to get the address of the superblock, round the size of the device down to a multiple of 64K and then subtract 64K). The available size of each device is the amount of space before the superblock, so between 64K and 128K is lost when a device is incorporated into an MD array.
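The rounding rule above can be written out as a short calculation. This is only an illustrative sketch: the function name `sb090_offset`, the minimum-size check, and the 1 GB example device are made up for the example and do not come from mdadm.

```python
ALIGN = 64 * 1024  # the 0.90 superblock sits in a 64K-aligned block


def sb090_offset(device_size_bytes: int) -> int:
    """Byte offset of the version-0.90 superblock for a device of this size."""
    if device_size_bytes < 2 * ALIGN:
        # Sanity check added for this sketch: below 128K no data space is left.
        raise ValueError("device too small for data plus a 0.90 superblock")
    # Round the size down to a multiple of 64K, then step back one 64K block.
    return (device_size_bytes // ALIGN) * ALIGN - ALIGN


size = 1_000_000_000  # hypothetical component device size in bytes
offset = sb090_offset(size)
print("superblock at byte", offset, "- bytes lost:", size - offset)
```

The number of bytes lost (`size - offset`) is exactly 64K when the device size is already a multiple of 64K, and approaches 128K otherwise.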

All data the superblock contains is listed in the struct "mdp_superblock_s" in md_p.h in the mdadm source code.

The version-1 Superblock Format

The newer and well-supported version-1 superblock format is more expansion-friendly than the previous format. It is the default as of mdadm v3.1.1. More specifically, --metadata=1.2 is used as of mdadm v3.1.2.

The version-1 superblock is capable of supporting arrays with 384+ component devices, and supports arrays with 64-bit sector lengths.

Note: Current version-1 superblocks use an unsigned 32-bit number for the dev_number but only index the dev_numbers using an array of unsigned 16-bit numbers (so the theoretical range of device numbers in a single array is 0x0000 - 0xFFFD), which allows for 65,534 devices.

Sub-versions of the version-1 superblock

The "version-1" superblock format is currently used in three different "sub-versions".

Sub-Version | Superblock Position on Device
------------|---------------------------------------
0.9         | At the end of the device
1.0         | At the end of the device
1.1         | At the beginning of the device
1.2         | 4K from the beginning of the device

The sub-versions differ primarily (solely?) in the location on each component device at which they actually store the superblock.
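As a rough illustration of the table above, the sketch below maps a version-1 minor version to the sector at which the superblock would start. The function name `sb1_offset_sectors` is made up for this example. The 1.1 and 1.2 cases follow the table directly. The exact rule used for 1.0 (at least 8K from the end of the device, rounded down to a 4K boundary) is not stated on this page; it is an assumption based on the kernel md driver and should be verified against the source before being relied on.

```python
SECTOR = 512  # bytes per sector


def sb1_offset_sectors(minor: int, device_sectors: int) -> int:
    """Start sector of a version-1.<minor> superblock (illustrative only)."""
    if minor == 0:
        # ASSUMPTION, not from this page: back off 8K (16 sectors) from the
        # end of the device, then round down to a 4K (8-sector) boundary.
        return (device_sectors - 16) & ~7
    if minor == 1:
        return 0  # at the beginning of the device
    if minor == 2:
        return 4096 // SECTOR  # 4K from the beginning of the device
    raise ValueError(f"unknown version-1 minor version: {minor}")


for minor in (0, 1, 2):
    print(f"1.{minor}:", sb1_offset_sectors(minor, 2_000_005))
```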

Putting the superblock at the end of the device is dangerous if you have any kind of auto-mounting/auto-detection/auto-activation of the raid contents; in some circumstances (in the case of blkid: if the superblock is damaged) the raid components could be detected as a valid filesystem (or other format) which may contain outdated data. This will desynchronise the array and compromise the data. The other important factor for superblock location is that PC bootloaders need to write to the first sector of a device to make it bootable.

Converting between superblock versions

At the moment, there is no tool to convert between the different superblock versions. However, Neil Brown has posted a workaround to the linux-raid mailing list (see [1]) that allows upgrading an existing array from a 0.90 superblock to a 1.0 superblock. When mdadm is told to re-create the array with a newer metadata version, it overwrites the existing superblock with a 1.0 superblock. This method can only be used to upgrade from 0.90 to 1.0, because the new superblock is placed at the same location on the drive and does not exceed the old superblock's size.

To upgrade the superblock version, perform the following steps:

   # Query the RAID's details such as chunk size, layout, etc.
   mdadm --detail /dev/md0
   # Stop the array
   mdadm --stop /dev/md0
   # Re-create the array with 1.0 metadata (ensure to specify all details, in case defaults have changed!)
   mdadm --create /dev/md0 -l6 -n8 -c64 --layout=la --metadata=1.0 --assume-clean /dev/sda1 /dev/sdb1 /dev/sdc1 ....

When re-creating the array, it is important to specify all details according to the output of the first command. If a setting is left out and defaults have changed, this can cause the array to be re-created with different settings and your data may be lost. Back up your data first.

NOTE: After re-creating the array, its UUID will have changed. Remember to update mdadm.conf accordingly.

The version-1 superblock format on-disk layout

Total Size of superblock

Total Size of superblock: 256 Bytes, plus 2 bytes per device in the array

Section: Superblock/"Magic-Number" Identification area

16 Bytes, Offset 0-15 (0x00 - 0x0F)

Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes
-------------|--------------|-------------------|------------|---------------|-----------|------------|------
0x00 - 0x03 | 0 - 3 | 4 | magic | "Magic Number" (Superblock ID) | __u32 | 0xa92b4efc (little-endian) |
0x04 - 0x07 | 4 - 7 | 4 | major_version | Major Version of the Superblock | __u32 | 1 |
0x08 - 0x0B | 8 - 11 | 4 | feature_map | Feature Map - which extended features (such as volume bitmaps, recovery, or reshape) are in use on this array | __u32 | 0 | Bit-Mapped Field (see the bit values below)
0x0C - 0x0F | 12 - 15 | 4 | pad0 | Padding Block 0 | __u32 | 0 | Always set to zero when writing

feature_map bit values:

Bit Offset | Bit Value | Meaning
-----------|-----------|--------
0 | 1 | RAID Bitmap is used
1 | 2 | RAID Recovery is in progress (See "recovery_offset")
2 | 4 | RAID Reshape is in progress
3 | 8 | undefined/reserved (0)
4 | 16 | undefined/reserved (0)
5 | 32 | undefined/reserved (0)
6 | 64 | undefined/reserved (0)
7 | 128 | undefined/reserved (0)
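The identification area above is small enough to decode by hand. The following is a minimal sketch, assuming all four fields are stored little-endian (as the note on the magic number suggests). The function name `parse_ident` and the synthetic sample bytes are made up for the example; a real tool would first read the block from the superblock location on a component device.

```python
import struct

MD_SB_MAGIC = 0xA92B4EFC  # magic value from the table above

# Defined feature_map bit values from the table above.
FEATURE_BITS = {
    1: "bitmap in use",
    2: "recovery in progress",
    4: "reshape in progress",
}


def parse_ident(block: bytes) -> dict:
    """Decode bytes 0x00-0x0F of a version-1 superblock."""
    magic, major, feature_map, _pad0 = struct.unpack_from("<IIII", block, 0)
    if magic != MD_SB_MAGIC:
        raise ValueError("bad magic: not a version-1 superblock")
    features = [name for bit, name in FEATURE_BITS.items() if feature_map & bit]
    return {"major_version": major, "feature_map": feature_map,
            "features": features}


# Synthetic 16-byte sample (not read from a disk): bitmap + reshape bits set.
sample = struct.pack("<IIII", MD_SB_MAGIC, 1, 0b101, 0)
print(parse_ident(sample))
```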


Section: Per-Array Identification & Configuration area

84 Bytes, Offset 16-99 (0x10 - 0x63)

Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes
-------------|--------------|-------------------|------------|---------------|-----------|------------|------
0x10 - 0x1F | 16 - 31 | 16 | set_uuid | UUID for the Array(?) | __u8[16] | | Set by user-space formatting utility (mdadm, for example).
0x20 - 0x3F | 32 - 63 | 32 | set_name | Name for the Array(?) | char[32] | | Set and used by user-space utilities (mdadm, for example).
0x40 - 0x47 | 64 - 71 | 8 | ctime | Creation Time | __u64 | | low 40-bits are seconds, high 24-bits are microseconds
0x48 - 0x4B | 72 - 75 | 4 | level | RAID Level of the Array | __u32 | see the level values below | Note: mdadm versions (as of v2.6.4) limit RAID-6 (creation) to 256 disks or less.
0x4C - 0x4F | 76 - 79 | 4 | layout | layout of array (RAID5 (and 6? and raid 10?) only) | __u32 | see the layout values below | Controls the relative arrangement of data and parity blocks on the disks.
0x50 - 0x57 | 80 - 87 | 8 | size | used-size of component devices | __u64 | size of component devices (in # of 512-byte sectors) |
0x58 - 0x5B | 88 - 91 | 4 | chunksize | chunk-size of the array | __u32 | chunk-size of the array (in # of 512-byte sectors) | default is 64K? for raid levels 0, 10, 4, 5, and 6. chunksize not used in raid levels 1, linear, and multi-path. Note: During creation this appears to be created as a multiple of 1024 rather than 512.
0x5C - 0x5F | 92 - 95 | 4 | raid_disks | (?)number of disks in array(?) | __u32 | # | raid4 requires a minimum of 2 member devs. raid5 requires a minimum of 2 member devs. raid6 requires a minimum of 4 member devs. raid6 limited to a max of 256 member devs.
0x60 - 0x63 | 96 - 99 | 4 | bitmap_offset | # of sectors after superblock that bitmap starts (See note about signed value) | __u32 (signed) | | Note: This is only meaningful if feature_map[0] is set. This is a signed value, which allows the bitmap to appear physically before the superblock on the disk.

level values:

Value | Meaning
------|--------
-4 | Multi-Path
-1 | Linear
0 | RAID-0 (Striped)
1 | RAID-1 (Mirrored)
4 | RAID-4 (Striped with Dedicated Block-Level Parity)
5 | RAID-5 (Striped with Distributed Parity)
6 | RAID-6 (Striped with Dual Parity)
10 (hex A) | RAID-10 (Mirror of stripes)

layout values:

Value | Meaning
------|--------
0 | left asymmetric
1 | right asymmetric
2 | left symmetric (default)
3 | right symmetric
01 02 01 00 | Raid-10 Offset2
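To make the field packing above concrete, here is a minimal sketch that unpacks the 84-byte per-array area and splits `ctime` into its seconds and microseconds parts. It assumes little-endian storage throughout. It reads `level` and `bitmap_offset` as signed 32-bit values, since the table lists negative level codes and describes `bitmap_offset` as signed. The names `ARRAY_AREA`, `split_time`, and `parse_array_area`, and all sample values, are made up for the example.

```python
import struct

# Level codes from the table above (10 appears there as hex A).
LEVELS = {-4: "multipath", -1: "linear", 0: "raid0", 1: "raid1",
          4: "raid4", 5: "raid5", 6: "raid6", 10: "raid10"}

# set_uuid, set_name, ctime, level, layout, size, chunksize, raid_disks,
# bitmap_offset: 84 bytes starting at offset 0x10.
ARRAY_AREA = struct.Struct("<16s32sQiIQIIi")


def split_time(raw: int) -> tuple:
    """Return (seconds, microseconds): low 40 bits and high 24 bits."""
    return raw & ((1 << 40) - 1), raw >> 40


def parse_array_area(block: bytes) -> dict:
    (set_uuid, set_name, ctime, level, layout, size,
     chunksize, raid_disks, bitmap_offset) = ARRAY_AREA.unpack_from(block, 0x10)
    return {
        "uuid": set_uuid.hex(),
        "name": set_name.rstrip(b"\0").decode("ascii", "replace"),
        "ctime": split_time(ctime),
        "level": LEVELS.get(level, f"unknown ({level})"),
        "layout": layout,
        "size_sectors": size,
        "chunk_bytes": chunksize * 512,  # chunksize is in 512-byte sectors
        "raid_disks": raid_disks,
        "bitmap_offset": bitmap_offset,
    }


# Synthetic sample: 16 zero bytes stand in for the identification area.
sample = bytes(0x10) + ARRAY_AREA.pack(
    bytes(16), b"host:0", (250000 << 40) | 1230768000,
    5, 2, 1953519872, 128, 4, 8)
info = parse_array_area(sample)
print(info["name"], info["level"], info["ctime"], info["chunk_bytes"])
```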


Section: RAID-Reshape In-Process Metadata Storage/Recovery area

28 Bytes, Offset 100-127 (0x64 - 0x7F)
(Note: Only contains valid data if the feature_map bit with value 4, "RAID Reshape is in progress", is set)

Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes
-------------|--------------|-------------------|------------|---------------|-----------|------------|------
0x64 - 0x67 | 100 - 103 | 4 | new_level | the new RAID level being reshaped-to | __u32 | see level field (above) |
0x68 - 0x6F | 104 - 111 | 8 | reshape_position | Next address of the array to reshape | __u64 | current position of the reshape operation |
0x70 - 0x73 | 112 - 115 | 4 | delta_disks | this holds the change in # of raid disks | __u32 | change in # of raid disks |
0x74 - 0x77 | 116 - 119 | 4 | new_layout | new layout for array | __u32 | see layout field (above) |
0x78 - 0x7B | 120 - 123 | 4 | new_chunk | new chunk size | __u32 | see chunksize field (above) |
0x7C - 0x7F | 124 - 127 | 4 | pad1 | Padding Block #1 | __u8[4] | 0 | Always set to zero when writing



Section: This-Component-Device Information area

64 Bytes, Offset 128-191 (0x80 - 0xbf)

Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes
-------------|--------------|-------------------|------------|---------------|-----------|------------|------
0x80 - 0x87 | 128 - 135 | 8 | data_offset | the sector # upon which data starts | __u64 | sector # where data begins | 0 if the superblock is at the end; the reserved space for the superblock and bitmap otherwise
0x88 - 0x8F | 136 - 143 | 8 | data_size | sectors in the device that are used for data | __u64 | # of sectors that can be used for data |
0x90 - 0x97 | 144 - 151 | 8 | super_offset | # of the sector upon which this superblock starts | __u64 | sector # |
0x98 - 0x9F | 152 - 159 | 8 | recovery_offset | sectors before this offset (from data_offset) have been recovered | __u64 | sector # |
0xA0 - 0xA3 | 160 - 163 | 4 | dev_number | Permanent identifier of this device (Not its role in RAID(?)) | __u32 | | This is shown as "Array Slot" by the mdadm v2.x "--examine" command. Note: This is a 32-bit unsigned integer, but the Device-Roles (Positions-in-Array) Area indexes these values using only 16-bit unsigned integers, and reserves the values 0xFFFF as spare and 0xFFFE as faulty, so only 65,534 devices per array are possible.
0xA4 - 0xA7 | 164 - 167 | 4 | cnt_corrected_read | Number of read-errors that were corrected by re-writing | __u32 | |
0xA8 - 0xB7 | 168 - 183 | 16 | device_uuid | UUID of the component device | __u8[16] | | Set by user space. Ignored by kernel.
0xB8 | 184 | 1 | devflags | Per-Device Flags (Bit-Mapped Field) | __u8 | | Bit-Mapped Field (see the bit values below)
0xB9 - 0xBF | 185 - 191 | 7 | pad2 | Padding block 2 | __u8[7] | 0 | Always set to zero when writing

devflags bit values:

Bit Offset | Bit Value | Meaning
-----------|-----------|--------
0 | 1 | WriteMostly1
1 | 2 | (?)
2 | 4 | (?)
3 | 8 | (?)
4 | 16 | (?)
5 | 32 | (?)
6 | 64 | (?)
7 | 128 | (?)

WriteMostly1 indicates that this device should only be updated on writes, not read from. (Useful with slow devices in RAID1 arrays?)
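The per-device area can be unpacked the same way as the earlier areas. The sketch below is illustrative only and assumes little-endian storage. The names `DEVICE_AREA` and `parse_device_area` and the sample values are hypothetical, and only the WriteMostly1 flag is decoded because the other `devflags` bits are undocumented above.

```python
import struct

WRITE_MOSTLY = 1  # devflags bit value 1 (WriteMostly1)

# data_offset, data_size, super_offset, recovery_offset, dev_number,
# cnt_corrected_read, device_uuid, devflags, then 7 bytes of pad2.
DEVICE_AREA = struct.Struct("<QQQQII16sB7x")  # 64 bytes at offset 0x80


def parse_device_area(block: bytes) -> dict:
    (data_offset, data_size, super_offset, recovery_offset, dev_number,
     cnt_corrected_read, device_uuid, devflags) = DEVICE_AREA.unpack_from(block, 0x80)
    return {
        "data_offset": data_offset,          # sectors
        "data_size": data_size,              # sectors
        "super_offset": super_offset,        # sector where this superblock starts
        "recovery_offset": recovery_offset,  # sectors, relative to data_offset
        "dev_number": dev_number,
        "cnt_corrected_read": cnt_corrected_read,
        "device_uuid": device_uuid.hex(),
        "write_mostly": bool(devflags & WRITE_MOSTLY),
    }


# Synthetic sample: 0x80 zero bytes stand in for the preceding areas.
sample = bytes(0x80) + DEVICE_AREA.pack(
    2048, 1953519616, 8, 0, 3, 0, bytes(range(16)), WRITE_MOSTLY)
print(parse_device_area(sample))
```

In this sample, `super_offset` of 8 sectors corresponds to a superblock 4K from the start of the device, which matches the 1.2 sub-version.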


Section: Array-State Information area

64 Bytes, Offset 192-255 (0xC0 - 0xFF)

Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes
-------------|--------------|-------------------|------------|---------------|-----------|------------|------
0xC0 - 0xC7 | 192 - 199 | 8 | utime | Time of last superblock update | __u64 | | low 40-bits are seconds, high 24-bits are microseconds. Updated whenever the superblock is updated.
0xC8 - 0xCF | 200 - 207 | 8 | events | Event Count for the Array | __u64 | # | Incremented whenever the superblock is updated. Used by mdadm in re-assembly to detect failed/out-of-sync component devices.
0xD0 - 0xD7 | 208 - 215 | 8 | resync_offset | Offsets before this one (starting from data_offset) are 'known' to be in sync. | __u64 | offset # |
0xD8 - 0xDB | 216 - 219 | 4 | sb_csum | Checksum of this superblock up to devs[max_dev] | __u32 | # | This value will be different for each component device's superblock.
0xDC - 0xDF | 220 - 223 | 4 | max_dev | How many devices are part of (or related to) the array | __u32 | # |
0xE0 - 0xFF | 224 - 255 | 32 | pad3 | Padding Block 3 | __u8[32] | 0 | Always set to zero when writing
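The array-state area also yields the total superblock length, using the "256 bytes plus 2 bytes per device" rule given earlier. This is a minimal sketch assuming little-endian storage. The names `STATE_AREA` and `parse_state_area` and the sample values are made up for the example, and the checksum is only read, not verified.

```python
import struct

# utime, events, resync_offset, sb_csum, max_dev, then 32 bytes of pad3.
STATE_AREA = struct.Struct("<QQQII32x")  # 64 bytes at offset 0xC0


def parse_state_area(block: bytes) -> dict:
    utime, events, resync_offset, sb_csum, max_dev = STATE_AREA.unpack_from(block, 0xC0)
    return {
        "utime_seconds": utime & ((1 << 40) - 1),  # low 40 bits
        "utime_micros": utime >> 40,               # high 24 bits
        "events": events,
        "resync_offset": resync_offset,
        "sb_csum": sb_csum,                        # read only, not verified here
        "max_dev": max_dev,
        # 256 fixed bytes plus one 2-byte dev_roles entry per device.
        "superblock_bytes": 256 + 2 * max_dev,
    }


# Synthetic sample: 0xC0 zero bytes stand in for the preceding areas.
sample = bytes(0xC0) + STATE_AREA.pack((7 << 40) | 1230768000, 42, 0, 0, 384)
state = parse_state_area(sample)
print(state["events"], state["max_dev"], state["superblock_bytes"])
```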


Section: Device-Roles (Positions-in-Array) area

Length: Variable number of bytes (but at least 768 bytes?)
2 Bytes per device in the array, including both spare-devices and faulty-devices

768 (or more?) Bytes, Offset 256-1023 (0x100 - 0x3FF)

Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes
-------------|--------------|-------------------|------------|---------------|-----------|------------|------
0x100 - 0x101 | 256 - 257 | 2 | dev_roles | Role or Position of first device in the array | __u16 | | 0xFFFF means "spare". 0xFFFE means "faulty".
0x102 - 0x103 | 258 - 259 | 2 | dev_roles | Role or Position of second device in the array | __u16 | |
... | | | | | | |
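As a small illustration of the roles area, the sketch below decodes the 16-bit entries and labels the two reserved values. It assumes little-endian storage and takes `max_dev` from the array-state area. The function name `parse_dev_roles` and the sample data are hypothetical.

```python
import struct

ROLE_SPARE = 0xFFFF   # reserved value: device is a spare
ROLE_FAULTY = 0xFFFE  # reserved value: device is faulty


def parse_dev_roles(block: bytes, max_dev: int) -> list:
    """Decode max_dev 16-bit role entries starting at offset 0x100."""
    raw = struct.unpack_from(f"<{max_dev}H", block, 0x100)
    roles = []
    for value in raw:
        if value == ROLE_SPARE:
            roles.append("spare")
        elif value == ROLE_FAULTY:
            roles.append("faulty")
        else:
            roles.append(value)  # position of the device within the array
    return roles


# Synthetic sample: 0x100 zero bytes stand in for the fixed 256-byte header.
sample = bytes(0x100) + struct.pack("<4H", 0, 1, ROLE_SPARE, ROLE_FAULTY)
roles = parse_dev_roles(sample, 4)
print(roles)
```

Going by the dev_number note above, a device's own entry would be `roles[dev_number]`.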