RAID superblock formats
Revision as of 15:58, 19 September 2013
Currently, the Linux kernel RAID subsystem recognizes two distinct superblock formats, known as the "version-0.90" and "version-1" superblock formats.
A Note about kernel autodetection of different superblock formats
Current Linux kernels (as of 2.6.28) can only autodetect (based on partition type being set to FD) arrays with superblock version 0.90.
The boot-loader LILO also can only boot from the version 0.90 superblock arrays. Alternative boot loaders, GRUB specifically, probably don't have this particular limitation.
As a workaround for the kernel-autodetection issue, several distributions, including Ubuntu and Fedora, circa early 2009, include init scripts that run any arrays that aren't started by auto-detect, which can include arrays using the newer 1.x superblocks.
Using Fedora 9 as the example, the initscript file that does this is named /etc/rc.d/rc.sysinit. The command used is:
    # Start any MD RAID arrays that haven't been started yet
    [ -f /etc/mdadm.conf -a -x /sbin/mdadm ] && /sbin/mdadm -As --auto=yes --run
mdadm v3.0 -- Adding the Concept of User-Space Managed External Metadata Formats
In the development packages for mdadm v3.0, support for externally managed (user-space) metadata formats is added alongside the traditional version-0.90 and version-1 superblocks.
Currently two such metadata formats are supported:
- DDF - The SNIA standard format
- Intel Matrix - The metadata used by recent Intel ICH controllers.

Externally managed metadata introduces the concept of a 'container'. A container is a collection of (normally) physical devices which have a common set of metadata. A container is assembled as an md array, but is left 'inactive'.
A container can contain one or more data arrays. These are composed from slices (partitions?) of various devices in the container.
For example, a 5-device DDF set can contain a RAID1 using the first half of two devices, a RAID0 using the first half of the remaining 3 devices, and a RAID5 over the second half of all 5 devices.
The version-0.90 Superblock Format
Though it is the default RAID superblock format used during array creation on most distributions, the older version-0.90 superblock format has several limitations that restrict its use on large arrays or arrays with many component devices.
The version-0.90 superblock limits the number of component devices within an array to 28, and limits each component device to a maximum size of 2TB on kernel version <3.1 and 4TB on kernel version >=3.1.
The superblock is 4K long and is written into a 64K-aligned block that starts at least 64K and less than 128K from the end of the device (i.e. to get the address of the superblock, round the size of the device down to a multiple of 64K and then subtract 64K). The available size of each device is the amount of space before the superblock, so between 64K and 128K is lost when a device is incorporated into an MD array.
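The rounding rule above can be sketched in Python as follows. This is only an illustration of the arithmetic described in the text; the function name `sb090_offset` is hypothetical and not part of mdadm or the kernel.

```python
RESERVED_BYTES = 64 * 1024  # the 64K alignment / reservation described above


def sb090_offset(device_size_bytes):
    """Byte offset of the version-0.90 superblock for a device of this size.

    Rounds the device size down to a multiple of 64K, then subtracts 64K.
    The returned offset is also the space available before the superblock.
    """
    if device_size_bytes < RESERVED_BYTES:
        raise ValueError("device is smaller than the 64K reserved area")
    aligned = device_size_bytes - (device_size_bytes % RESERVED_BYTES)
    return aligned - RESERVED_BYTES


if __name__ == "__main__":
    size = 1_000_000_000  # hypothetical device size in bytes
    offset = sb090_offset(size)
    print("superblock at byte", offset, "- space lost:", size - offset)
```

For any device size, the difference between the size and the returned offset is at least 64K and less than 128K, matching the statement above.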
All data the superblock contains is listed in the struct "mdp_superblock_s" in md_p.h in the mdadm source code.
The version-1 Superblock Format
The newer and well-supported, but not yet default, version-1 superblock format is more expansion-friendly than the previous format.
The version-1 superblock is capable of supporting arrays with 384+ component devices, and supports arrays with 64-bit sector lengths.
Note: Current version-1 superblocks use an unsigned 32-bit number for the dev_number but only index the dev_numbers using an array of unsigned 16-bit numbers (so the theoretical range of device numbers in a single array is 0x0000 - 0xFFFD), which allows for 65,534 devices.
Sub-versions of the version-1 superblock
The "version-1" superblock format is currently used in three different "sub-versions".
The sub-versions differ primarily (solely?) in the location on each component device at which they actually store the superblock.
Sub-Version | Superblock Position on Device |
---|---|
1.0 | At the end of the device |
1.1 | At the beginning of the device |
1.2 | 4K from the beginning of the device |
Putting the superblock at the end of the device is dangerous if you do auto-mounting of the raid contents; if the superblock is damaged the raid components could be detected as a valid filesystem (or other format) which may contain outdated data.
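As a rough illustration of the three positions, the following Python sketch computes the superblock start in 512-byte sectors. The 1.1 and 1.2 positions come straight from the table above. The exact 1.0 rule used here (at least 8K from the end of the device, rounded down to a 4K boundary) is an assumption based on the kernel md driver and is not stated on this page. The function name `sb1_offset_sectors` is hypothetical.

```python
def sb1_offset_sectors(minor_version, device_sectors):
    """Start of the version-1 superblock, in 512-byte sectors.

    minor_version is the sub-version: 0 for 1.0, 1 for 1.1, 2 for 1.2.
    """
    if minor_version == 0:
        # Assumption (kernel md driver behaviour): step back 8K (16 sectors)
        # from the end, then round down to a 4K (8-sector) boundary.
        start = device_sectors - 8 * 2
        if start < 0:
            raise ValueError("device too small for a 1.0 superblock")
        return start & ~(4 * 2 - 1)
    if minor_version == 1:
        return 0  # at the beginning of the device
    if minor_version == 2:
        return 8  # 4K from the beginning of the device
    raise ValueError("unknown version-1 sub-version: %r" % (minor_version,))


if __name__ == "__main__":
    for minor in (0, 1, 2):
        print("1.%d ->" % minor, sb1_offset_sectors(minor, 2051), "sectors")
```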
Converting between superblock versions
At the moment, there is no tool to convert between the different superblock versions. However, Neil Brown has posted a work-around to the linux-raid mailing list (see [1]) that allows upgrading an existing RAID from a 0.90 superblock to a 1.0 superblock. When mdadm is forced to re-create the RAID with a newer metadata version, it overwrites the existing superblock with a 1.0 superblock. This method can only be used to upgrade from 0.90 to 1.0, because the new superblock will be placed at the same location on the drive and does not exceed the old superblock's size.
To upgrade the superblock version, perform the following steps:
    # Query the RAID's details such as chunk size, layout, etc.
    mdadm --detail /dev/md0
    # Stop the array
    mdadm --stop /dev/md0
    # Re-create the array with 1.0 metadata (be sure to specify all details, in case defaults have changed!)
    mdadm --create /dev/md0 -l6 -n8 -c64 --layout=la --metadata=1.0 --assume-clean /dev/sda1 /dev/sdb1 /dev/sdc1 ....
When re-creating the array, it is important to specify all details according to the output of the first command. If a setting is left out and defaults have changed, this can cause the array to be re-created with different settings and your data may be lost. Back up your data first.
NOTE: After re-creating the array, its UUID will have changed. Remember to update mdadm.conf accordingly.
The version-1 superblock format on-disk layout
Total Size of superblock
Total Size of superblock: 256 Bytes, plus 2 bytes per device in the array
Section: Superblock/"Magic-Number" Identification area
16 Bytes, Offset 0-15 (0x00 - 0x0F)
Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes |
---|---|---|---|---|---|---|---|
0x00 - 0x03 | 0 - 3 | 4 | magic | "Magic Number" (Superblock ID) | __u32 | 0xa92b4efc (little-endian) | |
0x04 - 0x07 | 4 - 7 | 4 | major_version | Major Version of the Superblock | __u32 | 1 | |
0x08 - 0x0B | 8 - 11 | 4 | feature_map | Feature Map - which extended features (such as volume bitmaps, recovery, or reshape) are in use on this array | __u32 | 0 (Bit-Mapped Field) | |
0x0C - 0x0F | 12 - 15 | 4 | pad0 | Padding Block 0 | __u32 | 0 | Always set to zero when writing |
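A minimal Python sketch of reading this identification area from a raw superblock buffer is shown below. It assumes that all four fields are stored little-endian, as the table notes for the magic number. The function name `parse_ident` is hypothetical.

```python
import struct

MD_SB_MAGIC = 0xA92B4EFC  # magic value from the table above


def parse_ident(sb):
    """Decode offsets 0x00-0x0F of a version-1 superblock buffer."""
    # Four consecutive little-endian 32-bit unsigned integers.
    magic, major_version, feature_map, pad0 = struct.unpack_from("<4I", sb, 0)
    if magic != MD_SB_MAGIC:
        raise ValueError("not an md superblock (magic 0x%08x)" % magic)
    if major_version != 1:
        raise ValueError("unexpected major version %d" % major_version)
    return {
        "magic": magic,
        "major_version": major_version,
        "feature_map": feature_map,
        "pad0": pad0,
    }


if __name__ == "__main__":
    example = struct.pack("<4I", MD_SB_MAGIC, 1, 0, 0)
    print(parse_ident(example))
```

Because the value is stored little-endian, the first four bytes on disk read `fc 4e 2b a9`.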
Section: Per-Array Identification & Configuration area
84 Bytes, Offset 16-99 (0x10 - 0x63)
Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes |
---|---|---|---|---|---|---|---|
0x10 - 0x1F | 16 - 31 | 16 | set_uuid | UUID for the Array(?) | __u8[16] | Set by user-space formatting utility (mdadm, for example). | |
0x20 - 0x3F | 32 - 63 | 32 | set_name | Name for the Array(?) | char[32] | Set and used by user-space utilities (mdadm, for example). | |
0x40 - 0x47 | 64 - 71 | 8 | ctime | Creation Time | __u64 | low 40 bits are seconds, high 24 bits are microseconds | |
0x48 - 0x4B | 72 - 75 | 4 | level | RAID Level of the Array | __u32 | | Note: mdadm versions (as of v2.6.4) limit RAID-6 (creation) to 256 disks or less. |
0x4C - 0x4F | 76 - 79 | 4 | layout | Layout of array (RAID5 (and 6? and RAID10?) only) | __u32 | | Controls the relative arrangement of data and parity blocks on the disks. |
0x50 - 0x57 | 80 - 87 | 8 | size | used-size of component devices | __u64 | size of component devices (in # of 512-byte sectors) | |
0x58 - 0x5B | 88 - 91 | 4 | chunksize | chunk-size of the array | __u32 | chunk-size of the array (in # of 512-byte sectors) | default is 64K? for raid levels 0, 10, 4, 5, and 6 |
0x5C - 0x5F | 92 - 95 | 4 | raid_disks | (?)number of disks in array(?) | __u32 | # | raid4 requires a minimum of 2 member devs |
0x60 - 0x63 | 96 - 99 | 4 | bitmap_offset | # of sectors after superblock that bitmap starts (See note about signed value) | __u32 | (signed) | Note: This is only meaningful if feature_map[0] is set. This is a signed value which allows the bitmap to appear physically before the superblock on the disk. |
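The fields of this area can be decoded with a single `struct` format, as in the Python sketch below. It assumes little-endian storage throughout, splits `ctime` into its 40-bit and 24-bit parts as the table describes, and reads `bitmap_offset` as a signed value as the note requires. `level` is read as unsigned to match the table; negative levels would need a signed read. The names `ARRAY_AREA`, `split_md_time` and `parse_array_area` are hypothetical.

```python
import struct

SECTOR_BYTES = 512
# set_uuid, set_name, ctime, level, layout, size, chunksize, raid_disks,
# bitmap_offset (signed): 84 bytes covering offsets 0x10-0x63.
ARRAY_AREA = struct.Struct("<16s32sQIIQIIi")


def split_md_time(raw):
    """Split a 64-bit md timestamp into (seconds, microseconds)."""
    seconds = raw & ((1 << 40) - 1)  # low 40 bits
    micros = raw >> 40               # high 24 bits
    return seconds, micros


def parse_array_area(sb):
    """Decode the per-array area of a version-1 superblock buffer."""
    (set_uuid, set_name, ctime, level, layout, size,
     chunksize, raid_disks, bitmap_offset) = ARRAY_AREA.unpack_from(sb, 0x10)
    seconds, micros = split_md_time(ctime)
    return {
        "set_uuid": set_uuid.hex(),
        # Keep only the bytes before the first NUL, if any.
        "set_name": set_name.split(b"\0", 1)[0].decode("utf-8", "replace"),
        "ctime_seconds": seconds,
        "ctime_micros": micros,
        "level": level,
        "layout": layout,
        "size_bytes": size * SECTOR_BYTES,
        "chunk_bytes": chunksize * SECTOR_BYTES,
        "raid_disks": raid_disks,
        "bitmap_offset": bitmap_offset,  # sectors, may be negative
    }
```

A `chunksize` of 128 sectors, for instance, corresponds to a 64K chunk.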
Section: RAID-Reshape In-Process Metadata Storage/Recovery area
28 Bytes, Offset 100-127 (0x64 - 0x7F)
(Note: Only contains valid data if feature_map bit '4' is set)
Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes |
---|---|---|---|---|---|---|---|
0x64 - 0x67 | 100 - 103 | 4 | new_level | the new RAID level being reshaped-to | __u32 | see level field (above) | |
0x68 - 0x6F | 104 - 111 | 8 | reshape_position | Next address of the array to reshape | __u64 | current position of the reshape operation | |
0x70 - 0x73 | 112 - 115 | 4 | delta_disks | this holds the change in # of raid disks | __u32 | change in # of raid disks | |
0x74 - 0x77 | 116 - 119 | 4 | new_layout | new layout for array | __u32 | see layout field (above) | |
0x78 - 0x7B | 120 - 123 | 4 | new_chunk | new chunk size | __u32 | see chunksize field (above) | |
0x7C - 0x7F | 124 - 127 | 4 | pad1 | Padding Block #1 | __u8[4] | 0 | Always set to zero when writing |
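Since this area is only valid when the reshape feature flag is set, a reader should test `feature_map` before decoding it. The Python sketch below does so under two assumptions that are not spelled out on this page: that "bit '4'" means the flag with numeric value 4, and that `delta_disks` should be read as signed so that a shrink shows up as a negative change. The names `FEATURE_RESHAPE` and `parse_reshape_area` are hypothetical.

```python
import struct

FEATURE_RESHAPE = 4  # assumption: "bit '4'" is the flag with value 4


def parse_reshape_area(sb):
    """Return the reshape fields, or None if no reshape is flagged."""
    (feature_map,) = struct.unpack_from("<I", sb, 0x08)
    if not feature_map & FEATURE_RESHAPE:
        return None  # the area holds no valid data
    # new_level, reshape_position, delta_disks (read as signed, an
    # assumption), new_layout, new_chunk: 24 bytes at 0x64-0x7B.
    (new_level, reshape_position, delta_disks,
     new_layout, new_chunk) = struct.unpack_from("<IQiII", sb, 0x64)
    return {
        "new_level": new_level,
        "reshape_position": reshape_position,
        "delta_disks": delta_disks,
        "new_layout": new_layout,
        "new_chunk": new_chunk,
    }
```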
Section: This-Component-Device Information area
64 Bytes, Offset 128-191 (0x80 - 0xbf)
Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes |
---|---|---|---|---|---|---|---|
0x80 - 0x87 | 128 - 135 | 8 | data_offset | the sector # upon which data starts | __u64 | sector # where data begins (Often 0) | |
0x88 - 0x8F | 136 - 143 | 8 | data_size | sectors in the device that are used for data | __u64 | # of sectors that can be used for data | |
0x90 - 0x97 | 144 - 151 | 8 | super_offset | # of the sector upon which this superblock starts | __u64 | # of the sector upon which this superblock starts | |
0x98 - 0x9F | 152 - 159 | 8 | recovery_offset | sectors before this offset (from data_offset) have been recovered | __u64 | sector # | |
0xA0 - 0xA3 | 160 - 163 | 4 | dev_number | | __u32 | Permanent identifier of this device (Not its role in RAID(?)) | This is shown as "Array Slot" by the mdadm v2.x "--examine" command. Note: This is a 32-bit unsigned integer, but the Device-Roles (Positions-in-Array) Area indexes these values using only 16-bit unsigned integers, and reserves the values 0xFFFF as spare and 0xFFFE as faulty, so only 65,534 devices per array are possible. |
0xA4 - 0xA7 | 164 - 167 | 4 | cnt_corrected_read | Number of read-errors that were corrected by re-writing | __u32 | | |
0xA8 - 0xB7 | 168 - 183 | 16 | device_uuid | UUID of the component device | __u8[16] | Set by user-space; ignored by kernel | |
0xB8 | 184 | 1 | devflags | Per-Device Flags (Bit-Mapped Field) | __u8 | Bit-Mapped Field | WriteMostly1 indicates that this device should only be updated on writes, not read from. (Useful with slow devices in RAID1 arrays?) |
0xB9 - 0xBF | 185 - 191 | 7 | pad2 | Padding block 2 | __u8[7] | 0 | Always set to zero when writing |
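The sketch below decodes this per-device area in Python and derives the byte range that holds array data on the component device. It assumes little-endian fields and 512-byte sectors, and it assumes that the WriteMostly1 flag is bit value 1 of `devflags`, as the name suggests. The names `DEVICE_AREA`, `WRITE_MOSTLY1` and `parse_device_area` are hypothetical.

```python
import struct

SECTOR_BYTES = 512
WRITE_MOSTLY1 = 1  # assumption: the flag occupies bit value 1 of devflags
# data_offset, data_size, super_offset, recovery_offset, dev_number,
# cnt_corrected_read, device_uuid, devflags: 57 bytes at 0x80-0xB8.
DEVICE_AREA = struct.Struct("<QQQQII16sB")


def parse_device_area(sb):
    """Decode the this-component-device area of a version-1 superblock."""
    (data_offset, data_size, super_offset, recovery_offset, dev_number,
     cnt_corrected_read, device_uuid, devflags) = DEVICE_AREA.unpack_from(sb, 0x80)
    return {
        "data_start_byte": data_offset * SECTOR_BYTES,
        "data_end_byte": (data_offset + data_size) * SECTOR_BYTES,
        "super_offset": super_offset,
        "recovery_offset": recovery_offset,
        "dev_number": dev_number,
        "cnt_corrected_read": cnt_corrected_read,
        "device_uuid": device_uuid.hex(),
        "write_mostly": bool(devflags & WRITE_MOSTLY1),
    }
```

For a 1.2 superblock, `super_offset` would be expected to read 8 (4K from the start of the device), which gives a quick consistency check against the sub-version table.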
Section: Array-State Information area
64 Bytes, Offset 192-255 (0xC0 - 0xFF)
Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes |
---|---|---|---|---|---|---|---|
0xC0 - 0xC7 | 192 - 199 | 8 | utime | Time of last superblock update | __u64 | low 40 bits are seconds, high 24 bits are microseconds | Updated whenever the superblock is updated. |
0xC8 - 0xCF | 200 - 207 | 8 | events | Event Count for the Array | __u64 | # | Incremented whenever the superblock is updated. Used by mdadm in re-assembly to detect failed/out-of-sync component devices. |
0xD0 - 0xD7 | 208 - 215 | 8 | resync_offset | Offsets before this one (starting from data_offset) are 'known' to be in sync. | __u64 | offset # | |
0xD8 - 0xDB | 216 - 219 | 4 | sb_csum | Checksum of this superblock up to devs[max_dev] | __u32 | # | This value will be different for each component device's superblock. |
0xDC - 0xDF | 220 - 223 | 4 | max_dev | How many devices are part of (or related to) the array | __u32 | # | |
0xE0 - 0xFF | 224 - 255 | 32 | pad3 | Padding Block 3 | __u8[32] | 0 | Always set to zero when writing |
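This page does not describe how `sb_csum` is computed. The Python sketch below follows the approach of the kernel md driver as best recalled, so it should be treated as an assumption and checked against the kernel source. The little-endian 32-bit words of the first 256 + 2 × max_dev bytes are summed with `sb_csum` itself counted as zero, a trailing 16-bit word is added if the length is not a multiple of four, and the 64-bit sum is folded into 32 bits. The function name `sb1_checksum` is hypothetical.

```python
import struct


def sb1_checksum(sb):
    """Checksum of a version-1 superblock buffer (assumed algorithm)."""
    (max_dev,) = struct.unpack_from("<I", sb, 0xDC)
    size = 256 + 2 * max_dev  # fixed part plus 2 bytes per device role
    if len(sb) < size:
        raise ValueError("buffer shorter than the superblock it describes")
    data = bytearray(sb[:size])
    struct.pack_into("<I", data, 0xD8, 0)  # sb_csum is treated as zero
    n_words = size // 4
    total = 0
    for (word,) in struct.iter_unpack("<I", bytes(data[:n_words * 4])):
        total += word
    if size % 4 == 2:
        # An odd number of 16-bit roles leaves one trailing half-word.
        total += struct.unpack_from("<H", data, n_words * 4)[0]
    # Fold the carry bits back into the low 32 bits.
    return ((total & 0xFFFFFFFF) + (total >> 32)) & 0xFFFFFFFF
```

Because the stored checksum field is zeroed before summing, the function returns the same value whether or not `sb_csum` has already been filled in, so a superblock can be verified by comparing the result with the stored field.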
Section: Device-Roles (Positions-in-Array) area
Length: Variable number of bytes (but at least 768 bytes?)
2 Bytes per device in the array, including both spare-devices and faulty-devices
768 (or more?) Bytes, Offset 256-1023 (0x100 - 0x3FF)
Offset (Hex) | Offset (Dec) | Length (in bytes) | Field Name | Usage/Meaning | Data Type | Data Value | Notes |
---|---|---|---|---|---|---|---|
0x100 - 0x101 | 256 - 257 | 2 | dev_roles | | __u16 | Role or Position of first device in the array. 0xFFFF means "spare". 0xFFFE means "faulty". | |
0x102 - 0x103 | 258 - 259 | 2 | dev_roles | | __u16 | Role or Position of second device in the array. | |
... |
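A short Python sketch of decoding the role table follows. It assumes little-endian 16-bit entries starting at offset 0x100 and takes the number of entries from `max_dev` at offset 0xDC. The names `describe_role` and `parse_dev_roles` are hypothetical.

```python
import struct

ROLE_SPARE = 0xFFFF   # reserved value meaning "spare"
ROLE_FAULTY = 0xFFFE  # reserved value meaning "faulty"


def describe_role(role):
    """Human-readable form of one dev_roles entry."""
    if role == ROLE_SPARE:
        return "spare"
    if role == ROLE_FAULTY:
        return "faulty"
    return "active slot %d" % role


def parse_dev_roles(sb):
    """Decode the dev_roles entries of a version-1 superblock buffer."""
    (max_dev,) = struct.unpack_from("<I", sb, 0xDC)
    if len(sb) < 0x100 + 2 * max_dev:
        raise ValueError("buffer too short for %d role entries" % max_dev)
    roles = struct.unpack_from("<%dH" % max_dev, sb, 0x100)
    return [describe_role(role) for role in roles]
```

The entry for the device that owns a given superblock is the one at index `dev_number`, since the role table is indexed by that field.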