Partition alignment of drives with internal sector size larger than 512 bytes
This document (7007193) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Desktop 11 Service Pack 1
openSUSE 11.3
Situation
The issue
The drives achieve their best performance when accesses are aligned with the internal block size. As the Linux kernel typically accesses the disk in multiples of the hardware page size (4k on x86), an unaligned read often requires one more internal block to be read than an aligned one. Worse, on writes that cover only part of an internal block, the drive might need to do an expensive read-modify-write (RMW) cycle rather than just a write. So on a rotating drive with 4k internal block size, a single unaligned 4k write may incur an 11ms penalty on a 5400rpm HDD.
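The 11ms figure can be reproduced with a little arithmetic. The sketch below assumes that the penalty corresponds to roughly one full platter revolution (an interpretation; the article does not spell this out), and the function name is purely illustrative.

```python
def full_rotation_ms(rpm):
    """Time for one full platter revolution, in milliseconds."""
    return 60.0 * 1000.0 / rpm

# One revolution on a 5400rpm drive takes about 11.1 ms.
print(round(full_rotation_ms(5400), 1))
```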
Note: Some large storage arrays (SAN) use a 4k block size internally, too, without necessarily reporting it to the OS. So they will profit from partition alignment as well.
Partition alignment
The classical DOS partition alignment is unfortunate. The classical
C/H/S = X/255/63 pseudo geometry translation scheme, combined with the
convention to start the first partition at C/H/S = 0/1/1 (note that sector
counting traditionally starts with one for reasons that complaining about
would be beyond the scope of this article), translates to a linear (LBA)
offset of 63 -- which is misaligned with any block size in use anywhere
that is larger than 512 bytes.
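To make the offset of 63 concrete, here is a minimal sketch of the usual CHS-to-LBA conversion. It assumes zero-based cylinders and heads and one-based sectors, so the conventional first-partition start is head 1, sector 1 of the first cylinder. The helper names are illustrative and not taken from any tool mentioned in this article.

```python
def chs_to_lba(cylinder, head, sector, heads=255, sectors_per_track=63):
    """Convert a CHS address to a linear sector number.

    Assumes cylinders and heads count from 0 and sectors count from 1.
    """
    return (cylinder * heads + head) * sectors_per_track + (sector - 1)

def is_aligned(lba, alignment_bytes, sector_size=512):
    """True if the byte address of the sector is a multiple of alignment_bytes."""
    return (lba * sector_size) % alignment_bytes == 0

start = chs_to_lba(0, 1, 1)
print(start)                    # 63
print(is_aligned(start, 4096))  # False: 63 * 512 is not a multiple of 4096
```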
The solution suggested in this article is to have the partitions start
at aligned addresses. One way to achieve that is to use a different CHS
scheme; using C/H/S = Y/240/56, for example, would result in 64k (128 sector)
alignment for primary partitions -- except that the first primary
partition would only be 4k aligned at offset LBA 56. As this ensures
a good alignment neither for the first primary partition nor for
the logical partitions in the extended partition, the description
here sticks to the classical CHS translation.
Rather, it uses the fact that partitions don't need to start on
cylinder boundaries but can be moved to start at the next aligned
address.
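The alignment numbers quoted for the alternative geometry can be checked quickly. This is a small sketch, assuming that a partition starting on a cylinder boundary is aligned to the largest power of two that divides the number of sectors per cylinder. The helper name is made up for this example.

```python
def largest_pow2_divisor(n):
    """Largest power of two that divides n (for n > 0)."""
    return n & -n

# Alternative geometry: 240 heads, 56 sectors per track.
print(largest_pow2_divisor(240 * 56))  # 128 sectors, i.e. 64k
# First primary partition at LBA 56 (one track in).
print(largest_pow2_divisor(56))        # 8 sectors, i.e. 4k
# Classical geometry: 255 heads, 63 sectors per track.
print(largest_pow2_divisor(255 * 63))  # 1 sector, i.e. no useful alignment
```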
Sidenote: This issue can NOT be avoided by using other, non-DOS partition table formats, like the GUID Partition Table (GPT). Even when using a GPT partitioning scheme you need to ensure that the partition is aligned properly. The benefit GPT gives you here is that you can use disks larger than 2TB, and as basically all of them are using 4k block sizes internally you would want to follow the guidelines stated in this document. Please be aware that many other OSes cannot access GPT partitions and most BIOSes can't boot from a GPT disk, so check the compatibility before using it.
What alignment?
Before doing the work, you need to decide what alignment
should be used. If the internal block size is known (like for disks
with internal 4k sectors), that one can be chosen.
For SSDs, it is generally not known.
But the friends from Redmond provide guidance here -- as Windows 7
by default uses 1M partition alignment, it is safe to assume that
most drives will be optimized to provide good performance with such
alignment. So in case of doubt, it will never hurt to align partition starts to
1M boundaries.
There's one special case: When internal 4k block sizes were introduced,
some HDD manufacturers actually addressed the classical DOS partition
table misalignment by shifting the logical sector counting by one, so
a start at sector 63 would translate to sector 64 (i.e. internal block
8). Some HDDs were even configurable with a switch to do this shifting
by one. The SATA spec even provides a mechanism for the drives to
report such an offset, so the OS can take the appropriate steps to
optimize performance. To our knowledge not many such drives exist,
and only a subset of them report the offset correctly.
If the drive reports any alignment offset, the Linux kernel in SLE11-SP1
(or later) will report this via the attribute /sys/block/$DEV/alignment_offset
(in bytes).
Some drives will report their internal block size via
/sys/block/$DEV/queue/physical_block_size
though the SSDs tested all reported 512 (bytes) there and do not report
the internal erase block size which almost certainly is larger.
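The two attributes mentioned above can also be read programmatically. The following is a minimal sketch; the function names and the sysfs_root parameter (there only so the code can be tried against a test directory) are assumptions made for this example.

```python
import os

def read_sysfs_int(dev, attr, sysfs_root="/sys/block"):
    """Return the integer stored in <sysfs_root>/<dev>/<attr>, or None if unavailable."""
    path = os.path.join(sysfs_root, dev, attr)
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def describe_alignment(dev, sysfs_root="/sys/block"):
    """Collect the alignment-related attributes the kernel exposes for a disk."""
    return {
        "alignment_offset": read_sysfs_int(dev, "alignment_offset", sysfs_root),
        "physical_block_size": read_sysfs_int(dev, "queue/physical_block_size", sysfs_root),
    }

# Example call, assuming the disk is called sdb:
#   print(describe_alignment("sdb"))
```

A value of None simply means the attribute could not be read, for example on a kernel that does not provide it.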
In summary, going with 1M (2048 sectors) alignment is still a good default choice -- unless we know about an alignment_offset. For convenience, there is a little python script that can be used to calculate recommended partition offsets at http://www.suse.de/~garloff/align_partition.py
Moving partitions is dangerous
You need to do the partition alignment BEFORE creating a filesystem
on the partition, as moving the beginning of a partition will render
existing filesystems inaccessible. Let me repeat that:
DOING PARTITION ALIGNMENT ON PARTITIONS WITH EXISTING FILESYSTEMS WILL
CAUSE A LOSS OF DATA.
So make sure you have working backups (if there is anything
to back up).
This means you should follow the steps described below before using
a disk -- when you want to do an installation to the disk, the
recommendation is to first boot into a rescue system, do the
partitioning there, and then reboot to start the real install process,
using the partitions unchanged and just putting filesystems on them.
(There is an option to change to the text console when running an
installation via YaST and do things there -- but you'd need to
make sure to have YaST reread the partition table eventually.)
For secondary disks this is obviously easier -- you just do the
steps out of a running system.
Note that if ANY partitions on the disk whose partition table you
modify are in use, e.g. because they contain mounted filesystems,
your modifications will only become visible upon reboot, so please
don't run mkfs or similar on changed partitions before rebooting in
such a case.
Please also note that on some hard disks it is possible to use a jumper which internally moves all of the logical 512 byte sectors by one. You need to make sure that this jumper is not set!
Moving partitions with fdisk
The following step-by-step instructions describe how
to interactively move the beginning of partitions using the expert
mode of fdisk. There are other ways (using e.g. parted) that are
not covered here.
You need to have write access to the raw disk device (e.g. /dev/sdb)
to do the following steps -- typically this means you need to be root.
If starting with a vanilla disk, first create partitions of the size
that you like. This can be done using fdisk, parted or more user-
friendly tools such as the YaST partitioner.
WHEN CHANGING PARTITIONS, IT'S HIGHLY RECOMMENDED THAT YOU DOUBLE
CHECK YOU ARE WORKING ON THE INTENDED DISK BEFORE DOING ANYTHING;
THE RISK OF LOSING DATA IS VERY HIGH OTHERWISE. Careful people
always have a paper hardcopy of their partition tables created by
e.g. fdisk -l | lpr so they can recover from such mistakes. Another
way if you exclusively have primary partitions is to save the first
512 bytes from your hard disk (containing the master boot record
and the partition table) to a file using dd or dd_rescue.
Now, let's move the beginning of the partitions to be well aligned. Let's assume your hard disk is called sdb.
- Start fdisk by calling fdisk /dev/sdb
- Print the partition table just to be sure you are looking at the right disk: p
- Go to expert mode: x
- Use the b command to move the beginning of a partition: b
- Choose the partition you want to modify: NUMBER
- fdisk will prompt you for the NEW offset and will have a default
proposal that corresponds to the OLD offset.
NOTE: These offsets are in units of logical sectors (512 bytes)
- Calculate the new offset by rounding the offset UP to the next
number that fits your alignment desire, e.g. the next larger multiple
of 8 if you want to achieve 4k alignment. (You can use the
align_partition.py script to do the math for you.)
- Enter the new offset and press enter
- Repeat for all partitions (go back to step 4)
- When done, leave expert mode: r
- Review the partition table: p It's not yet on the disk, so if you screwed up, now is the time to abort with: q
- When satisfied, write the changes to disk: w
This will also leave fdisk.
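The rounding in the "calculate the new offset" step above can be sketched as follows. This is not the align_partition.py script, which is not reproduced here; it is a minimal illustration that assumes 512-byte logical sectors and ignores any alignment_offset the drive may report.

```python
def align_up(offset, alignment_sectors):
    """Round a sector offset UP to the next multiple of alignment_sectors."""
    return ((offset + alignment_sectors - 1) // alignment_sectors) * alignment_sectors

# Classical first-partition start at sector 63:
print(align_up(63, 8))     # 64, i.e. 4k alignment
print(align_up(63, 2048))  # 2048, i.e. 1M alignment
```

An offset that is already aligned is returned unchanged, so the function is safe to apply to every partition start.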
On the last step please watch for messages of the kernel failing to
read your new partition table. This would mean that some partition
of your disk is in use and that you'd need to reboot to fix up.
However, if this happens, there is a chance that you have actually screwed up
and modified the wrong disk :-( Now's the time to use the "fdisk -l"
printout and restore the old partition table manually ...
If everything went well, you can now start creating filesystems
using mkfs (or mkswap for swap space) or your favorite GUI tool on
the partitions or continue with LVM2 setup.
Additional hints
According to our experience it's not worth fiddling with the
stride= parameter in ext2/3/4 for 4k drives or SSDs.
If you set up a raid system on top of aligned partitions, it helps
to use chunk sizes that are multiples of the internal block
size -- though with a default of 64k in mdadm, this typically
does not need any interventions.
For SSDs, using the deadline or noop IO scheduler tends to provide a minor
increase in performance over CFQ -- though the latter detects the
fact that SSDs are non-rotational devices (in SLE11SP1 or later) and
optimizes rather well for that case as well. So it's a matter of
trading minor performance gains against the ability to do some QoS with
CFQ.
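To see which scheduler is currently active for a disk, the kernel lists the available schedulers in /sys/block/$DEV/queue/scheduler and marks the active one with square brackets. The sketch below only parses such a line; the function name is an assumption made for this example.

```python
def active_scheduler(scheduler_line):
    """Return the scheduler marked with [brackets], or None if none is marked."""
    for name in scheduler_line.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None

print(active_scheduler("noop deadline [cfq]"))  # cfq

# Reading the current setting, assuming the disk is called sdb:
#   with open("/sys/block/sdb/queue/scheduler") as f:
#       print(active_scheduler(f.read()))
```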
You might also achieve minor gains by reducing the readahead size for
SSDs -- though putting it down to very small values will hurt your
linear (streaming) read performance a bit there as well.
For SSDs, one thing that has been observed especially with the first
generation of drives is that their write performance drops dramatically
as soon as the drives run out of empty erase blocks, which happens
after using them for a while.
More modern drives address this by recycling unused (zeroed-out) space
automatically and allowing the OS to tell the SSD about unused blocks using the
TRIM command. SLE11 SP1 ships with wiper.sh which will send down appropriate
TRIM commands by analyzing a filesystem. Note that this should only be used
after having done a backup. It also has certain limitations, e.g. not
supporting LVM or RAID, not supporting some file systems at all (btrfs), and
only supporting offline or read-only trimming for some filesystems.
Using SSDs to put your root filesystem on (and following the instructions
in this article) is possibly the most efficient investment into improving
the interactive experience using your system. Boot times and response
times of the system tend to experience huge improvements.
You can also put the swap partition on an SSD, which will result in making
a swapping system usable for much longer -- though the access patterns of
swapping seem to make the first gen SSDs degrade rather quickly in write
performance.
Mounting filesystems with the relatime mount option -- or better, noatime, if you can validate that it does not hurt your use case -- is very beneficial. This applies in a rather general way though, not only to 4k HDDs or SSDs. relatime is used by default in SLE11 SP1.
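Whether a mounted filesystem already uses one of these options can be read from /proc/mounts, where the fourth field of each line holds the comma-separated mount options. The following sketch parses a single such line; the function name and the sample lines are illustrative.

```python
def atime_mode(mounts_line):
    """Return 'noatime' or 'relatime' if the mount uses it, else None."""
    fields = mounts_line.split()
    options = fields[3].split(",")
    for mode in ("noatime", "relatime"):
        if mode in options:
            return mode
    return None

print(atime_mode("/dev/sdb1 / ext3 rw,relatime,errors=continue 0 0"))  # relatime

# Listing all mounts on a running system:
#   with open("/proc/mounts") as f:
#       for line in f:
#           print(line.split()[1], atime_mode(line))
```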
Resolution
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 7007193
- Creation Date: 11-Nov-2010
- Modified Date: 03-Mar-2020
- SUSE Linux Enterprise Desktop
- SUSE Linux Enterprise Server
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com