Azure Shared Disks with “SLES for SAP / SLE HA 15 SP2”
Microsoft Azure Shared Disks now supports SUSE Linux Enterprise Server for SAP Applications and SUSE Linux Enterprise High Availability Extension 15 SP1 and above, as announced by Microsoft in July 2020. This new capability gives mission-critical applications in the cloud, for example SAP workloads, more flexibility. Microsoft Azure Shared Disks provides high-performance storage to virtual machines running a SUSE Linux Enterprise Server operating system, and SUSE Linux Enterprise High Availability Extension adds fault tolerance on top.
At the concept level, Microsoft Azure Shared Disks is no different from traditional shared disk technologies on premises. This blog post mainly follows the latest SLE HA 15 SP2 Administration Guide to set up the two use cases outlined below (with some parameters tuned to accommodate the Azure environment).
- Active-Passive NFS server
- Active-Active OCFS2 cluster filesystem
NOTE: in this blog, the following acronyms are used:
“SLES” stands for “SUSE Linux Enterprise Server”
“SLES for SAP” stands for “SUSE Linux Enterprise Server for SAP”
“SLE HA” stands for “SUSE Linux Enterprise High Availability Extension”
“SBD” stands for “STONITH Block Device”
Prerequisites – Azure Environment
To check the Azure CLI environment
In a local command line environment, the `azure-cli` version must be 2.3.1 or higher:
suse@tumbleweed:~> az --version
Or, go directly to https://shell.azure.com, which is fine for basic usage but much less flexible for Linux admins.
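If you only need the version string (for example in a script), the following one-liner is a minimal sketch, assuming a reasonably recent azure-cli where the `az version` subcommand is available:
az version --query '"azure-cli"' -o tsv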
To create two virtual machines and the shared disk
1. Get the SUSE image URN from the marketplace
URN=`az vm image list --publisher SUSE -f sles-sap-15-sp2-byos \
    --sku gen2 --all --query "[-1].urn" | tr -d '"'`; echo $URN
This returns the latest URN id, which will be used by the next steps:
SUSE:sles-sap-15-sp2-byos:gen2:2020.09.21
2. Create two virtual machines from scratch
NOTE: all regions that support managed disks now support Azure Shared Disks.
LC=westus2
RG=asd_${LC}_001
az group create --name $RG --location $LC
az ppg create -n ppg_$RG -g $RG -l $LC -t standard
az vm create --resource-group $RG --image $URN --ppg ppg_$RG \
    --admin-username $USER --ssh-key-values ~/.ssh/id_rsa.pub \
    --size Standard_D2s_v3 --name asd-sles15sp2-n1
az vm create --resource-group $RG --image $URN --ppg ppg_$RG \
    --admin-username $USER --ssh-key-values ~/.ssh/id_rsa.pub \
    --size Standard_D2s_v3 --name asd-sles15sp2-n2
3. Create the shared disks and attach them to the virtual machines
Here, we create two shared disks: one as a data disk and one as an SBD disk. An SBD disk with a dedicated I/O path is a must-have for a cluster with a very heavy I/O workload. An SBD device consumes very little disk space, so the cost is low under the Azure pricing model.
DN=asd_shared_disk_152_sbd
az disk create -g $RG -n $DN -z 256 --sku Premium_LRS --max-shares 2
diskId=$(az disk show -g $RG -n $DN --query 'id' -o tsv); echo $diskId
az vm disk attach -g $RG --name $diskId --caching None --vm-name asd-sles15sp2-n1
az vm disk attach -g $RG --name $diskId --caching None --vm-name asd-sles15sp2-n2

DN=asd_shared_disk_152_data
az disk create -g $RG -n $DN -z 256 --sku Premium_LRS --max-shares 2
diskId=$(az disk show -g $RG -n $DN --query 'id' -o tsv); echo $diskId
az vm disk attach -g $RG --name $diskId --caching None --vm-name asd-sles15sp2-n1
az vm disk attach -g $RG --name $diskId --caching None --vm-name asd-sles15sp2-n2
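To double-check that a disk is really shared and attached to both virtual machines, the disk resource can be queried. This is a hedged sketch assuming the `maxShares` and `managedByExtended` properties exposed by the managed disk API at the time of writing:
az disk show -g $RG -n $DN --query '{maxShares:maxShares, attachedTo:managedByExtended}' -o json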
General prerequisites for a basic HA cluster in Azure
1. Update SUSE Linux Enterprise Server
After logging in to the virtual machine, the first thing is to update SLES with the latest patches:
sudo SUSEConnect -r $REGCODE
sudo zypper up -y
sudo reboot
NOTE: The subscription registration code (REGCODE) can be retrieved from scc.suse.com
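To verify the registration afterwards (optional sanity check):
sudo SUSEConnect --status-text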
2. SBD requires a watchdog
On all nodes, enable and load softdog:
sudo modprobe softdog; echo "softdog"|sudo tee /etc/modules-load.d/softdog.conf
NOTE: `softdog` is the only watchdog available in the public cloud.
To verify that softdog is loaded:
sudo sbd query-watchdog
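As an additional quick check, confirm that the kernel module is loaded and the watchdog device node exists (both are generic Linux checks, not SBD-specific):
lsmod | grep -w softdog
ls -l /dev/watchdog*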
3. Prepare SBD Partition on one node
The SBD disk needs only a very small amount of space, 4 MiB in size [REF: SLE HA Guide – SBD Partition].
On one node, use `lsblk` to confirm the correct device name “/dev/sdX” for the corresponding shared disks. The name might change with the OS configuration across reboots; SLES tries to keep device names persistent but does not guarantee it:
sudo lsblk
Therefore, a partition label is intentionally assigned to each partition:
sudo parted /dev/sdc mklabel GPT
sudo parted /dev/sdc mkpart sbd-sles152 1MiB 5MiB
sudo parted /dev/sdd mklabel GPT
sudo parted /dev/sdd mkpart asd-data1 10GiB 20GiB
sudo parted /dev/sdd mkpart asd-data2 20GiB 30GiB
From the other node, verify the disks:
sudo partprobe; sleep 5; sudo ls -l /dev/disk/by-partlabel/
Bootstrap a basic HA cluster
1. Manually set up passwordless ssh for root login among the nodes
On both nodes, run:
sudo ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa <<<y 2>&1 >/dev/null
Then, append the /root/.ssh/id_rsa.pub of each node to /root/.ssh/authorized_keys on the other node (one way to do this is sketched below).
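A minimal sketch of one way to push the keys, run on each node as the regular admin user; it assumes the admin user can already ssh to the peer node (for example via ssh agent forwarding from your workstation) and has passwordless sudo, which is the Azure default:
PEER=asd-sles15sp2-n2   # on node 2, use asd-sles15sp2-n1 instead
sudo cat /root/.ssh/id_rsa.pub | ssh $PEER "sudo tee -a /root/.ssh/authorized_keys >/dev/null"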
Verify that passwordless login works:
asd-sles15sp2-n1:~> sudo ssh asd-sles15sp2-n2
asd-sles15sp2-n2:~> sudo ssh asd-sles15sp2-n1
2. Now, let’s bootstrap the basic cluster
asd-sles15sp2-n1:~> sudo crm cluster init -y -u -s /dev/disk/by-partlabel/sbd-sles152 -A 10.0.0.9
NOTE: To stay focused on the Azure shared disk topic, this blog will not cover how to make the VIP (10.0.0.9) work with Azure Load Balancer.
Wait until Node 1 finishes, then let Node 2 join:
asd-sles15sp2-n2:~> sudo crm cluster join -y -c asd-sles15sp2-n1
To monitor the cluster status:
asd-sles15sp2-n1:~> sudo crm_mon -rR
3. Basic cluster tuning in Azure
STEPS: Modify SBD configuration
In a virtualization environment, the OS usually reboots very fast. To reduce unexpected failover chaos in certain situations, it is highly recommended to change SBD_DELAY_START to “yes” and to adjust “TimeoutSec=” for the systemd sbd.service according to the description in /etc/sysconfig/sbd. The side effect is a delay of more than two minutes during cluster initialization, which also means a longer cluster recovery time (RTO).
sudo augtool -s set /files/etc/sysconfig/sbd/SBD_DELAY_START yes
Either propagate this change manually to all nodes, or let csync2 do it:
sudo csync2 -xv
On all nodes, manually add a systemd drop-in file to change “TimeoutSec=”:
sudo mkdir /etc/systemd/system/sbd.service.d
echo -e "[Service]\nTimeoutSec=144" | sudo tee /etc/systemd/system/sbd.service.d/sbd_delay_start.conf
sudo systemctl daemon-reload
The resulting drop-in file contains:
[Service]
TimeoutSec=144
STEPS: Fine-tune SBD on-disk metadata
The sbd watchdog timeout in the on-disk metadata is used by the sbd daemon to initialize the watchdog driver. The default is 5 seconds. To add robustness against foreseeable hiccups from the public cloud provider during planned maintenance activities, it might be good to enlarge the watchdog timeout to 60 seconds and the associated `msgwait` to 120 seconds. [REF: SLE HA Guide – Setting Up SBD with Devices]
Changing the SBD on-disk metadata requires re-creating the SBD device.
On one node:
SBD_DEVICE=/dev/disk/by-partlabel/sbd-sles152
sudo sbd -d ${SBD_DEVICE} -1 60 -4 120 create
To verify the SBD device metadata:
sudo sbd -d ${SBD_DEVICE} dump
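As a side note, the `TimeoutSec=144` drop-in configured earlier is presumably meant to stay above this `msgwait` value (144 ≈ 1.2 × 120). A hedged helper to cross-check the two, assuming the usual `sbd dump` output where the msgwait line ends with the value in seconds:
# derive a suggested sbd.service start timeout from the on-disk msgwait (1.2 is only a safety margin)
MSGWAIT=$(sudo sbd -d ${SBD_DEVICE} dump | awk '/msgwait/ {print $NF}')
echo "sbd.service TimeoutSec should be >= $(( MSGWAIT * 12 / 10 )) (currently 144)"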
To re-initialize the watchdog driver, you must restart the sbd daemon:
sudo crm cluster run "crm cluster restart"
STEPS: SBD stonith test
After the cluster is bootstrapped and running, we can play with SBD a bit. [REF: SLE HA Guide – Testing SBD and Fencing]:
SBD_DEVICE=/dev/disk/by-partlabel/sbd-sles152
sudo sbd -d ${SBD_DEVICE} message asd-sles15sp2-n2 reset
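To inspect the per-node message slots on the SBD device (for example, before and after the test):
sudo sbd -d ${SBD_DEVICE} list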
STEPS: Fine-tune the corosync timeout
Similar to SBD, to add robustness against foreseeable hiccups from the public cloud provider during planned maintenance activities, it might be good to enlarge the corosync token timeout. However, be aware that this sacrifices the server recovery time (RTO) for a real permanent failure, since corosync will generally take longer to detect the failure. With that, it probably makes sense to change the corosync `token` timeout to 30 seconds and the associated `consensus` timeout to 36 seconds.
Edit corosync.conf on all nodes of the cluster. REF: `man corosync.conf`:
sudo vi /etc/corosync/corosync.conf
token: 30000
consensus: 36000
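For orientation, both values belong to the totem section of /etc/corosync/corosync.conf; a trimmed sketch, with all other options left exactly as generated by `crm cluster init`:
totem {
        version: 2
        # ... other options unchanged ...
        token: 30000
        consensus: 36000
}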
Either propagate this change manually to all nodes, or let csync2 do it:
sudo csync2 -xv
On one node, let all nodes reload the corosync config:
sudo corosync-cfgtool -R
To verify the change:
sudo /usr/sbin/corosync-cmapctl |grep -w -e totem.token -e totem.consensus
Active-Passive NFS server
WARNING: This blog shows only a minimal implementation of an active-passive NFS cluster. If applications need to reclaim locks to work properly, a more advanced NFS solution should be implemented.
1. Prepare lvm and the filesystem
On one node, execute the following steps.
STEPS: Modify the lvm2 configuration
To edit:
sudo vi /etc/lvm/lvm.conf
Set `system_id_source = "uname"`. It is "none" by default, which means LVM will not operate on any volume group that has been created with a system ID.
Set `auto_activation_volume_list = []` to prevent any volume group from being activated automatically at the OS level. The volume group of HA LVM must be activated by Pacemaker at the cluster level instead. NOTE: you can ignore this setting for any `shared` volume group (aka `lvmlockd` volume group), since such a volume group requires the cluster components and cannot be activated without the cluster stack.
To verify:
sudo lvmconfig global/system_id_source
sudo lvmconfig activation/auto_activation_volume_list
system_id_source="uname"
auto_activation_volume_list=[]
Either propagate this change manually to all nodes, or let csync2 do it:
sudo csync2 -xv
STEPS: Create the logical volume
sudo pvcreate /dev/disk/by-partlabel/asd-data1
sudo vgcreate vg1 /dev/disk/by-partlabel/asd-data1
sudo lvcreate -l 50%VG -n lv1 vg1
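Because `system_id_source` is set to "uname", the new volume group should automatically carry this host's system ID. A quick sanity check (the `systemid` report field name is assumed per the lvm reporting options):
sudo vgs -o+systemid vg1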
STEPS: Initialize the filesystem superblock
sudo mkfs.xfs /dev/vg1/lv1
2. Fine-tune NFS server configuration
To reduce the overall failover time reasonably, adjust NFSV4LEASETIME; change and verify /etc/sysconfig/nfs on all nodes:
sudo augtool -s set /files/etc/sysconfig/nfs/NFSV4LEASETIME 60
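To verify the change on each node:
sudo grep -w NFSV4LEASETIME /etc/sysconfig/nfs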
3. Bootstrap HA NFS server
On one node, run the following commands:
sudo crm configure \
    primitive p_nfsserver systemd:nfs-server \
    op monitor interval=30s

sudo crm configure \
    primitive p_vg1 LVM-activate \
    params vgname=vg1 vg_access_mode=system_id \
    op start timeout=90s interval=0 \
    op stop timeout=90s interval=0 \
    op monitor interval=30s timeout=90s

sudo crm configure \
    primitive p_fs Filesystem \
    op monitor interval=30s \
    op_params OCF_CHECK_LEVEL=20 \
    params device="/dev/vg1/lv1" directory="/srv/nfs" fstype=xfs
NOTE: The `Filesystem` resource agent will create the 'directory=' path if it does not exist yet.
sudo crm configure \
    primitive p_exportfs exportfs \
    op monitor interval=30s \
    params clientspec="*" directory="/srv/nfs" fsid=1 \
    options="rw,mp" wait_for_leasetime_on_stop=true
NOTE: By design, Pacemaker might distribute `p_vg1 p_fs p_nfsserver p_exportfs admin-ip` across different nodes before the group resource is created below. Resource failures might be reported accordingly; they are false positives.
sudo crm configure group g_nfs p_vg1 p_fs p_nfsserver p_exportfs admin-ip
To clean up known false-positive failures:
sudo crm_resource -C
To verify the NFS server:
suse@asd-sles15sp2-n2:~> sudo showmount -e asd-sles15sp2-n1
suse@asd-sles15sp2-n2:~> sudo showmount -e asd-sles15sp2-n2
Export list for asd-sles15sp2-n1:
/srv/nfs *
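As an optional smoke test, the export can be mounted from any client in the same virtual network through the cluster VIP (10.0.0.9) configured earlier; this is a hypothetical example and the mount options are only illustrative:
# mount via the VIP, write a test file, then clean up (run on a client, not on the cluster nodes)
sudo mount -t nfs -o vers=4.1 10.0.0.9:/srv/nfs /mnt
echo hello | sudo tee /mnt/smoke-test
sudo umount /mnt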
Active-Active OCFS2 cluster filesystem
1. Launch `dlm` and `lvmlockd` daemons
This must be done before creating a `shared` logical volume on one node [REF: SLE HA Guide]:
sudo crm configure \
    primitive dlm ocf:pacemaker:controld \
    op monitor interval=60 timeout=60

sudo crm configure \
    primitive lvmlockd lvmlockd \
    op start timeout=90 interval=0 \
    op stop timeout=90 interval=0 \
    op monitor interval=30 timeout=90

sudo crm configure group g_ocfs2 dlm lvmlockd

sudo crm configure clone c_ocfs2 g_ocfs2 meta interleave=true
2. Prepare lvm2 `shared` disks on one node
sudo ls -l /dev/disk/by-partlabel/
sudo pvcreate /dev/disk/by-partlabel/asd-data2
sudo vgcreate --shared vg2-shared /dev/disk/by-partlabel/asd-data2
sudo lvcreate -an -l 50%VG -n lv1 vg2-shared

sudo crm configure \
    primitive p_vg_shared LVM-activate \
    params vgname=vg2-shared vg_access_mode=lvmlockd activation_mode=shared \
    op start timeout=90s interval=0 \
    op stop timeout=90s interval=0 \
    op monitor interval=30s timeout=90s

sudo crm configure modgroup g_ocfs2 add p_vg_shared
3. Prepare ocfs2 on one node
sudo mkfs.ocfs2 /dev/vg2-shared/lv1
4. Finally run ocfs2 on all nodes
sudo crm configure \
    primitive p_ocfs2 Filesystem \
    params device="/dev/vg2-shared/lv1" directory="/srv/ocfs2" fstype=ocfs2 \
    op monitor interval=20 timeout=40 \
    op_params OCF_CHECK_LEVEL=20 \
    op start timeout=60 interval=0 \
    op stop timeout=60 interval=0

sudo crm configure modgroup g_ocfs2 add p_ocfs2
To write a text file to OCFS2 on the current node:
asd-sles15sp2-n1:~> echo "'Hello' from `hostname`" | sudo tee /srv/ocfs2/hello_world
To verify from the other node:
asd-sles15sp2-n2:~> cat /srv/ocfs2/hello_world
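To confirm the OCFS2 filesystem is mounted on all nodes at once, `crm cluster run` (already used above for restarting the cluster) can be reused:
sudo crm cluster run "grep -w ocfs2 /proc/mounts"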
End of the exercise, enjoy!
Comments
Thank you for this great article.
I just wanted to mention that you need to update the volume group with the system ID so that the cluster can detect it and move it, using the command below:
vgchange --systemid $(uname -n) vg1