SAP HANA Cluster – automated OS patching with SUSE Manager and Salt states
The following article has been contributed by Bo Jin, Sales Engineer at SUSE and Linux Consultant.
Challenge and motivation
SUSE Linux Enterprise Server for SAP Applications is not just a great product for running SAP workloads. SUSE also provides best practice guides for building reliable SAP HANA SR high availability clusters based on SUSE Linux Enterprise Server for SAP Applications.
Customers using clusters often struggle with patching SAP systems that run SUSE Linux Enterprise Server in a Pacemaker cluster. The main reason is the need to reboot the operating system (OS) after a kernel patch has been installed. Even with SUSE Live Patching in use, you still need to apply all outstanding patches and reboot the OS from time to time within a scheduled maintenance window. During that maintenance window, as an SAP Basis administrator, you need to run cluster commands to move, stop and start resources.
But what if you could automate the OS patching by using SUSE Manager and Salt states while keeping SAP HANA downtime short?
I have developed several Salt execution and state modules which interact with the Pacemaker cluster configuration and management tools crm and crm_mon, and with the SAPHanaSR-showAttr command, in order to query the cluster status.
These Salt modules will be used in Salt states which in turn enable a fully automated patching process for SAP HANA SR scale-up clusters.
The solution in brief
The SUSE Best Practices guide SAP HANA System Replication Scale-Up – Performance Optimized Scenario describes cluster maintenance in quite some detail. If these steps are not automated, they must be executed manually. The Salt states, modules, runners and reactors that I have developed, and that are described here, follow these best practice instructions exactly.
Some of the “golden rules” of working with Pacemaker clusters I strictly follow are:
- “Never change a cluster if the cluster state is not IDLE”
- “Don’t change or configure an SAP HANA master-slave cluster resource if the system replication status is not SOK.”
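These rules translate directly into pre-flight checks that run before anything touches the cluster. The following is only a minimal sketch of the idea, assuming that the bocrm check functions take no arguments and make the state run fail when the condition is not met; the real states are in the GitHub repository linked at the end of this article:

```yaml
# Minimal sketch: stop here if the cluster is not idle or
# if SAP HANA system replication is not in sync (SOK).
preflight_cluster_idle:
  module.run:
    - name: bocrm.is_cluster_idle

preflight_sr_status_is_sok:
  module.run:
    - name: bocrm.check_sr_status
    - require:
      - module: preflight_cluster_idle
```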
Based on these rules, the patch workflow has been tested as described below.
The patching workflow
The following section explains the patching workflow at a glance. For a two-node SAP HANA scale-up cluster, the "patch diskless node" step (Stage 2) is not needed, and you can continue directly with the primary node.
Stage 1: Patch secondary site
- Run the Salt state on all member nodes of the cluster (a simplified sketch of such a state follows this list):
# salt "hana-*" state.apply myhana
- The Salt module will detect the node roles as primary and secondary – and diskless_node in case of a three-node cluster.
- Start with the secondary node.
- The SAP HANA SR scale-up cluster master-slave resource will be set into maintenance mode.
- The secondary node will be patched and rebooted.
- After the secondary node has been restarted, Pacemaker will be started.
- The master-slave resource will be activated (unset maintenance mode).
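To give an impression of what this looks like in practice, here is a heavily simplified sketch of the secondary-node part of such a myhana state. The state IDs and the plain zypper call are illustrative assumptions of mine; the actual states in the repository detect the node role first and drive patching and reboot through SUSE Manager jobs:

```yaml
# Hypothetical excerpt, Stage 1: runs only on the node detected as secondary.
set_msl_resource_maintenance:
  module.run:
    - name: bocrm.set_msl_maintenance        # put the master-slave resource into maintenance mode

patch_secondary_node:
  cmd.run:
    - name: zypper --non-interactive patch   # simplified; the real workflow schedules a SUSE Manager patch job
    - require:
      - module: set_msl_resource_maintenance # never patch before maintenance mode is active
```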
Stage 2: Patch diskless node (optional)
- Start patching the diskless_node in case of a diskless setup.
- diskless_node will be rebooted after patching.
Stage 3: Patch primary site
- Re-discover the node roles as primary, secondary and diskless_node in case of a three-node cluster.
- Execute Salt states on the primary node.
- Move the master-slave resource to the other node which is secondary at the moment.
- The SAP HANA SR scale-up cluster master-slave resource will be set into maintenance mode.
- The old primary node will be patched and rebooted.
- After the old primary node has been restarted, Pacemaker will be started.
- Clear the Pacemaker cli-ban location constraint so that this node can be used again as the new secondary site (see the sketch after this list).
- The master-slave resource will be activated (unset maintenance mode).
- The old primary has become the new secondary.
- Now you are finished 😀.
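The Stage 3 specific part, moving the master-slave resource away from the old primary and removing the cli-ban location constraint left behind by the move, could look roughly like this. Again, the state IDs are invented and I assume here that the bocrm functions work out the resource name and target node themselves; check the repository for the real signatures:

```yaml
# Hypothetical excerpt, Stage 3: fail over to the current secondary, then clean up.
move_msl_to_current_secondary:
  module.run:
    - name: bocrm.move_msl_resource           # the move leaves a cli-ban location constraint behind

remove_cli_ban_constraint:
  module.run:
    - name: bocrm.delete_cli_ban_rule         # remove the constraint so the node can act as secondary again
    - require:
      - module: move_msl_to_current_secondary
```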
The workflow uses Salt's reactor and requisite systems.
- Requisites: The Salt requisite system creates relationships between states, providing an easy way to define inter-dependencies between them.
- Reactors: Salt’s Reactor system allows Salt to trigger actions in response to an event.
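As an illustration, a reactor is wired up on the Salt master by mapping an event tag to one or more reactor SLS files. The event tag and file name below are made-up examples of mine, not necessarily the ones used in the repository:

```yaml
# /etc/salt/master.d/reactor.conf (illustrative example):
# run a reactor SLS whenever a node reports that its patch job has finished.
reactor:
  - 'suma/patch/hana/finished':
    - /srv/salt/reactor/patch_next_node.sls
```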
Feel free to adjust the reactor and requisites to map the workflow steps to your needs.
High level architecture
Salt modules for the SAP HANA System Replication scale-up cluster
My colleagues from SUSE development created a great set of Salt execution modules, called salt-shaptools, that allows us to set up and configure new SAP HANA and NetWeaver clusters. In order to automate the patching of the cluster nodes, I have developed a few additional Salt modules that use crm, crm_mon and SAPHanaSR-showAttr to query the status of the SAP HANA cluster resources and nodes prior to patching the OS.
These execution modules are:
- bocrm.check_if_maintenance
- bocrm.check_if_nodes_online
- bocrm.check_sr_status
- bocrm.delete_cli_ban_rule
- bocrm.find_cluster_nodes
- bocrm.get_dc
- bocrm.get_msl_resource_info
- bocrm.if_cluster_state_idle
- bocrm.is_cluster_idle
- bocrm.is_quorum
- bocrm.move_msl_resource
- bocrm.off_msl_maintenance
- bocrm.pacemaker
- bocrm.patch_diskless_node
- bocrm.set_msl_maintenance
- bocrm.set_off_msl_maintenance
- bocrm.set_on_msl_maintenance
- bocrm.start_pacemaker
- bocrm.stop_pacemaker
- bocrm.sync_status
- bocrm.wait_for_cluster_idle
SUSE Manager in action
In order to create patch and reboot jobs, I also created Salt runner modules which call the SUSE Manager API. The main advantage of using SUSE Manager, instead of running the patch commands directly via the cmd state module (cmd.run), is that, for audit and compliance reasons, we keep a record of the patch jobs. These runner modules are:
- checkjob_status.py
- patch_hana.py
- reboot_host.py
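Such a runner is typically kicked off either manually with salt-run or from a reactor SLS. The sketch below assumes a runner function patch_hana.patch that accepts a target argument; the actual function names and parameters are documented in the repository:

```yaml
# /srv/salt/reactor/patch_next_node.sls (illustrative example):
# hand over to the patch_hana runner for the minion that fired the event.
invoke_patch_runner:
  runner.patch_hana.patch:
    - args:
      - target: {{ data['id'] }}
```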
More information
More detailed information about the SaltStack configurations, modules and states which I have created for a fully automated patching of SAP HANA Database Scale-up clusters can be found in my GitHub repository at https://github.com/bjin01/salt-sap-patching which is licensed under GPL v3.0. Long live Salt, SUSE Manager and Pacemaker 😀 !