HANA Scale-Up HA with System Replication & Automated Failover using SUSE HAE on SLES 12 SP 3 – Part 3
This is a guest blog reposted with the permission of the author Dennis Padia, SAP Technical Consultant & Architect at Larsen and Toubro Infotech.
Big thanks to Bernd Schubert from SUSE for proof-reading the blog.
This blog describes how to install and configure SUSE HAE to automate the failover process in SAP HANA system replication. SUSE HAE and the SAP HANA database integration (the SAPHanaSR resource agents) are part of SUSE Linux Enterprise Server for SAP Applications.
Procedure (High Level Steps)
- Install the High Availability pattern and the SAPHanaSR Resource Agents
- Basic Cluster Configuration
- Configure Cluster Properties and Resources
If you have a separate OS team, have a round-table discussion with them on the cluster configuration and setup, as they know how to perform it. But if you are a pure Basis resource with little OS knowledge, I recommend reading more on cluster setup first to get more insight.
Installation of SLES High Availability Extension
Download the installation media and mount it. Install SUSE HAE on both nodes using the following command:
# zypper in -t pattern ha_sles
This installs several rpm packages required for SUSE HAE. The installation must be performed on both primary and secondary HANA servers.
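To confirm the pattern is installed on a node, you can, for example, query zypper:
# zypper info -t pattern ha_sles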
Create STONITH Device
STONITH (shoot the other node in the head) is the way fencing is implemented in SUSE HAE. If a cluster member is not behaving normally, it must be removed from the cluster; this is referred to as fencing. A cluster without a STONITH mechanism is not supported by SUSE. There are multiple ways to implement STONITH, but in this blog, STONITH Block Devices (SBD) are used.
Create a small LUN (1 MB) on the storage array that is shared between the cluster members. Map this LUN to both primary and secondary HANA servers through storage ports. Make note of the SCSI identifier of this LUN (the SCSI identifier should be the same on both primary and secondary HANA servers). It is possible to add more than one SBD device in a cluster for redundancy. If the two HANA nodes are installed on separate storage arrays, an alternate method such as IPMI can be used for implementing STONITH.
Refer to the SUSE Linux Enterprise High Availability Extension SLE HA Guide for best practices for implementing STONITH. The validation of this reference architecture has been performed using shared storage and SBD for STONITH implementation.
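As an illustration only (using the SCSI identifier from this setup; adjust it to your own LUN), the shared device can be located and initialized as an SBD device from one node as shown below. If no hardware watchdog is available, a software watchdog module such as softdog has to be loaded as well. Afterwards, the configured timeouts can be displayed with the dump command that follows.
# ls -l /dev/disk/by-id/ | grep 360000970000197700209533031354139
# sbd -d /dev/disk/by-id/scsi-360000970000197700209533031354139 create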
# sbd -d <shared lun> dump
The timeout parameters shown above are the defaults. You can change them, but it is advisable to do so only if you encounter issues in the cluster or are guided to do so by SAP/SUSE.
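If different timeouts are recommended for your environment (for example for multipathed or stretched storage), they can be set when re-initializing the device. The values below are purely illustrative; msgwait (-4) should be at least twice the watchdog timeout (-1):
# sbd -d /dev/disk/by-id/<SBD device> -4 180 -1 90 create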
Now the resource agents for controlling SAP HANA system replication need to be installed on both cluster nodes.
# zypper in SAPHanaSR SAPHanaSR-doc
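To confirm that the packages and the resource agents are available on both nodes, you can for example run:
# rpm -q SAPHanaSR SAPHanaSR-doc
# crm ra info ocf:suse:SAPHana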
Configure SUSE HAE on Primary HANA Server
These steps cover the basic configuration of SUSE HAE on the primary HANA server. Start the configuration by running the following command:
# sleha-init
- /root/.ssh/id_rsa already exists – overwrite? [y/N]: Type N
- Network Address to bind: Provide the subnet of the replication network
- Multicast Address: Type the multicast address or leave the default value if using unicast
- Multicast Port: Leave the default value or type the port that you want to use
- Do you wish to use SBD? [y/N]: Type y
- Path to storage device: Type the SCSI identifier of the SBD device created in the step Create STONITH Device (/dev/disk/by-id/scsi-360000970000197700209533031354139)
- Are you sure you want to use this device [y/N]: Type y
Add Secondary HANA Server to the Cluster
To add the secondary HANA server to the cluster configured on the primary HANA server, run the following command on the secondary HANA server as the root user.
# sleha-join
- /root/.ssh/id_rsa already exists – overwrite? [y/N]: Type N
- IP address or hostname of existing node: Enter the primary node replication IP address
After you enter the command, the prompts look like the figure below.
This completes the basic cluster configuration on the primary and secondary HANA servers.
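At this point you can verify from either node that both cluster members have joined, for example with:
# crm status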
After all of the previous steps are finished, log in to Hawk (HA Web Konsole) using the URL ‘https://<Hostname of Primary or Secondary Server>:7630’ with the user ID ‘hacluster’ and password ‘linux’.
The default password can be changed later. You should see the cluster members ‘Server1’ and ‘Server2’ online.
NOTE: Due to a firewall, port 7630 might not be open from your desktop or RDP server. In that case, either open the port or forward it through PuTTY/SSH to your localhost while logging in to the server (see the example below).
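For example, with OpenSSH the Hawk port can be forwarded to your local machine (in PuTTY the equivalent setting is under Connection > SSH > Tunnels); the hostname below is a placeholder:
# ssh -L 7630:localhost:7630 root@<primary or secondary hostname>
Hawk is then reachable in your browser at https://localhost:7630.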
NOTE: In the screen above you can see “admin-ip”, which is the virtual IP service configured to manage the cluster. It might not be present yet at the time you perform the configuration.
After setting up the cluster on SLES 11, you need to set the parameter below for the cluster to work without issues.
no-quorum-policy = ignore (obsolete, only applicable to SLES 11)
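On SLES 11 only, this property would be set from the crm shell, for example:
# crm configure property no-quorum-policy="ignore"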
IMPORTANT STEP: On SLES 12 the no-quorum-policy=ignore workaround is obsolete, but you have to make sure that the following values are set in the /etc/corosync/corosync.conf file:
# Please read the corosync.conf.5 manual page
totem {
    version: 2
    token: 5000
    consensus: 7500
    token_retransmits_before_loss_const: 6
    secauth: on
    crypto_hash: sha1
    crypto_cipher: aes256
    clear_node_high_bit: yes

    interface {
        ringnumber: 0
        bindnetaddr: **IP-address-for-heart-beating-for-the-current-server**
        mcastport: 5405
        ttl: 1
    }

    transport: udpu
}

logging {
    fileline: off
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on

    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

nodelist {
    node {
        ring0_addr: **ip-node-1**
        nodeid: 1
    }

    node {
        ring0_addr: **ip-node-2**
        nodeid: 2
    }
}

quorum {
    # Enable and configure quorum subsystem (default: off)
    # see also corosync.conf.5 and votequorum.5
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
This change works like the no-quorum-policy=ignore option on SLES 11.
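After editing /etc/corosync/corosync.conf on both nodes, the cluster stack has to be restarted for the change to take effect (for example with ‘crm cluster stop’ followed by ‘crm cluster start’ on each node, during a maintenance window). The two-node quorum settings can then be verified with:
# corosync-quorumtool -s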
SAPHanaSR Configuration
The SAPHanaSR package can be configured using the Hawk wizard. Follow the procedure in the SLES for SAP Applications documentation for the configuration steps. The Hawk wizard requires the following parameters:
- SAP SID: SAP System Identifier. The SAP SID is always a 3-character alphanumeric string.
- SAP Instance Number: The instance number must be a two-digit number including a leading zero.
- Virtual IP Address: The Virtual IP Address will be configured on the host where the primary database is running.
Navigate to “Wizards” > SAP > SAP HANA SR Scale-Up Performance Optimized
The virtual IP address is not the client IP of the HANA server. Make sure you provide an available IP address from your landscape, as this IP address will later be registered in DNS with a virtual host name.
This virtual host name will be used by your SAP application servers to connect to the HANA database. The advantage of connecting the SAP application via the virtual host name (virtual IP) is that on failover of the HANA database, the virtual IP also migrates to the secondary node, which automatically reconnects your SAP application to the HANA database.
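As an illustration only, an ABAP application server stores its database connection against the virtual host name with hdbuserstore; this is run as the <sid>adm user of the application server, and the key name, SQL port (commonly 3<instance number>15 for a single-container database, i.e. 30015 for instance 00) and credentials are placeholders for your landscape:
hdbuserstore SET DEFAULT <virtual hostname>:30015 <database user> <password>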
The parameters below play a vital role depending on the scenario you are deploying.
| Parameter | Performance Optimized | Cost Optimized | Multi-Tier |
| --- | --- | --- | --- |
| PREFER_SITE_TAKEOVER | True | False | True/False |
| AUTOMATED_REGISTER | False/True | False/True | False |
| DUPLICATE_PRIMARY_TIMEOUT | 7200 | 7200 | 7200 |
| Parameter | Description |
| --- | --- |
| PREFER_SITE_TAKEOVER | Defines whether the resource agent should prefer to take over to the secondary instance instead of restarting the failed primary locally. |
| AUTOMATED_REGISTER | Defines whether a former primary should be automatically registered as secondary of the new primary. With this parameter you can adapt the level of system replication automation. If set to false, the former primary must be registered manually. The cluster will not start this SAP HANA RDBMS until it is registered, to avoid dual-primary situations. |
| DUPLICATE_PRIMARY_TIMEOUT | Time difference needed between two primary timestamps if a dual-primary situation occurs. If the time difference is less than this time gap, the cluster holds one or both instances in a “WAITING” status. This gives an admin the chance to react to a failover. If the complete node of the former primary crashed, the former primary will be registered after the time difference has passed. If “only” the SAP HANA RDBMS crashed, the former primary will be registered immediately. After this registration to the new primary, all data will be overwritten by the system replication. |
Verify cluster resources and click on “Apply”
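If the Hawk wizard is not available, the equivalent resources can also be created on the crm shell. The sketch below follows the SUSE SAPHanaSR setup guide and assumes SID SLH and instance number 00 as used in this blog; the clone name cln_SAPHanaTopology_SLH_HDB00, the operation timeouts and the virtual IP placeholder are assumptions you have to adapt to your landscape (the constraints are covered in the Constraints section further below).

primitive rsc_SAPHanaTopology_SLH_HDB00 ocf:suse:SAPHanaTopology \
    op monitor interval="10" timeout="600" \
    op start interval="0" timeout="600" \
    op stop interval="0" timeout="300" \
    params SID="SLH" InstanceNumber="00"
clone cln_SAPHanaTopology_SLH_HDB00 rsc_SAPHanaTopology_SLH_HDB00 \
    meta clone-node-max="1" interleave="true"
primitive rsc_SAPHana_SLH_HDB00 ocf:suse:SAPHana \
    op start interval="0" timeout="3600" \
    op stop interval="0" timeout="3600" \
    op promote interval="0" timeout="3600" \
    op monitor interval="60" role="Master" timeout="700" \
    op monitor interval="61" role="Slave" timeout="700" \
    params SID="SLH" InstanceNumber="00" PREFER_SITE_TAKEOVER="true" \
        DUPLICATE_PRIMARY_TIMEOUT="7200" AUTOMATED_REGISTER="false"
ms msl_SAPHana_SLH_HDB00 rsc_SAPHana_SLH_HDB00 \
    meta clone-max="2" clone-node-max="1" interleave="true"
primitive rsc_ip_SLH_HDB00 ocf:heartbeat:IPaddr2 \
    op monitor interval="10s" timeout="20s" \
    params ip="<virtual IP address>"

Such a snippet is typically saved to a file and loaded with ‘crm configure load update <file>’.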
In the status screen you will see that the resources are registered and that the virtual IP is now assigned to the primary server, i.e. 4021.
After configuring Performance Optimized system replication between the two nodes in the cluster, the graphical representation of the configuration looks as shown below.
rsc_SAPHanaTopology_SLH_HDB00 – Analyzes the SAP HANA system replication topology. This resource agent (RA) analyzes the SAP HANA topology and “sends” all findings via node status attributes to all nodes in the cluster. These attributes are used by the SAPHana RA to control the SAP HANA databases. In addition, it starts and monitors the local saphostagent.
rsc_SAPHana_SLH_HDB00 – Manages the two HANA databases in the system replication. In our case, these are the HANA databases residing on the XXXXXXXXX4021 and YYYYYYYY4022 servers.
rsc_ip_SLH_HDB00 – This Linux-specific resource manages alias (virtual) IP addresses. When the resource is created, the virtual IP is attached to the primary site, and it moves to the secondary site in case of a failover.
Constraints
As you can see in the graphical representation of the resource configuration in the cluster, some resource constraints have been defined. They specify:
- on which cluster nodes resources can run
- in which order resources will be loaded
- what other resources a specific resource depends on
Below are the two constraints that are generated automatically when the resources are registered. If they are not generated, you can define them manually.
Colocation Constraints – col_saphana_ip_SLH_HDB00
A colocation constraint tells the cluster which resources may or may not run together on a node.
To create a colocation constraint, specify an ID, select the resources between which to define the constraint, and add a score. The score determines the location relationship between the resources:
- Positive values: The resources should run on the same node.
- Negative values: The resources should not run on the same node.
- Score of INFINITY: The resources have to run on the same node.
- Score of -INFINITY: The resources must not run on the same node.
An example for use of a colocation constraint is a Web service that depends on an IP address. Configure individual resources for the IP address and the Web service, then add a colocation constraint with a score of INFINITY. It defines that the Web service must run on the same node as the IP address. This also means that if the IP address is not running on any node, the Web service will not be permitted to run.
Here, the msl_SAPHana_SLH_HDB00 and rsc_ip_SLH_HDB00 resources should run together. In case of a failover, this constraint checks where the master resource is running, and the virtual IP resource runs on the same node.
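In crm syntax, this colocation constraint looks roughly as follows; the score of 2000 (instead of INFINITY) follows the SUSE setup guide, so that a failure of the IP resource alone does not force a takeover of the master:

colocation col_saphana_ip_SLH_HDB00 2000: rsc_ip_SLH_HDB00:Started msl_SAPHana_SLH_HDB00:Master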
Order Constraints – ord_SAPHana_SLH_HDB00
Ordering constraints define the order in which resources are started and stopped.
To create an order constraint, specify an ID, select the resources between which to define the constraint, and add a score. The score determines the location relationship between the resources: The constraint is mandatory if the score is greater than zero, otherwise it is only a suggestion. The default value is INFINITY. Keeping the option Symmetrical set to Yes (default) defines that the resources are stopped in reverse order.
An example for use of an order constraint is a Web service (e.g. Apache) that depends on a certain IP address. Configure resources for the IP address and the Web service, then add an order constraint that defines that the IP address is started before Apache is started.
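In this setup, the order constraint ensures that the SAPHanaTopology clone is started before the SAPHana master/slave resource. In crm syntax it looks roughly like this, assuming the clone name cln_SAPHanaTopology_SLH_HDB00 created by the wizard; the SUSE setup guide uses an optional ordering here:

order ord_SAPHana_SLH_HDB00 Optional: cln_SAPHanaTopology_SLH_HDB00 msl_SAPHana_SLH_HDB00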
Do’s and Don’ts
In your project, you should:
- Define STONITH before adding other resources to the cluster
- Do intensive testing (see the monitoring sketch after this list)
- Tune the timeouts of the operations of SAPHana and SAPHanaTopology
- Start with PREFER_SITE_TAKEOVER=”true”, AUTOMATED_REGISTER=”false” and DUPLICATE_PRIMARY_TIMEOUT=”7200”
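During the testing mentioned above, the system replication and cluster attributes can be watched on either node, for example with the SAPHanaSR-showAttr tool shipped with the SAPHanaSR package, together with crm_mon:
# SAPHanaSR-showAttr
# crm_mon -r -1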
In your project, avoid:
- Rapidly changing/changing back the cluster configuration, such as setting nodes to standby and online again or stopping/starting the master/slave resource
- Creating a cluster without proper time synchronization or with unstable name resolution for hosts, users and groups
- Adding location rules for the clone, master/slave or IP resource; only the location rules mentioned in this setup guide are allowed
- “Migrating” or “moving” resources in crm-shell, HAWK or other tools, as this would add client-prefer location rules; this activity is completely forbidden
Regards,
Dennis Padia.