Dynamically changing the Cluster Size of a UDPU HAE Cluster
This document (7023669) is provided subject to the disclaimer at the end of this document.
Environment
Situation
Normally, by default, the HAE corosync cluster is configured to use multicast for corosync communication. One advantage of multicast is that nodes simply join the multicast group, so they can join and leave the cluster without changing corosync.conf.
Some setups in some environments require the use of UDPU (unicast UDP), which introduces a set of node definitions in the
nodelist
section of the corosync.conf file. As such, the members of the cluster are hard-coded into one of the main configuration files.
Consequently, in such a scenario any addition or removal of cluster nodes requires a reload or restart of the corosync service; the latter would effectively take the complete cluster down.
Resolution
With the command
corosync-cfgtool -R
corosync on all members of the cluster can be instructed to reread the configuration.
The machines in the example are
oldhanaa1
oldhanaa2
oldhanad1
oldhanad2
The starting point is a working cluster comprising the nodes
oldhanad1
oldhanad2
then to add
oldhanaa1
oldhanaa2
and then to remove
oldhanad1
oldhanad2
Essentially it is a move from a 2-node cluster -> 4-node cluster -> 2-node cluster. This is not an ideal example, as an odd number of cluster nodes (e.g. 3 or 5) is recommended, but it was picked because it was easy to implement in an existing test setup.
The cluster has fencing via SBD, with all nodes able to reach the device.
All nodes are configured with two rings; this is not a requirement, but demonstrates that the procedure works with two rings as well.
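Whether a node can reach the SBD device can be checked, for example, as follows (the device path below is only a placeholder for the actual SBD device):
sbd -d /dev/disk/by-id/<sbd-device> list
# every cluster node should be able to list the message slots on the device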
The starting configuration on
oldhanad1
oldhanad2
would be
totem {
    version: 2
    secauth: off
    crypto_hash: sha1
    crypto_cipher: aes256
    cluster_name: nirvanacluster
    clear_node_high_bit: yes
    token: 10000
    consensus: 12000
    token_retransmits_before_loss_const: 10
    join: 60
    max_messages: 20
    interface {
        ringnumber: 0
        bindnetaddr: 10.162.192.0
        mcastport: 5405
        ttl: 1
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.128.0
        mcastport: 5405
        ttl: 1
    }
    transport: udpu
    rrp_mode: active
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
nodelist {
    node {
        ring0_addr: 10.162.193.133
        ring1_addr: 192.168.128.3
        nodeid: 3
    }
    node {
        ring0_addr: 10.162.193.134
        ring1_addr: 192.168.128.4
        nodeid: 4
    }
}
quorum {
    # Enable and configure quorum subsystem (default: off)
    # see also corosync.conf.5 and votequorum.5
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
and the cluster looks like
Stack: corosync
Current DC: oldhanad1 (version 1.1.16-6.5.1-77ea74d) - partition with quorum
Last updated: Wed Jan 23 16:00:04 2019
Last change: Wed Jan 23 15:09:21 2019 by root via crm_node on oldhanad2
2 nodes configured
9 resources configured
Online: [ oldhanad1 oldhanad2 ]
Active resources:
killer (stonith:external/sbd): Started oldhanad1
Clone Set: base-clone [base-group]
Started: [ oldhanad1 oldhanad2 ]
test1 (ocf::heartbeat:Dummy): Started oldhanad1
test2 (ocf::heartbeat:Dummy): Started oldhanad2
test3 (ocf::heartbeat:Dummy): Started oldhanad1
test4 (ocf::heartbeat:Dummy): Started oldhanad2
It is advisable to set
no-quorum-policy=freeze
as the cluster will otherwise stop resources on loss of quorum, which might defeat the purpose of this procedure.
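With crmsh, this could for example be set as follows (a sketch; adjust to the tooling in use):
crm configure property no-quorum-policy=freeze
# check the current value afterwards
crm configure show | grep no-quorum-policy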
to add the nodes
oldhanaa1
oldhanaa2
the following is done
1) make sure that the cluster is not running on
oldhanaa1
oldhanaa2
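Whether pacemaker is already stopped on these nodes can be verified, for example, with:
systemctl is-active pacemaker
# "inactive" is the expected answer on a node that is not yet part of the cluster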
2) create the new corosync.conf
in this example
--- corosync.conf 2019-01-23 15:09:32.944006828 +0100
+++ corosync.conf.4node 2019-01-23 13:14:01.845075910 +0100
@@ -44,6 +44,16 @@
nodelist {
node {
+ ring0_addr: 10.162.193.12
+ ring1_addr: 192.168.128.1
+ nodeid: 1
+ }
+ node {
+ ring0_addr: 10.162.193.126
+ ring1_addr: 192.168.128.2
+ nodeid: 2
+ }
+ node {
ring0_addr: 10.162.193.133
ring1_addr: 192.168.128.3
nodeid: 3
@@ -61,6 +71,6 @@
# Enable and configure quorum subsystem (default: off)
# see also corosync.conf.5 and votequorum.5
provider: corosync_votequorum
- expected_votes: 2
- two_node: 1
+ expected_votes: 4
+ two_node: 0
}
We add the 2 new nodes and change the quorum section accordingly.
3) copy the new corosync.conf onto all servers
oldhanaa1
oldhanaa2
oldhanad1
oldhanad2
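Assuming the file was edited in the standard location /etc/corosync/corosync.conf, the copy could for example be done like this (a sketch using the hostnames of this example):
# distribute the new configuration to every cluster member
for h in oldhanaa1 oldhanaa2 oldhanad1 oldhanad2; do
    scp /etc/corosync/corosync.conf $h:/etc/corosync/corosync.conf
done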
4) check all resources that have a local dependency on a node, for example HANA databases. If in doubt, set these resources to
unmanaged
to prevent them from being moved by the cluster.
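With crmsh this could look like the following sketch, using the dummy resource test1 from this example as a stand-in:
crm resource unmanage test1
# and once the resize is complete:
crm resource manage test1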
5) invoke on one of the already active cluster nodes, in this example
that would be
oldhanad1 or oldhanad2
the command
corosync-cfgtool -R
the cluster will now change to
Stack: corosync
Current DC: oldhanad1 (version 1.1.16-6.5.1-77ea74d) - partition WITHOUT quorum
Last updated: Wed Jan 23 16:10:20 2019
Last change: Wed Jan 23 15:09:21 2019 by root via crm_node on oldhanad2
2 nodes configured
9 resources configured
Online: [ oldhanad1 oldhanad2 ]
Active resources:
killer (stonith:external/sbd): Started oldhanad1
Clone Set: base-clone [base-group]
Started: [ oldhanad1 oldhanad2 ]
test1 (ocf::heartbeat:Dummy): Started oldhanad1
test2 (ocf::heartbeat:Dummy): Started oldhanad2
test3 (ocf::heartbeat:Dummy): Started oldhanad1
test4 (ocf::heartbeat:Dummy): Started oldhanad2
because a 4-node cluster with only 2 nodes online does not have quorum.
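To confirm that corosync on the running nodes has picked up the new node list, the runtime configuration database can be inspected, for example with:
corosync-cmapctl | grep nodelist.node
# all 4 node entries should now be listed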
6) start pacemaker on nodes
oldhanaa1
oldhanaa2
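For example, run on each of the two new nodes:
systemctl start pacemaker
# alternatively: crm cluster start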
the cluster will change to
Stack: corosync
Current DC: oldhanad1 (version 1.1.16-6.5.1-77ea74d) - partition with quorum
Last updated: Wed Jan 23 16:11:05 2019
Last change: Wed Jan 23 16:10:49 2019 by hacluster via crmd on oldhanad1
4 nodes configured
13 resources configured
Online: [ oldhanaa1 oldhanaa2 oldhanad1 oldhanad2 ]
Active resources:
killer (stonith:external/sbd): Started oldhanad1
Clone Set: base-clone [base-group]
Started: [ oldhanaa1 oldhanaa2 oldhanad1 oldhanad2 ]
test1 (ocf::heartbeat:Dummy): Started oldhanad1
test2 (ocf::heartbeat:Dummy): Started oldhanaa2
test3 (ocf::heartbeat:Dummy): Started oldhanaa1
test4 (ocf::heartbeat:Dummy): Started oldhanad2
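At this point the quorum status can be verified, for example with:
corosync-quorumtool -s
# expected votes should now be 4 and the partition should be quorate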
Second step: remove the original nodes, i.e. shrink the cluster back to 2 nodes.
1) put the nodes to be removed in standby
crm node standby oldhanad1
crm node standby oldhanad2
2) stop the cluster on the nodes to be removed
oldhanad1
oldhanad2
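For example, run on each node that is being removed:
crm cluster stop
# or: systemctl stop pacemaker corosync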
the cluster will change to
Stack: corosync
Current DC: oldhanaa1 (version 1.1.16-6.5.1-77ea74d) - partition WITHOUT quorum
Last updated: Wed Jan 23 16:17:09 2019
Last change: Wed Jan 23 16:16:49 2019 by root via crm_attribute on oldhanad2
4 nodes configured
13 resources configured
Node oldhanad1: OFFLINE (standby)
Node oldhanad2: OFFLINE (standby)
Online: [ oldhanaa1 oldhanaa2 ]
Active resources:
killer (stonith:external/sbd): Started oldhanaa1
Clone Set: base-clone [base-group]
Started: [ oldhanaa1 oldhanaa2 ]
test1 (ocf::heartbeat:Dummy): Started oldhanaa1
test2 (ocf::heartbeat:Dummy): Started oldhanaa2
test3 (ocf::heartbeat:Dummy): Started oldhanaa1
test4 (ocf::heartbeat:Dummy): Started oldhanaa2
3) modify the corosync.conf to
--- corosync.conf 2019-01-23 12:04:53.851763302 +0100
+++ corosync.conf.2Node 2019-01-23 16:15:59.077676093 +0100
@@ -53,16 +53,6 @@
ring1_addr: 192.168.128.2
nodeid: 2
}
- node {
- ring0_addr: 10.162.193.133
- ring1_addr: 192.168.128.3
- nodeid: 3
- }
- node {
- ring0_addr: 10.162.193.134
- ring1_addr: 192.168.128.4
- nodeid: 4
- }
}
@@ -71,6 +61,6 @@
# Enable and configure quorum subsystem (default: off)
# see also corosync.conf.5 and votequorum.5
provider: corosync_votequorum
- expected_votes: 4
- two_node: 0
+ expected_votes: 2
+ two_node: 1
}
We remove the 2 original nodes and change the quorum section back to the 2-node settings.
4) copy this modified corosync.conf file onto
oldhanaa1
oldhanaa2
5) remove the old nodes
crm node delete oldhanad1
crm node delete oldhanad2
6) invoke on one of the nodes that still actively run the cluster,
in this example that would be
oldhanaa1 or oldhanaa2
the command
corosync-cfgtool -R
the cluster will change to
Stack: corosync
Current DC: oldhanaa1 (version 1.1.16-6.5.1-77ea74d) - partition with quorum
Last updated: Wed Jan 23 16:21:36 2019
Last change: Wed Jan 23 16:21:34 2019 by root via crm_node on oldhanaa2
2 nodes configured
9 resources configured
Online: [ oldhanaa1 oldhanaa2 ]
Active resources:
killer (stonith:external/sbd): Started oldhanaa1
Clone Set: base-clone [base-group]
Started: [ oldhanaa1 oldhanaa2 ]
test1 (ocf::heartbeat:Dummy): Started oldhanaa1
test2 (ocf::heartbeat:Dummy): Started oldhanaa2
test3 (ocf::heartbeat:Dummy): Started oldhanaa1
test4 (ocf::heartbeat:Dummy): Started oldhanaa2
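As a final check, the ring status on the remaining nodes can be inspected, for example with:
corosync-cfgtool -s
# both rings should be reported as active with no faults on oldhanaa1 and oldhanaa2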
Additional Information
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 7023669
- Creation Date: 24-Jan-2019
- Modified Date: 12-Mar-2024
- SUSE Linux Enterprise High Availability Extension
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com