All Pacemaker nodes stuck UNCLEAN (offline) after corosync update
This document (000019604) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise 15
SUSE Linux Enterprise 15 SP1
SUSE Linux Enterprise High Availability Extension 15
SUSE Linux Enterprise High Availability Extension 15 SP1
SUSE Linux Enterprise Server for SAP Applications 15
SUSE Linux Enterprise Server for SAP Applications 15 SP1
Situation
This is caused by the corosync Ring ID jumping to a massive number, greater than ring ID consumers such as Pacemaker and DLM accept.
The problem starts when the cluster runs mixed corosync versions and one node runs one of the problem versions listed below. When the node with the problematic corosync version leaves the cluster, the corosync ring ID jumps to a massive number on the remaining online nodes. A quick way to compare the corosync version across all nodes is sketched after the symptom list below.
Symptoms:
1. crm status shows all nodes "UNCLEAN (offline)"
2. After starting pacemaker.service, pacemaker-controld will fail in a loop. Journal logs will show:
pacemaker-controld[17625]: error: Input I_ERROR received in state S_STARTING from reap_dead_nodes
pacemaker-controld[17625]: notice: State transition S_STARTING -> S_RECOVERY
pacemaker-controld[17625]: warning: Fast-tracking shutdown in response to errors
pacemaker-controld[17625]: error: Start cancelled... S_RECOVERY
pacemaker-controld[17625]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
pacemaker-controld[17625]: notice: Disconnected from the executor
3. When a node is attempting to join the cluster, corosync will log messages like:
corosync[2704]: [TOTEM ] Received memb_merge_detect message is too short... ignoring.
corosync[2704]: [TOTEM ] Received memb_join message is too short... ignoring.
or:
corosync[2704]: [TOTEM ] Received message corrupted... ignoring.
4. Checking the corosync Ring ID will show a massive number (in the billions range). In this example, 4294967616:
# corosync-quorumtool
Quorum information
------------------
Date: Thu Apr 2 20:00:00 2020
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 1
Ring ID: 1/4294967616
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
1 1 10.10.0.11 (local)
2 1 10.10.0.12
3 1 10.10.0.13
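To confirm whether mixed or problematic corosync versions are involved, compare the installed package version on every node. A minimal sketch, assuming three example nodes named node1, node2 and node3 that are reachable via ssh as root (adjust the host names to your environment):
# Compare the installed corosync (and pacemaker) versions across all nodes.
for node in node1 node2 node3; do
    echo "== ${node} =="
    ssh root@"${node}" "rpm -q corosync pacemaker"
done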
Resolution
1. Stop cluster services on all cluster nodes. On each node run:
crm cluster stop
2. Remove the corrupt ring ID files on all nodes. On each node run:
rm /var/lib/corosync/ringid_*
3. Update corosync to the following version or greater:
SLE15 SP0: corosync-2.4.4-5.6.1
SLE15 SP1: corosync-2.4.4-9.6.1
4. Start pacemaker on all cluster nodes. On each node run:
crm cluster start
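Taken together, the steps above can be scripted from a management host. A minimal sketch, assuming three example nodes node1, node2 and node3 with passwordless root ssh access (the host names and the zypper call are assumptions; adapt them to your environment and patch process):
NODES="node1 node2 node3"   # example host names

# 1. Stop cluster services on every node.
for n in $NODES; do ssh root@"$n" "crm cluster stop"; done

# 2. Remove the corrupt ring ID files on every node.
for n in $NODES; do ssh root@"$n" "rm -f /var/lib/corosync/ringid_*"; done

# 3. Update corosync to the fixed version from the configured update repositories.
for n in $NODES; do ssh root@"$n" "zypper --non-interactive update corosync"; done

# 4. Start cluster services on every node.
for n in $NODES; do ssh root@"$n" "crm cluster start"; done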
Pacemaker and DLM should also be updated so that they can handle the larger ring ID. These updates are recommended, but not required to fix the corruption problem; they allow corosync ring ID consumers to accept the new uint64_t value instead of the old uint32_t value.
Update Pacemaker/DLM to the following versions or greater:
SLE15 SP0:
DLM: [To be determined]
Pacemaker: pacemaker-1.1.18+20180430.b12c320f5-3.21.1
SLE15 SP1:
DLM: [To be determined]
Pacemaker: pacemaker-2.0.1+20190417.13d370ca9-3.9.1
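To verify that the installed packages meet these minimum versions, query them directly on each node. A minimal sketch (the exact DLM package name can vary, so a pattern search is used for it):
# Show the installed corosync and pacemaker versions.
rpm -q corosync pacemaker

# List installed DLM-related packages, whatever their exact names are.
rpm -qa | grep -i dlm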
Cause
Cluster nodes running mixed versions of corosync cannot join each other. If a node running the problematic version of corosync leaves the cluster for any reason, the remaining online nodes end up with an unexpectedly large ring ID. Even if the cluster nodes eventually run a consistent version of corosync, a patched Pacemaker and DLM are needed to handle the larger ring ID correctly. Otherwise, cluster services on all nodes must be stopped at the same time and the /var/lib/corosync/ringid_* files removed to work around the ring ID issue.
This corosync compatibility issue breaks the "Cluster Rolling Update" method when updating from any previous corosync version to the problematic version, or from the problematic version to newer versions.
The problematic patch has now been retracted in the latest maintenance update. If you have already updated to the problematic corosync version, you will have to perform a "Cluster Offline Update", because "Cluster Rolling Update" compatibility is broken in this case. Additionally, while Pacemaker is stopped for the update, the corrupt /var/lib/corosync/ringid_* files must be removed.
Problem versions:
SLE15 SP0: corosync-2.4.4-5.3.1
SLE15 SP1: corosync-2.4.4-9.3.1
Fixed versions:
SLE15 SP0: corosync-2.4.4-5.6.1
SLE15 SP1: corosync-2.4.4-9.6.1
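To check whether one of the problem versions is currently installed and whether the fixed version is already available from the configured repositories, something like the following can be run on each node (output details vary with the zypper version):
# Installed corosync version on this node.
rpm -q corosync

# corosync versions available from the configured repositories.
zypper search --details corosync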
Status
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000019604
- Creation Date: 08-Apr-2020
- Modified Date: 29-Apr-2020
- SUSE Linux Enterprise High Availability Extension
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com