All Pacemaker nodes stuck UNCLEAN (offline) after corosync update
This document (000019604) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise 15
SUSE Linux Enterprise 15 SP1
SUSE Linux Enterprise High Availability Extension 15
SUSE Linux Enterprise High Availability Extension 15 SP1
SUSE Linux Enterprise Server for SAP Applications 15
SUSE Linux Enterprise Server for SAP Applications 15 SP1
Situation
This is caused by the corosync Ring ID jumping to a massive number, greater than ring ID consumers such as Pacemaker and DLM accept.
The problem starts when the cluster runs mixed corosync versions and one node runs one of the problem versions listed below. When the node with the problematic corosync version leaves the cluster, the corosync ring ID jumps to a massive number on the remaining online nodes. A quick way to compare the corosync version across all nodes is sketched after the symptom list below.
Symptoms:
1. crm status shows all nodes "UNCLEAN (offline)"
2. After starting pacemaker.service, pacemaker-controld will fail in a loop. Journal logs will show:
pacemaker-controld[17625]: error: Input I_ERROR received in state S_STARTING from reap_dead_nodes
pacemaker-controld[17625]: notice: State transition S_STARTING -> S_RECOVERY
pacemaker-controld[17625]: warning: Fast-tracking shutdown in response to errors
pacemaker-controld[17625]: error: Start cancelled... S_RECOVERY
pacemaker-controld[17625]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
pacemaker-controld[17625]: notice: Disconnected from the executor
3. When a node is attempting to join the cluster, corosync will log messages like:
corosync[2704]: [TOTEM ] Received memb_merge_detect message is too short... ignoring.
corosync[2704]: [TOTEM ] Received memb_join message is too short... ignoring.
or:
corosync[2704]: [TOTEM ] Received message corrupted... ignoring.
4. Checking the corosync Ring ID will show a massive number (in the billions range). In this example, 4294967616:
# corosync-quorumtool
Quorum information
------------------
Date: Thu Apr 2 20:00:00 2020
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 1
Ring ID: 1/4294967616
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
1 1 10.10.0.11 (local)
2 1 10.10.0.12
3 1 10.10.0.13
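To confirm whether mixed or problematic corosync versions are involved, compare the installed package version on every node. A minimal sketch, assuming three example nodes named node1, node2 and node3 that are reachable via ssh as root (adjust the host names to your environment):
# Compare the installed corosync (and pacemaker) versions across all nodes.
for node in node1 node2 node3; do
    echo "== ${node} =="
    ssh root@"${node}" "rpm -q corosync pacemaker"
done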
Resolution
1. Stop cluster services on all cluster nodes. On each node run:
crm cluster stop
2. Remove the corrupt ring ID files on all nodes. On each node run:
rm /var/lib/corosync/ringid_*
3. Update corosync to the following version or greater:
SLE15 SP0: corosync-2.4.4-5.6.1
SLE15 SP1: corosync-2.4.4-9.6.1
4. Start pacemaker on all cluster nodes. On each node run:
crm cluster start
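Taken together, the steps above can be scripted from a management host. A minimal sketch, assuming three example nodes node1, node2 and node3 with passwordless root ssh access (the host names and the zypper call are assumptions; adapt them to your environment and patch process):
NODES="node1 node2 node3"   # example host names

# 1. Stop cluster services on every node.
for n in $NODES; do ssh root@"$n" "crm cluster stop"; done

# 2. Remove the corrupt ring ID files on every node.
for n in $NODES; do ssh root@"$n" "rm -f /var/lib/corosync/ringid_*"; done

# 3. Update corosync to the fixed version from the configured update repositories.
for n in $NODES; do ssh root@"$n" "zypper --non-interactive update corosync"; done

# 4. Start cluster services on every node.
for n in $NODES; do ssh root@"$n" "crm cluster start"; done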
Pacemaker and DLM should also be updated so that they can handle the larger ring ID. These updates are recommended, but not required to fix the corruption problem; they allow corosync ring ID consumers to accept the new uint64_t value instead of the old uint32_t value.
Update Pacemaker/DLM to the following versions or greater:
SLE15 SP0:
DLM: [To be determined]
Pacemaker: pacemaker-1.1.18+20180430.b12c320f5-3.21.1
SLE15 SP1:
DLM: [To be determined]
Pacemaker: pacemaker-2.0.1+20190417.13d370ca9-3.9.1
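To verify that the installed packages meet these minimum versions, query them directly on each node. A minimal sketch (the exact DLM package name can vary, so a pattern search is used for it):
# Show the installed corosync and pacemaker versions.
rpm -q corosync pacemaker

# List installed DLM-related packages, whatever their exact names are.
rpm -qa | grep -i dlm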
Cause
Cluster nodes running mixed versions of corosync cannot join each other. If a node running the problematic version of corosync leaves the cluster for any reason, the remaining online nodes end up with an unexpectedly large ring ID. Even if the cluster nodes eventually run a consistent version of corosync, a patched Pacemaker and DLM are needed to handle the larger ring ID correctly. Otherwise, cluster services on all nodes must be stopped at the same time and the /var/lib/corosync/ringid_* files removed to work around the ring ID issue.
This corosync compatibility issue breaks the "Cluster Rolling Update" method when updating from any previous corosync version to the problematic version, or from the problematic version to newer versions.
The problematic patch has now been retracted in the latest maintenance update. If you have already updated to the problematic corosync version, you will have to perform a "Cluster Offline Update", because "Cluster Rolling Update" compatibility is broken in this case. Additionally, while Pacemaker is stopped for the update, the corrupt /var/lib/corosync/ringid_* files must be removed.
Problem versions:
SLE15 SP0: corosync-2.4.4-5.3.1
SLE15 SP1: corosync-2.4.4-9.3.1
Fixed versions:
SLE15 SP0: corosync-2.4.4-5.6.1
SLE15 SP1: corosync-2.4.4-9.6.1
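To check whether one of the problem versions is currently installed and whether the fixed version is already available from the configured repositories, something like the following can be run on each node (output details vary with the zypper version):
# Installed corosync version on this node.
rpm -q corosync

# corosync versions available from the configured repositories.
zypper search --details corosync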
Status
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000019604
- Creation Date: 08-Apr-2020
- Modified Date: 29-Apr-2020
- SUSE Linux Enterprise High Availability Extension
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com