Simulating a Cluster Network Failure
This document (7017617) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise High Availability Extension 12
SUSE Linux Enterprise High Availability Extension 15
Situation
Testing a cluster's reaction to a network failure is normally done at the physical level, by pulling a network cable or powering off a switch. This simulates the real-world scenario in which the OS has no control over, and no indication of, the problem apart from the cluster no longer being able to communicate.
In many cases this preferred approach is not applicable: the cluster nodes may be virtual machines with no physical connection that can be removed, or removing the physical connection would be too difficult, might affect other areas, or is otherwise undesirable at the moment.
Please keep in mind that bringing down the interface with, for example
ifdown eth1
is NOT recommended. This will most likely only cause other, erratic issues. Disabling an interface in this or any comparable way is not a valid test of the cluster communication.
A further argument against
ifdown eth1
is that it removes the IP address locally, so any local application or service that relies on this part of the network will receive an error. This test would therefore not only trigger a cluster communication issue, but most likely also a local resource failure.
Resolution
Assume the following setup:
Node A uses local IP 192.168.20.193 for cluster communication
Node B uses local IP 192.168.20.228 for cluster communication
The idea is to block the communication between the nodes. This can be done by implementing firewall rules on one node to
not send to the other IP
and
not receive from the other IP
Following the example above with Node A and Node B, one can implement this by running on Node B:
iptables -A INPUT -s 192.168.20.193 -j DROP; iptables -A OUTPUT -d 192.168.20.193 -j DROP
which means that all traffic coming from source 192.168.20.193 (Node A) and all traffic going to 192.168.20.193 (Node A) will be dropped by the kernel on Node B.
This breaks the cluster communication without removing or influencing any relevant local network settings, and without any system notification to a service, socket, or application.
For the cluster stack this appears to be a split brain.
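For repeated tests it can be convenient to wrap the two rules in a small helper script. The following sketch is illustrative only: the variable name PEER is an assumption, and the commands are printed as a dry run rather than executed. Remove the echo prefix and run it as root on Node B to actually install the DROP rules.

```shell
#!/bin/sh
# Hypothetical helper: block all cluster traffic to/from one peer node.
# PEER is the cluster IP of the other node (Node A in the example above).
PEER=192.168.20.193

# Dry run: print the rules instead of applying them.
# Remove "echo" (and run as root) to really install the DROP rules.
echo iptables -A INPUT -s "$PEER" -j DROP
echo iptables -A OUTPUT -d "$PEER" -j DROP
```

Parameterizing the peer IP makes it easy to run the same test against a different node or a different cluster ring later.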
You can remove the block at any time by flushing the iptables rules with
iptables -F
This might be especially useful because, in a split brain, the node carrying the iptables rules may end up as the survivor. The other node then reboots into a split brain and might fence the formerly surviving node because of startup fencing.
Keep in mind that -F removes all rules, so if iptables / the firewall is also used for something else, flushing may have an effect on other areas.
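If the firewall is also used for other purposes, a more surgical cleanup is to delete exactly the two test rules with -D instead of flushing everything. As above, this sketch (with the assumed PEER variable) prints the commands as a dry run; remove the echo prefix and run as root to execute them.

```shell
#!/bin/sh
# Hypothetical cleanup: remove only the two DROP rules added for the test,
# leaving any other firewall rules untouched.
PEER=192.168.20.193

# Dry run: print the delete commands; remove "echo" to execute as root.
# "iptables -D" deletes the first rule matching the given specification.
echo iptables -D INPUT -s "$PEER" -j DROP
echo iptables -D OUTPUT -d "$PEER" -j DROP
```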
Please also keep in mind that if the IPs used for cluster communication are also used by applications, there might not only be a cluster split brain, but also a resource failure.
Additional Information
Physical separation or iptables rules are the recommended ways to test cluster communication in Corosync clusters.
See also: Corosync-and-ifdown-on-active-network-interface
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 7017617
- Creation Date: 19-May-2016
- Modified Date: 05-Nov-2021
- SUSE Linux Enterprise Server
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com