ocfs2 on SLES10 NTS sanity check (OCFS2 HEARTBEAT)
This document (7001469) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server 10 Service Pack 1
Situation
Resolution
On each node, check the following either directly on the system or in the corresponding output of supportconfig.sh.
1. Kernel version
The kernel should be the latest kernel of the installed Service Pack. Caveat: there is a version change of ocfs2 from SLES10 SP1 to SLES10 SP2.
Checking on the command line via:
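For example, on a SLES10 system the running kernel and the installed kernel packages can be checked with something like:
uname -r                      # version of the currently running kernel
rpm -qa | grep -i '^kernel'   # installed kernel packages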
2. Check that the appropriate module is loaded. Make sure this module belongs to the kernel identified in step 1 and is not a weak-updates module or anything else.
Checking on the command line via:
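For example, something like the following can be used; the path in the modinfo output shows whether the module belongs to the running kernel or comes from a weak-updates directory:
lsmod | grep ocfs2            # is the ocfs2 module loaded?
modinfo ocfs2 | grep filename # full path of the module that would be loaded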
3. Check that ocfs2 is activated during the boot process. Caveat: in this case you are looking for the o2cb init script in the system, not ocfs2.
Checking on the command line via:
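For example, the runlevel activation of the o2cb init script can be checked with something like:
chkconfig o2cb                # should report that o2cb is on
chkconfig --list o2cb         # shows the runlevels in which o2cb is started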
4. Check the ocfs2 settings themselves. Caveat: the output of /etc/init.d/o2cb status differs between SLES10 SP1 and SLES10 SP2, so it is best to rely on the configuration file in /etc/sysconfig.
Checking on the command line by looking at / editing:
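The O2CB_* settings listed below are normally kept in /etc/sysconfig/o2cb, so a check could look like:
grep '^O2CB' /etc/sysconfig/o2cb   # show the current o2cb settings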
The recommended values are:
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
O2CB_HEARTBEAT_MODE="user"
Special attention should be given to O2CB_HEARTBEAT_THRESHOLD, which defaults to 7 in older versions of SLES10; this might be acceptable for testing but not for production.
Caveat: if O2CB_HEARTBEAT_MODE="user" is used, then Heartbeat and STONITH have to be configured to get ocfs2 running. The reason is the following: with the mode set to "user", ocfs2 relies on the cluster communication from Heartbeat, but the only way Heartbeat can send a notify is if STONITH tells the cluster that a node is gone. So without STONITH, ocfs2 does not work in "user" mode. We recommend the use of "user" mode.
5. Check the ocfs2 configuration file
Checking on the command line by looking at / editing:
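The ocfs2 cluster configuration is normally kept in /etc/ocfs2/cluster.conf, so for example:
cat /etc/ocfs2/cluster.conf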
The syntax of this file is explained by the following example. Caveat: this is only an example.
node:
        ip_port = 7777
        ip_address = 149.44.174.137
        number = 0
        name = power720-1
        cluster = rumburak

node:
        ip_port = 7777
        ip_address = 149.44.174.138
        number = 1
        name = power720-2
        cluster = rumburak

cluster:
        node_count = 2
        name = rumburak
The file breaks down into two areas: node, which must contain an entry for each node in the ocfs2 cluster, and cluster, which is only a summary of the cluster name and the node count. Caveat: the node numbers in the node sections start at 0, while node_count in the cluster section counts the nodes starting at 1.
If these checks are done, the mode is set to "user", and everything seems to be all right but there are still problems with ocfs2, then the next step should be to check the Heartbeat settings.
If the mode is "kernel" and there are still problems with ocfs2, then the next step should be to contact NTS.
The Heartbeat settings are explained by an example. You get the Heartbeat settings by issuing
cibadmin -Q > /tmp/suse.xml
on one node. As the Heartbeat settings are the same on all nodes, it is not necessary to collect the cibadmin output from every node.
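If only the sections discussed below are of interest, cibadmin can also dump individual CIB sections, for example:
cibadmin -Q -o crm_config     # cluster-wide options such as stonith-enabled
cibadmin -Q -o resources      # resource definitions such as the ocfs2 Filesystem clone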
Relevant entries are in the section crm_config, for example:
<crm_config>
...
<nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
...
</crm_config>
As stated above, without STONITH activated, ocfs2 in "user" mode will not work.
and in the section resources, for example:
<resources>
...
<clone id="ocfs2_fs">
<meta_attributes id="ocfs2_fs_meta_attrs">
<attributes>
<nvpair id="ocfs2_fs_metaattr_clone_max" name="clone_max" value="2"/>
<nvpair id="ocfs2_fs_metaattr_clone_node_max" name="clone_node_max" value="1"/>
<nvpair id="ocfs2_fs_metaattr_notify" name="notify" value="true"/>
<nvpair id="ocfs2_fs_metaattr_globally_unique" name="globally_unique" value="false"/>
</attributes>
</meta_attributes>
<primitive id="resource_fs" class="ocf" type="Filesystem" provider="heartbeat">
<instance_attributes id="resource_fs_instance_attrs">
<attributes>
<nvpair id="4ca301b0-142a-4664-b197-9c7385c59f46" name="device" value="/dev/disk/by-id/scsi-1494554000000000000000000030000005e2b00000d000000"/>
<nvpair id="9cc32e5e-f53b-4bf0-a065-74efc0b4e252" name="directory" value="/mnt/t1"/>
<nvpair id="cd560808-810d-4637-b0fb-d9f8636c9a1e" name="fstype" value="ocfs2"/>
</attributes>
</instance_attributes>
<operations>
<op id="493029d4-e225-4f81-89a2-bb7f2f076672" name="monitor" interval="20" timeout="40" start_delay="10" on_fail="fence" disabled="false" role="Started"/>
</operations>
</primitive>
</clone>
...
</resources>
The most common errors here are:
- notify not set to true
- globally_unique not set to false
- device not set to a /dev/disk/by-id/ value, which can lead to trouble if the numbering of the devices changes; iSCSI or SAN devices are possible culprits here
- no monitor operation set, which can result in nasty behaviour if an admin does an umount on the command line, bypassing Heartbeat, so that Heartbeat does not realize that ocfs2 on that node is gone.
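A quick way to spot most of these issues is a simple grep on the CIB dump created above, for example:
grep -E 'notify|globally_unique|device|monitor' /tmp/suse.xml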
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID:7001469
- Creation Date: 02-Oct-2008
- Modified Date:25-Feb-2021
- SUSE Linux Enterprise Server
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com