Pacemaker: the configured stop timeout is not respected when stopping an OCFS2 filesystem resource.
This document (000020860) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server for SAP Applications 12
Situation
Here is an example of a Filesystem resource configuration:
primitive myfs Filesystem \
params device="/dev/mapper/3600axxxxxxxxxxx" directory="/myfs" fstype=ocfs2 options=acl \
op monitor interval=0 timeout=10 \
op start interval=0 timeout=20 \
op stop interval=0 timeout=240
And here is an example of the error while trying to stop the resource:
crmd[8888]: notice: Initiating stop operation myfs_stop_0 locally on fileserver2p
lrmd[8888]: notice: executing - rsc:myfs action:stop call_id:102
Filesystem(myfs)[99999]: INFO: Running stop for /dev/mapper/3600axxxxxxxxxxx on /myfs
Filesystem(myfs)[99999]: INFO: Trying to unmount /myfs
Filesystem(myfs)[99999]: ERROR: Couldn't unmount /myfs; trying cleanup with TERM
Filesystem(myfs)[99999]: INFO: No processes on /myfs were signalled. force_unmount is set to 'yes'
Filesystem(myfs)[99999]: ERROR: Couldn't unmount /myfs, giving up!
lrmd[8888]: notice: myfs_stop_0:99999:stderr [ umount: /myfs: target is busy. ]
lrmd[8888]: notice: myfs_stop_0:99999:stderr [ ocf-exit-reason:Couldn't unmount /myfs; trying cleanup with TERM ]
lrmd[8888]: notice: myfs_stop_0:99999:stderr [ umount: /myfs: target is busy. ]
lrmd[8888]: notice: myfs_stop_0:99999:stderr [ ocf-exit-reason:Couldn't unmount /myfs; trying cleanup with KILL ]
lrmd[8888]: notice: myfs_stop_0:99999:stderr [ ocf-exit-reason:Couldn't unmount /myfs, giving up! ]
lrmd[8888]: notice: finished - rsc:myfs action:stop call_id:102 pid:99999 exit-code:1 exec-time:7050ms queue-time:0ms
crmd[8888]: notice: Result of stop operation for myfs on node2: 1 (unknown error)
In this example, the resource is configured with a stop timeout of 240 seconds, but the logs show the stop operation failing after 7050 ms (about 7 seconds).
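The gap between the configured timeout and the actual run time can be read directly from the lrmd "finished" line. The following is a minimal sketch, assuming the log line format shown above; the variable names are illustrative and not part of Pacemaker.

```shell
#!/bin/sh
# Log line copied from the example above.
log_line='lrmd[8888]: notice: finished - rsc:myfs action:stop call_id:102 pid:99999 exit-code:1 exec-time:7050ms queue-time:0ms'

# Configured stop timeout (op stop timeout=240), expressed in milliseconds.
stop_timeout_ms=240000

# Extract the number that follows "exec-time:" and precedes "ms".
exec_ms=$(printf '%s\n' "$log_line" | sed -n 's/.*exec-time:\([0-9][0-9]*\)ms.*/\1/p')

# A stop that fails well before the operation timeout points at the agent
# giving up on its own, not at Pacemaker timing out the operation.
if [ "$exec_ms" -lt "$stop_timeout_ms" ]; then
    echo "stop ended after ${exec_ms} ms, before the ${stop_timeout_ms} ms timeout"
fi
```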
Resolution
Set the fast_stop parameter of the Filesystem resource to "no" so that the agent keeps trying to unmount the filesystem for the full stop timeout. The resource agent metadata describes the parameter as follows:
# crm ra info ocf:Filesystem
fast_stop (boolean, [yes]): fast stop
Normally, we expect no users of the filesystem and the stop
operation to finish quickly. If you cannot control the filesystem
users easily and want to prevent the stop action from failing,
then set this parameter to "no" and add an appropriate timeout
for the stop operation.
Applied to the example from the "Situation" section, the resource configuration should look similar to this (note the added fast_stop=no):
primitive myfs Filesystem \
params device="/dev/mapper/3600axxxxxxxxxxx" directory="/myfs" fstype=ocfs2 options=acl fast_stop=no \
op monitor interval=0 timeout=10 \
op start interval=0 timeout=20 \
op stop interval=0 timeout=240
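One possible way to apply the change to an existing resource is the crmsh "resource param" subcommand. The sketch below assumes that the installed crmsh version provides this subcommand. The function name set_fast_stop_no and the DRY_RUN variable are hypothetical helpers introduced for illustration. By default the function only prints the command so that it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: set fast_stop=no on a Filesystem resource.
# With DRY_RUN=yes (the default) the command is printed, not executed.
set_fast_stop_no() {
    rsc=$1
    if [ "${DRY_RUN:-yes}" = "yes" ]; then
        echo "crm resource param $rsc set fast_stop no"
    else
        # Assumption: crmsh supports "crm resource param <rsc> set <param> <value>".
        crm resource param "$rsc" set fast_stop no
    fi
}

# Print the command for the example resource from this document.
set_fast_stop_no myfs
```

After reviewing the printed command, it can be executed on a cluster node by calling the function with DRY_RUN=no. The result can then be checked with "crm configure show myfs".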
Cause
When fast_stop is true, the stop action of the Filesystem resource agent uses a fixed unmount timeout of 6 seconds instead of the configured stop timeout, as this excerpt from the agent shows:
# /usr/lib/ocf/resource.d/heartbeat/Filesystem
# Umount all sub-filesystems mounted under $MOUNTPOINT/ too.
local timeout
for SUB in `list_submounts $MOUNTPOINT` $MOUNTPOINT; do
    ocf_log info "Trying to unmount $SUB"
    if ocf_is_true "$FAST_STOP"; then
        timeout=6
    else
        timeout=${OCF_RESKEY_CRM_meta_timeout:="20000"}
        timeout=$((timeout/1000))
    fi
    fs_stop $SUB $timeout
    rc=$?
    if [ $rc -ne $OCF_SUCCESS ]; then
        ocf_exit_reason "Couldn't unmount $SUB, giving up!"
    fi
done
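The effect of this branch on the example resource can be reproduced in isolation. The following is a simplified sketch and not the agent's code. The function name unmount_budget is hypothetical, and the check against the literal string "yes" stands in for ocf_is_true, which accepts more spellings of a true value.

```shell
#!/bin/sh
# Hypothetical helper mirroring the timeout selection in the excerpt above.
#   $1: value of fast_stop ("yes" or "no")
#   $2: stop operation timeout in milliseconds (falls back to 20000, as in the agent)
unmount_budget() {
    fast_stop=$1
    meta_timeout_ms=${2:-20000}
    if [ "$fast_stop" = "yes" ]; then
        # fast_stop overrides the configured stop timeout with a fixed 6 seconds.
        echo 6
    else
        # Otherwise the stop timeout is converted from milliseconds to seconds.
        echo $((meta_timeout_ms / 1000))
    fi
}

# Example resource: op stop timeout=240, which is 240000 ms.
echo "fast_stop=yes: $(unmount_budget yes 240000) s"
echo "fast_stop=no:  $(unmount_budget no 240000) s"
```

With fast_stop=yes the unmount budget is 6 seconds regardless of the configured 240 seconds, which is in line with the stop operation in the log ending after roughly 7 seconds.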
The default value of fast_stop was changed to "no" in the resource agents package v4.7.x. The following upstream commit explains the confusion that the old default value caused:
https://github.com/ClusterLabs/resource-agents/commit/57b6019ffc141c803d879df2352e699fbb72f7dc
Set OCF_RESKEY_fast_stop_default="no" for RHEL and CentOS major releases
9 and above, and for all other distros.
In the past, this attribute has defaulted to "yes", which has caused a
lot of confusion for users. fast_stop preempts the resource's stop
timeout, causing the agent to give up on unmounting the filesystem after
six seconds and declare a stop failure. (The resource operation does not
time out.)
The existence of a stop operation timeout renders fast_stop unnecessary,
and users typically expect that the agent will keep trying to unmount
the filesystem until the full stop operation timeout expires.
The resource agents package v4.7.x is not available on SLES 12 based systems, so the fast_stop parameter must be set to "no" explicitly in the resource configuration, as described in the "Resolution" section above.
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020860
- Creation Date: 16-Nov-2022
- Modified Date: 03-Mar-2023
- SUSE Linux Enterprise High Availability Extension
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com