System hangs with a large number of tasks in uninterruptible sleep waiting for fanotify events
This document (000019761) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server 12 SP4
SUSE Linux Enterprise Server 12 SP3
Situation
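The system hangs with a large number of tasks in uninterruptible sleep. The blocked tasks are waiting for responses to fanotify events, which are polled by McAfee-related processes. Analysis with the crash utility shows the following: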
crash> sys|grep LOAD
LOAD AVERAGE: 52.38, 39.16, 18.12
crash> ps -S
  RU: 3
  IN: 844
  UN: 52
crash> foreach UN bt | grep "#2\|#3 " | awk '{print $3 }' | sort | uniq -c|sort -rn
     50 fsnotify
     50 fanotify_handle_event
      1 wait_for_completion_killable
      1 schedule_timeout
      1 rwsem_down_read_failed
      1 call_rwsem_down_read_failed

Almost all the tasks have the same stack trace, for example:
PID: 5393 TASK: ffff933e2c669340 CPU: 0 COMMAND: "sapstartsrv"
 #0 [ffffacec476d3958] __schedule at ffffffffae716042
 #1 [ffffacec476d39e0] schedule at ffffffffae716662
 #2 [ffffacec476d39f0] fanotify_handle_event at ffffffffae2ac0e7
 #3 [ffffacec476d3a60] fsnotify at ffffffffae2a8a56
 #4 [ffffacec476d3b40] do_dentry_open at ffffffffae260b1b
 #5 [ffffacec476d3b80] path_openat at ffffffffae2722ed
 #6 [ffffacec476d3c58] do_filp_open at ffffffffae274d1e
 #7 [ffffacec476d3d60] do_sys_open at ffffffffae262156
 #8 [ffffacec476d3db0] mfe_aac_sys_openat at ffffffffc08d1169 [mfe_aac_100606122]

The McAfee module is loaded and tainting the kernel:
crash> mod -t
NAME               TAINTS
mfe_aac_100606122  OE

All tasks blocked in uninterruptible sleep are waiting in fanotify_handle_event(), except for "Collect FA Evnt" and "nfsd". These two tasks have also spent the longest time in UN state. The "Collect FA Evnt" task is responsible for collecting/validating the fanotify events, yet its stack trace shows that it is itself blocked, waiting for a rw_semaphore lock:
crash> bt ffff933de84f8500
PID: 13567 TASK: ffff933de84f8500 CPU: 1 COMMAND: "Collect FA Evnt"
 #0 [ffffacec510a7ae8] __schedule at ffffffffae716042
 #1 [ffffacec510a7b70] schedule at ffffffffae716662
 #2 [ffffacec510a7b80] rwsem_down_read_failed at ffffffffae7194ef
 #3 [ffffacec510a7be0] call_rwsem_down_read_failed at ffffffffae3c2704
 #4 [ffffacec510a7c28] down_read at ffffffffae718a63
 #5 [ffffacec510a7c30] lookup_slow at ffffffffae26f966
 #6 [ffffacec510a7c88] walk_component at ffffffffae27127f
 #7 [ffffacec510a7ce0] path_lookupat at ffffffffae2718d9
 #8 [ffffacec510a7d38] filename_lookup at ffffffffae2741bc
 #9 [ffffacec510a7e48] vfs_statx at ffffffffae269204
#10 [ffffacec510a7e98] SYSC_newstat at ffffffffae269af6
#11 [ffffacec510a7f30] do_syscall_64 at ffffffffae003954
#12 [ffffacec510a7f50] entry_SYSCALL_64_after_hwframe at ffffffffae80009a
    RIP: 00007f5b31955525  RSP: 00007f5aeaff3778  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00007f5aeaff3780  RCX: 00007f5b31955525
    RDX: 00007f5aeaff4b10  RSI: 00007f5aeaff4b10  RDI: 00007f5aeaff3780
    RBP: 00007f5aeaff4b10   R8: 000000000000000f   R9: 00007f5aeaff46b8
    R10: 0000000000000007  R11: 0000000000000246  R12: 00007f5aeaff4830
    R13: 00007f5aeaff4830  R14: 000055eaa137d690  R15: 00007f5aeaff8c10
    ORIG_RAX: 0000000000000004  CS: 0033  SS: 002b

The semaphore is held (owned) by the "nfsd" task:
crash> struct rw_semaphore.owner ffff933a7c4741d0
  owner = 0xffff9339a9f31040

The "nfsd" task is in turn blocked in fanotify_handle_event() while trying to open an NFS-exported file on /usr/sap/trans:
crash> bt ffff9339a9f31040
PID: 48565 TASK: ffff9339a9f31040 CPU: 0 COMMAND: "nfsd"
 #0 [ffffacec54ca3990] __schedule at ffffffffae716042
 #1 [ffffacec54ca3a18] schedule at ffffffffae716662
 #2 [ffffacec54ca3a28] fanotify_handle_event at ffffffffae2ac0e7
 #3 [ffffacec54ca3a98] fsnotify at ffffffffae2a8a56
 #4 [ffffacec54ca3b78] do_dentry_open at ffffffffae260b1b
 #5 [ffffacec54ca3bb8] dentry_open at ffffffffae261e24
 #6 [ffffacec54ca3be8] nfsd_open at ffffffffc078d8fe [nfsd]
 #7 [ffffacec54ca3c20] nfs4_get_vfs_file at ffffffffc07a79fa [nfsd]
 #8 [ffffacec54ca3cd0] nfsd4_process_open2 at ffffffffc07ac296 [nfsd]
 #9 [ffffacec54ca3da8] nfsd4_open at ffffffffc079b707 [nfsd]
#10 [ffffacec54ca3e00] nfsd4_proc_compound at ffffffffc079bd78 [nfsd]
#11 [ffffacec54ca3e48] nfsd_dispatch at ffffffffc07891dc [nfsd]
#12 [ffffacec54ca3e78] svc_process_common at ffffffffc0705447 [sunrpc]
#13 [ffffacec54ca3ed0] svc_process at ffffffffc07064e4 [sunrpc]
#14 [ffffacec54ca3ef0] nfsd at ffffffffc0788c83 [nfsd]
#15 [ffffacec54ca3f10] kthread at ffffffffae0b0186
#16 [ffffacec54ca3f50] ret_from_fork at ffffffffae800235
crash> struct path.mnt ffffacec54ca3bf8
  mnt = 0xffff933e79f3e560
crash> struct vfsmount.mnt_sb 0xffff933e79f3e560
  mnt_sb = 0xffff933e73076800
crash> mount|grep ffff933e73076800
ffff933e79f3e540 ffff933e73076800 ext4 /dev/mapper/datavg00-usr_sap_trans_lv /usr/sap/trans
crash> struct dentry.d_name.name 0xffff933998ad80c0
  d_name.name = 0xffff933998ad80f8 "KFS.LOB"
crash> struct dentry.d_name.name 0xffff933de0976cc0
  d_name.name = 0xffff933de0976cf8 "tmp"

This appears to be a deadlock: the "nfsd" task holds the rw_semaphore while it waits for a fanotify response, and the "Collect FA Evnt" task, which handles the fanotify events, is waiting for that same rw_semaphore. The deadlock is caused by the fanotify events polled by McAfee-related processes.
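The dependency chain found in the dump can be modelled as a small wait-for graph. The following Python sketch is illustrative only and is not part of crash or of any SUSE tooling. The names WAITS_FOR and find_cycle are hypothetical, the edges are copied by hand from the stack traces in this document, and the sketch assumes that pending fanotify events are answered by the "Collect FA Evnt" task.

```python
from typing import Dict, List, Optional

# Hypothetical wait-for edges, hand-copied from the analysis in this document.
# Each task maps to the task it is blocked on.
WAITS_FOR: Dict[str, str] = {
    # open() is blocked in fanotify_handle_event(); assumed to be waiting
    # for a response from the task that collects fanotify events
    "sapstartsrv": "Collect FA Evnt",
    "nfsd": "Collect FA Evnt",
    # stat() is blocked in down_read() on a rw_semaphore owned by nfsd
    "Collect FA Evnt": "nfsd",
}


def find_cycle(waits_for: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow wait-for edges from `start`; return the cycle reached, if any."""
    seen: List[str] = []
    task: Optional[str] = start
    while task is not None and task not in seen:
        seen.append(task)
        task = waits_for.get(task)
    if task is None:
        # The chain ended at a task that is not waiting on anything.
        return None
    # `task` was visited before: everything from its first visit is the cycle.
    return seen[seen.index(task):]


if __name__ == "__main__":
    print(find_cycle(WAITS_FOR, "sapstartsrv"))
```

Starting from any blocked task, the walk ends in the two-task cycle between "Collect FA Evnt" and "nfsd", which is why none of the tasks waiting in fanotify_handle_event() can make progress.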
Resolution
Cause
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000019761
- Creation Date: 27-Oct-2020
- Modified Date: 27-Oct-2020
- SUSE Linux Enterprise Server
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com