SUSE Linux Enterprise Server 15
A Linux NFS client (SLES 15) is accessing files on an Azure Files NFS server. Two or more processes on the client are accessing the same file, frequently opening and closing it, and locking and unlocking it. Contention for the same file may occur. Normally, this should not cause a problem.
At some point, one of these processes may hang.
It is also possible that the client may start logging warnings in /var/log/messages such as:
NFS: __nfs4_reclaim_open_state: Lock reclaim failed!
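To check whether a client has logged this warning, something like the following sketch may help. The function name count_reclaim_failures is hypothetical and used only for illustration; the default log path is the one named above.

```shell
#!/bin/sh
# Count occurrences of the lock-reclaim warning in a log file.
# count_reclaim_failures is an illustrative helper name, not a SUSE tool.
# With no argument it reads /var/log/messages (usually requires root).
count_reclaim_failures() {
    log_file="${1:-/var/log/messages}"
    # -F matches the message as a fixed string; -c prints the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps the helper usable under "set -e".
    grep -c -F 'NFS: __nfs4_reclaim_open_state: Lock reclaim failed!' "$log_file" || true
}
```

Running count_reclaim_failures with no argument prints the number of matching lines in /var/log/messages. A count of zero does not rule out the problem, since a process may hang without this warning being logged.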
After tcpdump and vmcore analysis, it was determined that Azure Files (the NFS server) was mishandling the situation. A close/open race condition arose which the server did not handle correctly and which could not be resolved. Any perceived "hang" was actually the client looping through repeated requests and server errors.
Within the code of Azure Files, Microsoft has corrected at least one aspect of their handling of this situation. But it has not been confirmed that all possible variations of this scenario are resolved. Some problems may remain, bringing about similar symptoms.
If this type of problem is still encountered, it is likely that Azure support will need to be contacted to explore the problem more deeply.
Separately, a workaround has also been created by SUSE that should eliminate this kind of race. The NFS client can be told to serialize all "open" operations so they cannot be done in parallel by different processes. Parallel opens should be safe when they are handled in accordance with the NFS 4 protocol specifications. However, if the NFS Server is not reacting to these events correctly, then forcing the client side to do them serially should avoid the server side malfunction.
Note, again, that this is only a workaround. It would be best for the server side to fix its handling of the situation. But this workaround may be helpful in a couple of ways:
A. If the workaround is successful, then this is a likely indicator that the problem encountered is similar to what has been seen in the past.
B. The workaround may ensure proper functionality in production while Microsoft and SUSE pursue additional analysis of any remaining bugs. Note, however, that analysis will require data to be gathered from a system where the problem is reproducible. It may therefore be necessary to keep at least one system on which the problem can still be reproduced.
To use the workaround:
1. The Linux kernel on the SLES NFS client must be relatively up-to-date:
15 SP6 would need at least kernel 6.4.0-150600.23.17
15 SP5 would need at least kernel 5.14.21-150500.55.73.1
15 SP4 would need at least LTSS kernel 5.14.21-150400.24.128.1
15 SP3 would need at least LTSS kernel 5.3.18-150300.59.170.1
2. Create the file:
/etc/modprobe.d/nfs-workaround.conf
containing the line
options nfsv4 serialize_opens=Y
3. Then reboot.
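The steps above can be sketched as shell helpers. This is a minimal sketch, not a SUSE-provided script: the function names kernel_at_least and write_workaround_conf are hypothetical, the version comparison assumes GNU "sort -V" is available, and it assumes the running kernel reports a "-default" flavor suffix in "uname -r" (other flavors would need a different suffix stripped).

```shell
#!/bin/sh
# kernel_at_least MIN CURRENT
# Succeeds if CURRENT is the same as or newer than MIN.
# Both names are illustrative; comparison relies on GNU "sort -V".
kernel_at_least() {
    min="$1"
    # Strip the "-default" flavor suffix (assumed) so only version numbers are compared.
    cur="${2%-default}"
    lowest=$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}

# write_workaround_conf [DIR]
# Writes the modprobe option from step 2 into DIR/nfs-workaround.conf.
# DIR defaults to /etc/modprobe.d (writing there requires root).
write_workaround_conf() {
    conf_dir="${1:-/etc/modprobe.d}"
    printf 'options nfsv4 serialize_opens=Y\n' > "$conf_dir/nfs-workaround.conf"
}

# Example for 15 SP6 (run as root, then reboot):
#   kernel_at_least 6.4.0-150600.23.17 "$(uname -r)" && write_workaround_conf
```

Substitute the minimum kernel version listed in step 1 for the installed service pack. The reboot in step 3 is still required so that the nfsv4 module is loaded with the new option.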