SUSE Support

FireEye cybersecurity monitor causing periods of high CPU utilization, missing cluster heartbeats, and cluster fencing.

This document (000019690) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 12 SP4
(This case occurred on SLES 12 SP4, but it is likely applicable to various other versions of Linux, from SUSE or otherwise.)

Situation

NOTE:  The case described here occurred with Oracle RAC (Real Application Cluster).  Other evidence has shown that similar symptoms can happen with SUSE High Availability (HA) clustering.  It is also possible that similar failures or timeouts could occur with any application which is sensitive to delays. 

Two or more SLES 12 SP4 systems are running an Oracle RAC (Real Application Cluster). Occasionally, the Oracle software fences a node because it detects communication timeouts. The timeouts center on the UDP heartbeat of the cluster. Often, a partial timeout is detected but communication recovers in time to avoid fencing the node. For example, the Oracle cluster software may log the following warnings:

2020-06-24 22:01:46.116 [OCSSD(12165)]CRS-1612: Network communication with node server1a (1) has been missing for 50% of the timeout interval.  If this persists, removal of this node from cluster will occur in 14.840 seconds
2020-06-24 22:01:51.118 [OCSSD(12165)]CRS-1727: Network communication between this node 'server1b' (2) and node 'server1a' (1) re-established. Node removal no longer imminent.
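
To see how often these partial timeouts occur, the warnings can be extracted from the cluster log. The following is a minimal sketch, assuming the log lines have the same layout as the two samples above (date and time in the first two fields). The function name summarize_heartbeat_warnings is hypothetical and not part of any Oracle or SUSE tool.

```shell
# Hypothetical helper: read an OCSSD log on stdin and print one line per
# heartbeat warning (CRS-1612) or recovery (CRS-1727) message.
# Assumes the date and time are the first two whitespace-separated fields,
# as in the sample log lines in this document.
summarize_heartbeat_warnings() {
    awk '
        /CRS-1612:/ { print $1, $2, "missed" }
        /CRS-1727:/ { print $1, $2, "recovered" }
    '
}
```

Feed the function the relevant log file on standard input. The location of the OCSSD log depends on the Oracle installation and is not specified here.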

Resolution

A cybersecurity monitor from FireEye is running on the systems. During periods of high activity, one of FireEye's components, a real-time monitor, consumes enough system resources that other processes (even other real-time processes) cannot get work done.

The FireEye agent process is "xagt" and in this particular case, the version reported was:

# /opt/fireeye/bin/xagt -v
v31.28.4

The excessive activity is apparently caused by the interaction between auditd (the Linux Audit Daemon) and FireEye's xagt, which also contains an auditing process.

Potential options to deal with the problem behavior are:

Upgrade the FireEye agent to version 32.x.
-or-
Disable FireEye's real-time monitoring.
-or-
Disable the Linux audit daemon (auditd).
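
Before choosing among these options, it can help to confirm whether the installed agent is older than 32.x. The following is a minimal sketch, assuming the agent reports its version in the form shown above (for example, v31.28.4). The function name xagt_needs_upgrade is hypothetical.

```shell
# Hypothetical helper: succeed (exit status 0) if a version string such as
# "v31.28.4" has a major version lower than 32.
xagt_needs_upgrade() {
    version="${1#v}"        # strip the leading "v", if present
    major="${version%%.*}"  # keep only the text before the first dot
    [ "$major" -lt 32 ]
}

# Example use on a host with the agent installed (path taken from this document):
#   xagt_needs_upgrade "$(/opt/fireeye/bin/xagt -v)" && echo "agent is older than 32.x"
#
# To check whether auditd is currently running, assuming it is managed as a
# systemd unit named auditd:
#   systemctl is-active auditd
```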

For more details, please see the article published by FireEye at:

https://community.fireeye.com/s/article/000002856

(Access to that article may require an account at fireeye.com.)

Additional Information

A system call trace (strace) of Oracle's cluster process (OCSSD) was performed, along with packet analysis of the UDP heartbeat that OCSSD generates. These showed that every call by OCSSD to send a heartbeat packet put a packet on the wire, and that those packets were successfully delivered from node to node. However, during some periods OCSSD was apparently unable to make system calls, which interrupted the heartbeat. Those periods were traced to excessive activity by FireEye's xagt.
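
A similar analysis can be approximated by looking for long pauses between consecutive system calls in a timestamped trace. The following is a rough sketch, not the procedure used in the original case. It assumes the trace was captured with strace -ttt, so that each line begins with a timestamp in seconds since the epoch. The function name find_syscall_gaps and the default threshold of 1.0 second are hypothetical.

```shell
# Hypothetical helper: read "strace -ttt" style output on stdin and report
# every pause between consecutive lines that exceeds a threshold in seconds.
# Assumes the timestamp is the first field of each line; when strace also
# prints a PID prefix (for example with -f), the field position differs and
# the script must be adjusted.
find_syscall_gaps() {
    awk -v max_gap="${1:-1.0}" '
        NR > 1 && ($1 - prev) > max_gap {
            printf "%.6f gap of %.3f s\n", prev, $1 - prev
        }
        { prev = $1 }
    '
}

# Example use (OCSSD_PID is a placeholder for the process ID of OCSSD):
#   strace -ttt -e trace=network -p "$OCSSD_PID" 2>&1 | find_syscall_gaps 1.0
```

Each reported line gives the timestamp of the last system call before the pause and the length of the pause, which can then be compared against CPU usage of xagt over the same period.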

NOTE: Much of the information in this document comes from third parties and has not been directly verified by SUSE. It is provided as a convenience to our customers who may run into the same or similar issues.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000019690
  • Creation Date: 20-Aug-2020
  • Modified Date: 24-Feb-2022
    • SUSE Linux Enterprise Server
