Add more power to Prometheus

A while ago I published this article: https://www.suse.com/c/discover-the-hidden-treasure/. Today, I would like to share my most recent experiences with you. This post will show you the value collectd can add to your Prometheus-based observability solution. Let me give you a brief introduction to collectd: it is a daemon that collects system statistics from various sources, e.g. the operating system, applications, logfiles, and external devices. It stores this information or makes it available over the network. The statistics are very fine-grained and can be used to monitor systems, find performance bottlenecks, and predict future system load.

The power of Collectd Plugins

In the previous blog post, I used the prometheus-node_exporter textfile collector to extend the metrics data with hardware information. I couldn’t get the disk information directly from an exporter because the disks are “hidden” behind a RAID controller, so their physical properties, including S.M.A.R.T. data, are not visible. Today I will show you another option to transfer additional information into a metric and process it with Prometheus, Grafana, and the Prometheus Alertmanager.

For collectd, we provide many plugins to collect data. To store or export the data, I would like to focus on the Prometheus export method. To make this happen, the collectd.conf must look like this:

# view /etc/collectd.conf
...
LoadPlugin write_prometheus
...
<Plugin write_prometheus>
 Port "9103"
</Plugin>
...
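
To let Prometheus scrape this endpoint, a scrape job must point at port 9103. A minimal sketch, assuming your collectd host is reachable as myhost.example.com (replace host name and file path with your own):

# vi /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: collectd
    static_configs:
      - targets: ['myhost.example.com:9103']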

The “exec” Plugin

I would like to show you two scripts that demonstrate how you can transfer additional information into a metric. The first example shows how to check if a reboot is pending, and the second one demonstrates checking the HANA log segment file flags (Free, RetainedFree, Truncated, …).

Knowing is better than thinking

Imagine the following scenario: security patches that require a reboot are rolled out to your servers. A reboot is triggered remotely, but one or more of your systems does not reboot. On these systems the newly applied patches are not yet active!

There are multiple ways to check this, e.g.:

  • checking the uptime (see the PromQL sketch below),
  • verifying that the expected version of a package is running,
  • or checking if the “needs-restarting” flag is set for this host.
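
If you already run the prometheus-node_exporter, the uptime check can be done with a simple PromQL query. A sketch, assuming the standard node_boot_time_seconds metric is scraped (the 30-day threshold is just an example):

# Hosts that have been up for more than 30 days and thus likely missed the reboot:
(time() - node_boot_time_seconds) > 30 * 24 * 3600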

The tool needs-restarting lists running processes that might still use files and libraries that were deleted or updated by recent upgrades. The following script collects this information and hands it over to collectd.

# cat /opt/needs-reboot.sh

#!/bin/bash
# Report whether a reboot is pending, based on the exit code of needs-restarting.
HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -f)}"
INTERVAL="${COLLECTD_INTERVAL:-60}"

while sleep "$INTERVAL"; do
    # needs-restarting -r exits non-zero if a reboot is required.
    needs-restarting -r >/dev/null 2>&1
    RB=$?
    echo "PUTVAL \"$HOSTNAME/exec-needs-restart/gauge-reboot_required\" interval=$INTERVAL N:$RB"
done
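
Make the script executable and, optionally, give it a short manual test run before wiring it into collectd (the hostname shown is just an example; stop the loop with Ctrl-C):

# chmod 755 /opt/needs-reboot.sh
# COLLECTD_INTERVAL=5 /opt/needs-reboot.sh
PUTVAL "myhost.example.com/exec-needs-restart/gauge-reboot_required" interval=5 N:0
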
Now we configure collectd to run the script. First, we activate the exec plugin and expose the metrics like a Prometheus exporter.
# vi /etc/collectd.conf
...
LoadPlugin exec
LoadPlugin write_prometheus
...
<Plugin exec>
     Exec "collectd:users" "/opt/needs-reboot.sh"
</Plugin>
...
<Plugin write_prometheus>
Port "9103"
</Plugin>
...

As the collectd exec plugin refuses to run scripts as root, we must create a new user who can execute the script.

# useradd -m -g users -s /bin/bash -r collectd
However, collectd still requires additional privileges to work correctly. If not already done, create a copy of the original systemd service file of collectd.service under /etc/systemd/system/. The collectd exec plugin requires these capabilities: CAP_SETUID and CAP_SETGID.
# /etc/systemd/system/collectd.service
[Unit]
Description=Collectd statistics daemon
Documentation=man:collectd(1) man:collectd.conf(5)
After=local-fs.target network-online.target
Requires=local-fs.target network-online.target

[Service]
ExecStart=/usr/sbin/collectd
EnvironmentFile=-/etc/sysconfig/collectd
EnvironmentFile=-/etc/default/collectd

# A few plugins won't work without some privileges, which you'll have to specify using the CapabilityBoundingSet directive below.
#
# Here's an (incomplete) list of the plugins' known capability requirements:
#   ceph            CAP_DAC_OVERRIDE
#   dns             CAP_NET_RAW
#   exec            CAP_SETUID CAP_SETGID
#   intel_rdt       CAP_SYS_RAWIO
#   intel_pmu       CAP_SYS_ADMIN
#   iptables        CAP_NET_ADMIN
#   ping            CAP_NET_RAW
#   processes       CAP_NET_ADMIN  (CollectDelayAccounting only)
#   smart           CAP_SYS_RAWIO
#   turbostat       CAP_SYS_RAWIO
# By default, drop all capabilities:
CapabilityBoundingSet=CAP_SETUID CAP_SETGID
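
If you prefer not to maintain a full copy of the unit file, a systemd drop-in override achieves the same; a minimal sketch:

# systemctl edit collectd.service
[Service]
CapabilityBoundingSet=CAP_SETUID CAP_SETGID
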
The next step is to reload systemd and then restart collectd, or enable and start it if it was not in use before.
# systemctl daemon-reload
# systemctl restart collectd.service
or
# systemctl enable --now collectd.service
Finally, check if the new metric is available. This can be done with a web browser by going to http://<ip of the host with collectd running>:9103/metrics, or with curl from a terminal.
# curl -s http://localhost:9103/metrics | grep exec
# HELP collectd_exec_gauge write_prometheus plugin: 'exec' Type: 'gauge', Dstype: 'gauge', Dsname: 'value'
# TYPE collectd_exec_gauge gauge
collectd_exec_gauge{exec="needs-restart",type="reboot_required",instance="ls3331"} 0 1683782520768

The metric is:

collectd_exec_gauge{exec="needs-restart",type="reboot_required",instance="ls3331"}

In the example above, the first value, a zero, means that no reboot is required; the second value is a timestamp in milliseconds.
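
Once Prometheus scrapes the endpoint, the metric can also be queried via the Prometheus HTTP API; a sketch, assuming Prometheus listens on localhost:9090:

# curl -s http://localhost:9090/api/v1/query \
      --data-urlencode 'query=collectd_exec_gauge{type="reboot_required"} == 1'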

Make it visible

Bringing it all together with Grafana could look like this:

[Grafana panel: needs-restarting]
Enabling alerts via the Prometheus Alertmanager would require a rule like this:
- name: security-patch-monitoring
  rules:
  - alert: reboot-required
    expr: collectd_exec_gauge{type="reboot_required"} == 1
    annotations:
      title: Node {{ $labels.instance }} reboot required.
      description: The server {{ $labels.instance }} is marked with the flag reboot-needed. It is mandatory to reboot the host to apply the latest patches.
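
Before reloading Prometheus, you can validate the rule file with promtool (the file path here is just an example):

# promtool check rules /etc/prometheus/rules/security-patch-monitoring.yml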

Build a second line of defense and avoid a disk-full situation

Now let’s do something similar with SAP HANA. Consider the following setup: two HANA nodes configured with HANA System Replication. Replication traffic runs over a dedicated network, while admin, client, and monitoring traffic use a different network segment.
If the system replication network is not operational, the HANA DB will be stuck for 30 seconds (default), after which HANA reacts normally again. From this moment on, all HANA log segment files are retained and the “RetainedFree” flag is applied to them. This flag only changes back to “Free” after the log data has been shipped to the second node. Until system replication resumes, HANA cannot reuse these log segment files and new files need to be created. Log segment files for tenant databases are 1GB in size by default, so the number of log segment files will continue to grow and a disk-full situation can occur. With the HANA tool hdblogdiag we can check how many files carry the “RetainedFree” flag, and by combining hdblogdiag and collectd we can make this information available as a metric.
The following script switches to the sidadm user (ha1adm in this example) and executes hdblogdiag to retrieve the required data.
# vi /opt/hanasr-broken.sh
#!/bin/bash
# Count the HANA log segment files that carry the "RetainedFree" flag.
SHOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -s)}"
INTERVAL="${COLLECTD_INTERVAL:-60}"

while sleep "$INTERVAL"; do
    RB=$(su - ha1adm -c "hdblogdiag seglist /hana/log/HA1/mnt00001/hdb00003.00003/ | grep RetainedFree | wc -l")
    if [ $? -ne 0 ]; then
        # Report an obviously invalid count to signal that the check itself failed.
        RB=-1
    fi
    echo "PUTVAL \"$SHOSTNAME/exec-hanasr-broken/gauge-retainedfree\" interval=$INTERVAL N:$RB"
done

Now extend the collectd plugin exec section with the new script:

<Plugin exec>
     Exec "collectd:users" "/opt/needs-reboot.sh"
     Exec "collectd:users" "/opt/hanasr-broken.sh"
</Plugin>

Because the collectd user is performing a user switch, we have to make some additional adjustments.

# vi /etc/systemd/system/collectd.service
...
CapabilityBoundingSet=CAP_SETUID CAP_SETGID
LimitNOFILE=1048576
...
# vi /etc/pam.d/su-l
#%PAM-1.0
auth sufficient pam_rootok.so
auth [success=ignore default=1] pam_succeed_if.so user = ha1adm
auth sufficient pam_succeed_if.so use_uid user = collectd
auth include common-auth
...
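
You can verify the PAM rule before restarting collectd; a sketch of a manual test, run as root with the user names from above:

# su - collectd -c 'su - ha1adm -c "echo user switch works"'
user switch works
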
Update systemd with systemctl daemon-reload, restart collectd, and check if the new metric is available. If everything went well and the metric is available, you can add a new Prometheus alert rule, e.g.:
- name: hana-sr-log_segment-monitoring
  rules:
  - alert: log-segment_RetainedFree_flag
    expr: avg_over_time(collectd_exec_gauge{type="retainedfree"}[1h]) >= 2
    labels:
      severity: warning
    annotations:
      title: HANA SR on node {{ $labels.instance }} seems broken.
      description: HANA log segment files on {{ $labels.instance }} with the flag RetainedFree are increasing. Depending on the value of logshipping_max_retention_size (default is 1TB per log volume), your system can run into a disk-full situation.
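
To get notified, Alertmanager needs a route and a receiver for these alerts; a minimal sketch with a hypothetical mail setup (smarthost, sender, and recipient are placeholders):

# vi /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'mail.example.com:25'
  smtp_from: 'alertmanager@example.com'
route:
  receiver: ops-mail
receivers:
  - name: ops-mail
    email_configs:
      - to: 'ops@example.com'
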
SAP HANA is continuously improving. Since HANA 2.0 SPS06 there is a new parameter available that can help to avoid a disk-full situation as well. Check out logshipping_max_retention_size or SAP note 3142505 – Limit the Maximum Size Used by HANA Log Segments.

Thank you for your patience, that’s all for today. In case you prefer a partial video version of this post, check out the SUSECON 23 Digital session [PROD-1076]
See you soon,
Bernd