Help! Saptune says my system is degraded!
Recently we again got questions about the system state in the output of saptune status, so it's time to talk about it.
If everything is fine, you should get an output like this:
# saptune status
...
system state: running
...
But if not, you will get:
# saptune status
...
system state: degraded
...
A degraded system sounds awful!
The state comes directly from the command systemctl is-system-running. If we check the man page of systemctl, we find:
...
is-system-running
    Checks whether the system is operational. This returns success (exit code 0) when the
    system is fully up and running, specifically not in startup, shutdown or maintenance
    mode, and with no failed services.
    ...
    degraded
        The system is operational but one or more units failed.
    ...
...
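You can easily run that check yourself to see what saptune sees. Just as a sketch (the output below is only an example; the reported state depends on the actual condition of your system):

# systemctl is-system-running
degraded
# echo $?
1

According to the man page, the command only returns success (exit code 0) when the state is "running", so any other state also shows up as a non-zero exit code.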
So the reason for a degraded system is usually a failed unit.
To figure out which unit failed, you can run either saptune_check or systemctl list-units --state=failed:
# saptune_check
...
[WARN] System is in status "degraded". Failed services are: saptune.service -> Check the cause and reset the state with systemctl reset-failed!
...
# systemctl list-units --state=failed
  UNIT              LOAD   ACTIVE SUB    DESCRIPTION
● saptune.service   loaded failed failed Optimise system for running SAP workloads
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
In this case saptune.service failed for some reason and caused the degraded state. If we investigate further, we can see why:
# systemctl status saptune.service
● saptune.service - Optimise system for running SAP workloads
   Loaded: loaded (/usr/lib/systemd/system/saptune.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-08-11 14:52:20 CEST; 12min ago
  Process: 2048 ExecStart=/usr/sbin/saptune service apply (code=exited, status=1/FAILURE)
 Main PID: 2048 (code=exited, status=1/FAILURE)

Aug 11 14:52:20 sles4sap15sp3 systemd[1]: Starting Optimise system for running SAP workloads...
Aug 11 14:52:20 sles4sap15sp3 saptune[2048]: ERROR: found an active sapconf, so refuse any action
Aug 11 14:52:20 sles4sap15sp3 systemd[1]: saptune.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 14:52:20 sles4sap15sp3 systemd[1]: saptune.service: Failed with result 'exit-code'.
Aug 11 14:52:20 sles4sap15sp3 systemd[1]: Failed to start Optimise system for running SAP workloads.
Ah, saptune refused to start because sapconf has already tuned the system!
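How you fix this depends on which of the two tools you want to keep. If you decide for saptune, a way out could look like the sketch below. The commands are examples, not a verified procedure for every setup, so double-check them against the saptune documentation first (recent saptune versions also offer saptune service takeover to switch over from sapconf):

# systemctl disable --now sapconf.service    # stop and disable sapconf
# systemctl reset-failed saptune.service     # clear the failed state of the unit
# systemctl enable --now saptune.service     # enable and start saptune
# systemctl is-system-running
running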
And this is the very reason why saptune prints the system state in the first place. In the past we have often seen customer setups where both tools were mixed, with strange results.
To spot such an easily solvable problem, both saptune status and saptune_check report issues with systemd's system state.
But it is not always sapconf.service or saptune.service that shows up as failed; sometimes other units do. In such cases saptune status has found issues which most likely have nothing to do with saptune itself and which, in most cases, will not even prevent saptune from doing its job.
So if saptune still works, why report it and raise concerns?
Well, not reporting it would mean deliberately hiding potential problems.
We think you should know if something might be wrong; there could be a problem lurking in the shadows which you haven't spotted yet and which waits to strike at the most inconvenient time!
The feedback we have received so far confirms this decision. Lately this even helped to discover and fix a bug in a service that had nothing to do with saptune and might otherwise not have been found for some time. Mission accomplished.
So if you see a degraded system state and neither sapconf.service nor saptune.service is involved, your tuning for SAP workloads is most certainly fine. Best check it with saptune note verify to be on the safe side.
Nevertheless you should investigate the reasons for the failed units to be sure that they don’t indicate a bigger problem.
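Once the cause is understood and fixed, you can clear the failed state so that the system reports running again, just as saptune_check suggests above. A minimal sketch with a placeholder unit name (some-failed.service stands for whatever unit is listed on your system):

# systemctl status some-failed.service        # find out why the unit failed
# systemctl reset-failed some-failed.service  # clear the failed state after fixing the cause
# systemctl is-system-running
running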
By the way, in the upcoming version 3.1 we will rename it to systemd system status and add a few explanatory lines to the output, so that it is more obvious what is going on and what to do next.
So long!