Recommended update for slurm_23_02
| Announcement ID: | SUSE-RU-2023:4334-1 |
|---|---|
| Rating: | moderate |
| References: | |
| Affected Products: | |
An update that has one fix can now be installed.
Description:
This update for slurm_23_02 fixes the following issues:
- Updated to version 23.02.5 with the following changes:
- Bug Fixes:
- Revert a change in 23.02 where SLURM_NTASKS was no longer set in the job's environment when --ntasks-per-node was requested. The method by which it is set, however, is different and should be more accurate in more situations.
- Change pmi2 plugin to honor the SrunPortRange option. This matches the new behavior of the pmix plugin in 23.02.0. Note that neither of these plugins makes use of the MpiParams=ports= option, and previously they were limited only by the system's ephemeral port range.
- Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if a node features plugin is configured.
- Fix and prevent reoccurring reservations from overlapping.
- job_container/tmpfs - Avoid attempts to share BasePath between nodes.
- With CR_Cpu_Memory, fix node selection for jobs that request gres and --mem-per-cpu.
- Fix a regression from 22.05.7 in which some jobs were allocated too few nodes, thus overcommitting cpus to some tasks.
- Fix a job being stuck in the completing state if the job ends while the primary controller is down or unresponsive and the backup controller has not yet taken over.
- Fix slurmctld segfault when a node registers with a configured CpuSpecList while slurmctld configuration has the node without CpuSpecList.
- Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not registering by ResumeTimeout.
- slurmstepd - Avoid cleanup of config.json-less containers spooldir getting skipped.
- Fix scontrol segfault when 'completing' command requested repeatedly in interactive mode.
- Properly handle a race condition between bind() and listen() calls in the network stack when running with SrunPortRange set.
- Federation - Fix revoked jobs being returned regardless of the -a/--all option for privileged users.
- Federation - Fix canceling pending federated jobs from non-origin clusters which could leave federated jobs orphaned from the origin cluster.
- Fix sinfo segfault when printing multiple clusters with --noheader option.
- Federation - fix clusters not syncing if clusters are added to a federation before they have registered with the dbd.
- node_features/helpers - Fix node selection for jobs requesting changeable features with the | operator, which could prevent jobs from running on some valid nodes.
- node_features/helpers - Fix inconsistent handling of & and |, where an AND'd feature was sometimes AND'd to all sets of features instead of just the current set. E.g. foo|bar&baz was interpreted as {foo,baz} or {bar,baz} instead of how it is documented: {foo} or {bar,baz}.
- Fix job accounting so that when a job is requeued its allocated node count is cleared. After the requeue, sacct will correctly show that the job has 0 AllocNodes while it is pending or if it is canceled before restarting.
- sacct - AllocCPUS now correctly shows 0 if a job has not yet received an allocation or if the job was canceled before getting one.
- Fix intel OneAPI autodetect: detect the /dev/dri/renderD[0-9]+ GPUs, and do not detect /dev/dri/card[0-9]+.
- Fix node selection for jobs that request --gpus and a number of tasks fewer than GPUs, which resulted in incorrectly rejecting these jobs.
- Remove MYSQL_OPT_RECONNECT completely.
- Fix cloud nodes in POWERING_UP state disappearing (getting set to FUTURE) when an scontrol reconfigure happens.
- openapi/dbv0.0.39 - Avoid assert / segfault on missing coordinators list.
- slurmrestd - Correct memory leak while parsing OpenAPI specification templates with server overrides.
- Fix overwriting user node reason with system message.
- Prevent deadlock when rpc_queue is enabled.
- slurmrestd - Correct OpenAPI specification generation bug where fields with overlapping parent paths would not get generated.
- Fix memory leak as a result of a partition info query.
- Fix memory leak as a result of a job info query.
- For step allocations, fix --gres=none sometimes not ignoring gres from the job.
- Fix --exclusive jobs incorrectly gang-scheduling where they shouldn't.
- Fix allocations with CR_SOCKET, gres not assigned to a specific socket, and block core distribution potentially allocating more sockets than required.
- Revert a change in 23.02.3 where Slurm would kill a script's process group as soon as the script ended instead of waiting as long as any process in that process group held the stdout/stderr file descriptors open. That change broke some scripts that relied on the previous behavior. Setting time limits for scripts (such as PrologEpilogTimeout) is strongly encouraged to avoid Slurm waiting indefinitely for scripts to finish.
- Fix slurmdbd -R not returning an error under certain conditions.
- slurmdbd - Avoid potential NULL pointer dereference in the mysql plugin.
- Fix regression in 23.02.3 which broke X11 forwarding for hosts when MUNGE sends a localhost address in the encode host field. This is caused when the node hostname is mapped to 127.0.0.1 (or similar) in /etc/hosts.
- openapi/[db]v0.0.39 - fix memory leak on parsing error.
- data_parser/v0.0.39 - fix updating qos for associations.
- openapi/dbv0.0.39 - fix updating values for associations with null users.
- Fix minor memory leak with --tres-per-task and licenses.
- Fix cyclic socket cpu distribution for tasks in a step where --cpus-per-task < usable threads per core.
- slurmrestd - For GET /slurm/v0.0.39/node[s], change format of node's energy field current_watts to a dictionary to account for unset value instead of dumping 4294967294.
- slurmrestd - For GET /slurm/v0.0.39/qos, change format of QOS's field "priority" to a dictionary to account for unset value instead of dumping 4294967294.
- slurmrestd - For GET /slurm/v0.0.39/job[s], the 'return code' code field in v0.0.39_job_exit_code will be set to -127 instead of being left unset where job does not have a relevant return code.
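
The documented interpretation of node feature expressions mentioned in the node_features/helpers fix (an & applies only to its own |-separated set) can be illustrated with a minimal sketch. This is not Slurm's parser: the function names parse_feature_sets and node_matches are hypothetical, and the sketch ignores parentheses, counts, and every other part of the constraint syntax.

```python
def parse_feature_sets(expression):
    """Split a feature expression into alternative sets of AND'd features.

    Illustrative only: handles just '&' and '|', with '&' binding tighter,
    which is the interpretation the advisory describes as documented.
    """
    alternatives = []
    for group in expression.split("|"):
        # Each '|'-separated group is its own set; '&' joins features
        # within that group only, never across groups.
        features = {name.strip() for name in group.split("&") if name.strip()}
        if features:
            alternatives.append(features)
    return alternatives


def node_matches(expression, node_features):
    """Return True if the node offers every feature of at least one set."""
    available = set(node_features)
    return any(required <= available
               for required in parse_feature_sets(expression))


if __name__ == "__main__":
    # foo|bar&baz yields two alternatives: {foo} and {bar, baz}.
    print(parse_feature_sets("foo|bar&baz"))
    # A node offering only "foo" satisfies the first alternative.
    print(node_matches("foo|bar&baz", ["foo"]))
```

Under the incorrect interpretation described in the fix, {foo,baz} or {bar,baz}, a node offering only foo would not match, which is how valid nodes could be excluded.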
- Other Changes:
- Remove --uid / --gid options from salloc and srun commands. These options did not work correctly since the CVE-2022-29500 fix in combination with some changes made in 23.02.0.
- Add the JobId to debug() messages indicating when cpus_per_task/mem_per_cpu or pn_min_cpus are being automatically adjusted.
- Change the log message warning for rate limited users from verbose to info.
- slurmstepd - Cleanup per task generated environment for containers in spooldir.
- Format batch, extern, interactive, and pending step ids into strings that are human readable.
- slurmrestd - Reduce memory usage when printing out job CPU frequency.
- data_parser/v0.0.39 - Add required/memory_per_cpu and required/memory_per_node to sacct --json and sacct --yaml and GET /slurmdb/v0.0.39/jobs from slurmrestd.
- gpu/oneapi - Store cores correctly so CPU affinity is tracked.
- Allow slurmdbd -R to work if the root assoc id is not 1.
- Limit periodic node registrations to 50 instead of the full TreeWidth. Since unresolvable cloud/dynamic nodes must disable fanout by setting TreeWidth to a large number, this would cause all nodes to register at once.
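
As a rough illustration of consuming the new required/memory_per_cpu and required/memory_per_node fields from sacct --json, the sketch below reads them from a JSON document. Several details are assumptions rather than facts from this advisory: the top-level "jobs" array, the "job_id" key, the idea that a value may arrive either as a plain number or as a dictionary with "set" and "number" keys, and the sample data itself. The helper names are hypothetical.

```python
import json

# Value the advisory says was previously dumped for unset 32-bit fields.
NO_VAL = 4294967294


def number_or_none(value):
    """Normalise a field to an int, or None when it is unset.

    Assumption: a dictionary-style value looks like
    {"set": bool, "number": int}; a plain integer is taken as-is
    unless it equals the old unset marker.
    """
    if isinstance(value, dict):
        if not value.get("set", False):
            return None
        return value.get("number")
    if value is None or value == NO_VAL:
        return None
    return value


def required_memory(sacct_json_text):
    """Map each job id to its required memory fields (assumed layout)."""
    document = json.loads(sacct_json_text)
    summary = {}
    for job in document.get("jobs", []):
        required = job.get("required", {})
        summary[job.get("job_id")] = {
            "memory_per_cpu": number_or_none(required.get("memory_per_cpu")),
            "memory_per_node": number_or_none(required.get("memory_per_node")),
        }
    return summary


# Hypothetical sample, not real sacct output.
SAMPLE = """{"jobs": [
  {"job_id": 101, "required": {
     "memory_per_cpu": {"set": true, "number": 2048},
     "memory_per_node": {"set": false, "number": 0}}},
  {"job_id": 102, "required": {
     "memory_per_cpu": 4294967294,
     "memory_per_node": 16384}}
]}"""

if __name__ == "__main__":
    for job_id, fields in required_memory(SAMPLE).items():
        print(job_id, fields)
```

In practice the text would come from running sacct --json on a system with this update installed. Checking the actual output before relying on any of the assumed keys is advisable.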
Patch Instructions:
To install this SUSE update use the SUSE recommended
installation methods like YaST online_update or "zypper patch".
Alternatively you can run the command listed for your product:
- SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2
zypper in -t patch SUSE-SLE-Product-HPC-15-SP2-LTSS-2023-4334=1
Package List:
- SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2 (aarch64 x86_64)
- slurm_23_02-plugins-23.02.5-150200.5.11.2
- slurm_23_02-pam_slurm-23.02.5-150200.5.11.2
- slurm_23_02-lua-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-node-23.02.5-150200.5.11.2
- slurm_23_02-auth-none-23.02.5-150200.5.11.2
- slurm_23_02-sql-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-lua-23.02.5-150200.5.11.2
- slurm_23_02-slurmdbd-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-debugsource-23.02.5-150200.5.11.2
- libnss_slurm2_23_02-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-torque-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-auth-none-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-cray-23.02.5-150200.5.11.2
- libpmi0_23_02-debuginfo-23.02.5-150200.5.11.2
- libpmi0_23_02-23.02.5-150200.5.11.2
- libslurm39-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-munge-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-node-debuginfo-23.02.5-150200.5.11.2
- libslurm39-23.02.5-150200.5.11.2
- slurm_23_02-pam_slurm-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-plugins-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-sview-23.02.5-150200.5.11.2
- slurm_23_02-sview-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-rest-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-plugin-ext-sensors-rrd-23.02.5-150200.5.11.2
- slurm_23_02-rest-23.02.5-150200.5.11.2
- slurm_23_02-munge-23.02.5-150200.5.11.2
- slurm_23_02-cray-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-slurmdbd-23.02.5-150200.5.11.2
- slurm_23_02-devel-23.02.5-150200.5.11.2
- perl-slurm_23_02-23.02.5-150200.5.11.2
- slurm_23_02-sql-23.02.5-150200.5.11.2
- perl-slurm_23_02-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-plugin-ext-sensors-rrd-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-23.02.5-150200.5.11.2
- libnss_slurm2_23_02-23.02.5-150200.5.11.2
- slurm_23_02-debuginfo-23.02.5-150200.5.11.2
- slurm_23_02-torque-23.02.5-150200.5.11.2
- SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2 (noarch)
- slurm_23_02-webdoc-23.02.5-150200.5.11.2
- slurm_23_02-doc-23.02.5-150200.5.11.2
- slurm_23_02-config-man-23.02.5-150200.5.11.2
- slurm_23_02-config-23.02.5-150200.5.11.2