Deploying SLURM PAM modules on SLE compute nodes


Security in High Performance Computing (HPC) environments has always been an exercise in relativity: striking a balance between implementing an adequate level of security and not interfering with “getting the science done”. In the simplest terms, the primary security realms HPC systems and clusters are exposed to are environmental and user facing; more specifically, the security posture of the data centers hosting HPC systems and how users are permitted to access them. Security at the cluster edges is certainly a worthwhile discussion. However, this article focuses on user access security within HPC clusters, where the science is getting done and security is at its most minimal by design. Implementing user access restrictions on cluster compute nodes may seem unnecessary at first blush.

Why would there be a need to restrict user access to compute nodes? Two primary reasons merit mention.

Maintaining fairness in cluster resource allocation and access.

The high CPU, GPU and memory density of modern HPC compute nodes provides sufficient resources for concurrent distributed workloads. Workloads on a compute node will usually belong to different users, and those workloads are understandably important to their respective owners. Moreover, research workloads may have normal runtimes measured in seconds, weeks or even months. If a user were to access a node and initiate work or processes not managed by the cluster scheduler or resource management facilities, and cause the node to crash, that would certainly not be fair to the other users of that node.

Maintaining accuracy in cluster metrics and trends

Real-time metrics are important for active visibility into cluster health and utilisation. Historical health and utilisation data for clusters can also be useful for computational capacity analysis, and even for insight into future cluster design needs. If users access nodes where they have workloads running and augment or adjust them with additional processes and resource consumption, again not managed by the cluster scheduler or resource management facilities, it is desirable to capture those metrics as well.

Solution overview

The Simple Linux Utility for Resource Management (SLURM) software stack includes Pluggable Authentication Modules (PAM) that can be used to manage user access to compute nodes in clusters it manages.

Software packages

Installation of the “PAM module for restricting access to compute nodes via SLURM” package (slurm-pam_slurm) on a node where the “Minimal SLURM node” package (slurm-node) is installed provides the following two PAM modules.

/lib64/security/pam_slurm.so: Considered the legacy implementation of the two modules. This module’s functionality is limited to preventing users from logging into nodes where they do not own compute jobs.

/lib64/security/pam_slurm_adopt.so: The preferred and most capable module. In addition to preventing users from logging into nodes where they do not own compute jobs, it tracks other processes spawned by a user’s SSH connection to that node. These processes are adopted as external steps to the user’s job. Those external steps are not only integrated with SLURM’s accounting facilities, but also its control group facilities (cpuset, memory, etc.) to ensure the adopted processes are contained and even terminated properly.
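For reference, the pam_slurm_adopt.so module accepts options that tune its behaviour when a user has no job, or more than one job, on a node. The account entry below is a sketch with two such options made explicit; consult the pam_slurm_adopt man page for the full list and the defaults shipped with the installed SLURM release.

account    required    pam_slurm_adopt.so    action_no_jobs=deny action_unknown=newest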

Recommendations

Disable systemd session management on HPC compute nodes

Systemd utilises the “pam_systemd.so” module, in the session module interface of the system PAM stack, to register user sessions with the systemd login manager. The login manager is provided by the systemd-logind.service unit. Because the systemd module and the login manager service manage the default user control group (cgroup) hierarchy, they conflict with the cgroup facilities of the pam_slurm_adopt.so module. Both systemd components can be disabled for login services on HPC compute nodes.

More on how this is accomplished later in the article.

Requirements

SLURM must load the task plugin that provides cgroup support, and the “contain” value must be added to the PrologFlags directive.

Modify the /etc/slurm/slurm.conf file:

TaskPlugin=task/cgroup
PrologFlags=contain

* Not required by the pam_slurm.so module.
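The task/cgroup plugin also reads its own configuration from the /etc/slurm/cgroup.conf file. The snippet below is a minimal sketch, assuming CPU cores and memory should be constrained; the available parameters and their defaults vary between SLURM releases, so verify them against the cgroup.conf man page. Changes to slurm.conf and cgroup.conf take effect once the SLURM daemons are restarted.

/etc/slurm/cgroup.conf:

ConstrainCores=yes
ConstrainRAMSpace=yes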

Best practice recommendations

The “UsePAM=1” directive in the slurm.conf file should be understood before it is implemented. It is not required by either the “pam_slurm.so” or the “pam_slurm_adopt.so” module. Rather, it is used to provision a user’s environment on a compute node in place of the standard user profile captured from an origin login or submission node.

Using “UsePAM=1” also requires a custom PAM file to implement the desired environment. An example is provided below.

/etc/pam.d/slurm:

auth       required     pam_localuser.so
account    required     pam_unix.so
session    required     pam_limits.so

Using the “UsePAM=1” directive in the slurm.conf file together with the custom slurm PAM file provides an alternate method of enforcing resource limits in environments where the pam_slurm.so module is used. In most environments it should not be required.

The SSH daemon on compute nodes should support the use of PAM services to authenticate users.

Modify the /etc/ssh/sshd_config file:

UsePAM yes

Best practice recommendations

Require key-based authentication for root user logins and disable simple password-based authentication.

PermitRootLogin without-password
PubkeyAuthentication yes
ChallengeResponseAuthentication yes
PasswordAuthentication no

* Challenge Response Authentication supports modern forms of authentication in addition to prompting for, accepting, and validating passwords.
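After editing the /etc/ssh/sshd_config file the syntax can be validated and the daemon restarted. The commands below are a sketch; sshd is the default service name on SLE.

~# sshd -t
~# systemctl restart sshd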

Deployment

This deployment example demonstrates the configuration of the “pam_slurm_adopt.so” module.

On the target compute nodes.

  1. Ensure the /etc/slurm/slurm.conf file requirements are met.
  2. Ensure the /etc/ssh/sshd_config file requirements are met.
  3. Install the required packages. A verification sketch for steps 1 and 2 follows the installation command.

~# zypper install slurm-pam_slurm
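Before continuing, the requirements from steps 1 and 2 can be confirmed on each node. The commands below are a sketch; scontrol reports the configuration in use by the SLURM daemons and “sshd -T” prints the effective (lower-cased) daemon settings.

~# scontrol show config | grep -i -E 'taskplugin|prologflags'
~# sshd -T | grep -i usepam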

  4. Create local groups on compute nodes that will be used to permit access to administrative users irrespective of compute job ownership.

/etc/group:

hpc_admin_g:x:11001:root,admin
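If the group does not already exist it can be created and populated with the standard tools rather than by editing the /etc/group file directly. The commands below are a sketch; the group name, GID and the “admin” account are taken from the example above.

~# groupadd -g 11001 hpc_admin_g
~# usermod -aG hpc_admin_g admin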

  5. Modify the configuration file used by the “pam_access.so” module (to be implemented in a later step) to support administrative user access.

/etc/security/access.conf:

+:hpc_admin_g:192.168.0.0/24
+:root:ALL
-:ALL:ALL

This configuration permits access to members of the “hpc_admin_g” group from the “192.168.0.0/24” network, to user “root” from all networks, and denies access to all others.

* Reference the man file for the pam_access.so module for additional information on this configuration.

  6. Create custom PAM files for the SLURM module implementation.

Create new files in the /etc/pam.d directory with the “-pc” suffix, then create symbolic links to those files (conforming to the conventions of the /etc/pam.d directory). These new files will be referenced in the /etc/pam.d/sshd file, which is modified at the end of this step.

The use of custom files retains as much default content in the PAM service files as possible. This is useful in the event the system needs to be rolled back to the default authentication services and to ensure system patches do not modify customised content.

~# cd /etc/pam.d

~# cp common-account-pc slurm-common-account-pc
~# ln -s slurm-common-account-pc slurm-common-account

Modify the new file.

/etc/pam.d/slurm-common-account-pc:

#%PAM-1.0

account    required     pam_unix.so         try_first_pass
account    optional     pam_sss.so          use_first_pass
account    sufficient   pam_access.so
account    required     pam_slurm_adopt.so

* This configuration uses the SSSD for Active Directory authentication in addition to local UNIX authentication. The “pam_sss.so” module may not be present in other configurations.

~# cp common-session-pc slurm-common-session-pc
~# ln -s slurm-common-session-pc slurm-common-session

Modify the new file.

/etc/pam.d/slurm-common-session-pc:

#%PAM-1.0

# session optional        pam_systemd.so
session required        pam_limits.so
session required        pam_unix.so     try_first_pass
session optional        pam_sss.so
session optional        pam_umask.so
session optional        pam_env.so

* The pam_systemd.so module is removed from the customised configuration.

Modify the PAM configuration file used by the SSH service to reference the customised files.

/etc/pam.d/sshd:

#%PAM-1.0

auth       requisite   pam_nologin.so
auth       include     common-auth
account    requisite   pam_nologin.so
account    include     slurm-common-account
password   include     common-password
session    required    pam_loginuid.so
session    include     slurm-common-session
session    optional    pam_lastlog.so   silent noupdate showfailed

  7. Disable and mask the systemd login service daemon.

~# systemctl stop systemd-logind
~# systemctl mask systemd-logind
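The state of the unit can be confirmed afterwards; a masked unit cannot be started until it is unmasked. This check is optional.

~# systemctl is-enabled systemd-logind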

Taking the new configuration for a walk

The required components are now configured, and the new user access model can be tested against a compute node.

The following commands are issued from the job submission node that has access to compute nodes in the cluster.
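As a sketch of the test sequence behind the figures below, and assuming a hypothetical compute node named node01, an unprivileged user session without a job should be rejected, while a session started after an allocation is granted should succeed and be adopted into the job (the “~>” prompt denotes a non-root shell on the submission node).

~> ssh node01                 # denied: the user owns no active job on node01
~> salloc -N1 -w node01       # obtain an allocation on node01
~> ssh node01                 # permitted: the SSH session is adopted into the job
~> squeue -u $USER            # confirm the job that owns the adopted session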

Figure 1: Cluster administrators should always be able to log in to compute nodes.

Figure 2: Users without active jobs should not be able to log in to compute nodes.

Figure 3: When a user has an active job, logins to compute nodes are permitted.

Figure 4: User SSH sessions and their child processes are adopted into user jobs on compute nodes.

Summary

Implementing the SLURM PAM modules can certainly improve the inward facing security, node reliability, and accounting facets of a cluster. The configuration footprint is relatively lightweight and easy to deploy. The availability of these modules is one of the many benefits of using the SLURM cluster management stack. Enjoy!

Lawrence Kearney Lawrence has over 20 years of experience with SUSE and Red Hat Enterprise Linux solutions, and supporting and designing enterprise academic and corporate computing environments. As a member of the Technology Transfer Partners (TTP) Advisory board Lawrence actively contributes and promotes the efforts of international academic computing support missions. Currently working for the Georgia High Performance Computing Center (GACRC) at the University of Georgia supporting high performance and research computing infrastructures Lawrence publishes articles on many of his solutions as time permits.