Slurm and Hardware-Accelerated OpenGL
Until recently our lab's GPU cluster had only really been used for deep learning. There had been no need to mess about with OpenGL, or anything non-CUDA, even for Blender, since rendering could be done with Cycles, which doesn't need OpenGL. The latest version of Blender, while apparently offering a huge amount of extra functionality, needs an up-to-date version of OpenGL - 3.3 or later - which is far too new for the Mesa software renderer available on our nodes.
I would have thought such a setup would be fairly well documented online. Either I'm using the wrong search terms or they keep all of this secret - I'm not sure! From my research online, most HPC setups seem to allocate an entire node to each job. This is not practical for us, since we have up to eight GPUs in a server and only ten or so servers. I suspect in such a setup an X server would be running on every GPU continuously. Anyway, getting it working wasn't too bad, but it does require a few things to be just right.
- Despite the recent security bug with X, the setuid bit needs to be set.
- The xserver PAM configuration needs to be modified.
- A rather messy X configuration needs to be generated.
- A utility script will simplify job submission, but needs to be written!
I'll go into a little more detail on each, and dump any config files or scripts where they might be useful, in the following sections.
Implementation Details
setuid
Back in 2018 there was a bug in the way the X server handled
logging, which allowed a regular user to overwrite the shadow password
file and so remove the root password entirely. This was
easily avoided by removing the setuid bit, which is what starts the
process as the program owner - in this case root. The bug has since been
patched, so hopefully re-setting the setuid bit is not an issue.
chmod a+s /usr/bin/Xorg
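To double-check that the bit has actually been applied, a quick ls is enough - the owner execute bit should show up as an s (the exact permissions may differ slightly between distributions):
ls -l /usr/bin/Xorg
# expect something along the lines of -rwsr-sr-x, owned by root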
Note I'm not 100% sure this is still required,
having made the modifications to the xserver PAM
configuration below. I will test one day and update this post. Anyway, since
the X server has been patched, this is not currently a big deal as far as I
know.
PAM
On most Linux distributions (possibly all?), starting the X server
requires the user to have logged in from the console. This is, of
course, the standard approach to using a desktop PC. For security
reasons, this is enforced through the Pluggable Authentication Modules
(PAM) setup. Modifying the /etc/pam.d/xserver
file to the
following should do the trick.
#%PAM-1.0
auth required pam_permit.so
account required pam_permit.so
session optional pam_keyinit.so force revoke
The default PAM config has the auth entries set so that
pam_rootok is sufficient and pam_console is required. Setting
everything to pam_permit simply allows the X server to start
regardless of how the user logged in.
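For reference, the stock /etc/pam.d/xserver looks something like the following (this is from a Red Hat-style system, so treat the exact contents as an approximation for your distribution):
#%PAM-1.0
auth       sufficient   pam_rootok.so
auth       required     pam_console.so
account    required     pam_permit.so
session    optional     pam_keyinit.so force revoke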
Xorg config
For this setup to work, a separate X Layout has to be generated for each GPU. This is because GPUs are assigned on a per-job basis, and so there needs to be a separate X session for every GL job submitted.
The script below iterates over all of the NVIDIA GPUs in the machine and generates an X server layout for each one, tied to its PCI bus ID in the format the X server expects. It also generates a virtual screen and display, which is required if you don't want to have a monitor hooked up to each GPU! The mouse and keyboard inputs can be shared by all layouts.
#!/bin/bash
ID=0
lspci | grep NVIDIA | grep VGA | \
awk '{ print $1 }' | \
while read pci_id ; do
device=PCI:$((0x${pci_id%:*})):0:0
name=$(lspci -s $pci_id | sed 's/.*\[//' | sed 's/\].*//')
cat <<EOF
Section "ServerLayout"
Identifier "Layout$ID"
Screen 0 "Screen$ID"
InputDevice "Keyboard0" "CoreKeyboard"
InputDevice "Mouse0" "CorePointer"
Option "IsolateDevice" "$device"
EndSection
Section "Monitor"
Identifier "Monitor$ID"
VendorName "Unknown"
ModelName "Unknown"
HorizSync 28.0 - 33.0
VertRefresh 43.0 - 72.0
Option "DPMS"
EndSection
Section "Device"
Identifier "Device$ID"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "$name"
BusID "$device"
Screen 0
EndSection
Section "Screen"
Identifier "Screen$ID"
Device "Device$ID"
Monitor "Monitor$ID"
DefaultDepth 24
Option "UseDisplayDevice" "none"
SubSection "Display"
Depth 24
Modes "1920x1080"
EndSubSection
EndSection
EOF
ID=$((ID + 1))
done
cat <<EOF
Section "InputDevice"
Identifier "Mouse0"
Driver "mouse"
Option "Protocol" "auto"
Option "Device" "/dev/input/mice"
Option "Emulate3Buttons" "no"
Option "ZAxisMapping" "4 5"
EndSection
Section "InputDevice"
Identifier "Keyboard0"
Driver "kbd"
EndSection
EOF
This script can then be run with its output redirected to
the usual /etc/X11/xorg.conf
file.
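Assuming the generator script is saved as /usr2/sbin/xconfig (that's where we keep it, as the Ansible playbook below shows), regenerating the config and eyeballing the layout-to-bus-ID mapping looks like this:
/usr2/sbin/xconfig > /etc/X11/xorg.conf
grep -E 'Identifier "Layout|IsolateDevice' /etc/X11/xorg.conf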
Job Submission Utility Script
This script is pretty disgusting. Let me just dump it below -
hopefully the comments will do a good enough job of explaining what is
going on. The general idea is that we want to figure out which physical
GPU has been assigned to the Slurm job, since under normal
circumstances we just have a device ordinal which is offset by cgroups.
Hence, requesting a single GPU will always leave
CUDA_VISIBLE_DEVICES
set to 0. We can figure out which GPU
has been assigned by inspecting the output of nvidia-smi,
since it will only show the allocated GPU, and read off its PCI ID. This PCI ID
can then be used to look up the X server layout in the X
configuration, allowing us to start the display on the correct GPU.
Attempting to start an X server on a GPU not assigned to the Slurm job
will fail, again thanks to cgroups limiting our access. Finally, we
wrap our x11vnc server in a subshell and put it into the background so
that it can restart once a user disconnects. I may add a password option
to all of this, instead of specifying the nopw
option, but
to be honest, for our setup this is not particularly crucial. Of course,
firewall rules are added to allow our login node access to all ports
on all compute nodes, making SSH port forwarding easy for the
user.
# Make sure we are running with a GPU allocated
if [ -z "$CUDA_VISIBLE_DEVICES" ] ; then
    echo "No GPU requested. Exiting."
    exit 1
fi

# Utility function for getting user's email address.
function get_email () {
    ldapsearch -h ldap.cs.nott.ac.uk -x \
        -b "dc=cs,dc=nott,dc=ac,dc=uk" "uid=$1" | grep "mail:" | \
        head -n1 | awk -F':' '{ print $2 }'
}
# First we need to figure out the physical ID of the GPU we have been
# assigned by Slurm.
PCI_ID=$(nvidia-smi --query-gpu=gpu_bus_id --format=csv \
    | tail -n1 | cut -b10-)
PCI_ID=PCI:$((0x${PCI_ID%:*})):0:0

# We need to map this physical ID to an X11 layout which has been
# pre-mapped in the Xorg config file.
PRIMARY_GPU=$(cat /etc/X11/xorg.conf | \
    grep -B4 ".*IsolateDevice.*${PCI_ID}" | \
    grep -o "Layout[0-9]" | grep -o "[0-9]")
echo "Using GPU $PRIMARY_GPU on $(hostname)"
export DISPLAY=:${PRIMARY_GPU}
# Console redirection is required in the case of X as it will not
# start unless there is a pseudo-terminal allocated to it. The easiest
# thing to do is to redirect /dev/null into it.
echo "Starting X Server..."
X -layout Layout${PRIMARY_GPU} $DISPLAY </dev/null 2>/dev/null &
sleep 3
RESOLUTION=${RESOLUTION:-1900x950}
echo "Setting a screen resolution of $RESOLUTION"
xrandr --fb $RESOLUTION
sleep 2
# There is a chance that the assigned PORT might be unavailable, but
# this seems fairly unlikely given that there are a maximum of 8 X
# sessions on any of our GPU servers.
PORT=$(( 5900 + RANDOM / 500 ))
echo "Starting VNC server on port $PORT..."
(
    # This is nested as a subshell to allow it to reopen if the client
    # disconnects.
    while true ; do
        x11vnc -q -nopw -rfbport $PORT 2>/dev/null
    done
) &
printf "%70s\n" | tr ' ' '-'
echo "To access VNC, set up the following port forwarding:"
echo " ssh -L5900:$HOSTNAME:$PORT $USER@${SLURM_SUBMIT_HOST}"
echo "and VNC to localhost:5900"
printf "%70s\n" | tr ' ' '-'
mail -s "VNC info for $SLURM_JOB_ID" \
`get_email $USER` <<EOF
To access VNC, set up the following port forwarding:
ssh -L5900:$HOSTNAME:$PORT $USER@${SLURM_SUBMIT_HOST}
and VNC to localhost:5900
EOF
echo "Job starting now."
# libGL is not symlinked to the nvidia library, so we'll just stick it
# in the LD_PRELOAD variable to ensure it is loaded on each program
# start.
export LD_PRELOAD=/lib64/libGLX_nvidia.so.0
# A window manager is not strictly necessary but it does make the
# whole thing quite a bit easier to use without the overhead of a full
# desktop environment.
xfwm4 2>/dev/null &
Putting the Pieces Together
The configuration for all of the GPU servers can be managed fairly elegantly with a short Ansible playbook:
---
- name: setup Xorg stuff
  hosts: nodes_gpu
  tasks:
    - name: enable setuid Xorg
      shell: chmod a+s /usr/bin/Xorg
    - name: generate Xorg config
      shell: /usr2/sbin/xconfig > /etc/X11/xorg.conf
    - name: Modify pam xserver config
      copy:
        content: |
          #%PAM-1.0
          auth required pam_permit.so
          account required pam_permit.so
          session optional pam_keyinit.so force revoke
        dest: /etc/pam.d/xserver
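Assuming the playbook is saved as something like xorg-gl.yml (the file and inventory names here are just placeholders for whatever your setup uses), rolling it out to all of the GPU nodes is then a one-liner:
ansible-playbook -i inventory xorg-gl.yml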
Just for the sake of clarity, /usr2
is our NFS-mounted
file system for statically compiled binaries, source code and scripts.
Our OpenGL utility script also lives on this file system, and so an
example Slurm job submission script would look similar to this:
#!/bin/bash
#SBATCH --gres gpu
source /usr2/share/gl.sbatch # The messy script above
blender
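For interactive testing, the same utility script can be sourced from an srun session. If glxinfo happens to be installed on the node, it's a quick way to confirm that the NVIDIA driver is being used rather than a software renderer:
srun --gres=gpu:1 --pty bash
source /usr2/share/gl.sbatch
glxinfo -B   # the renderer string should name the NVIDIA GPU, not llvmpipe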
Fortunately, all this complexity (or mess, if you'd rather call it that) is easily abstracted away from the user, who can simply check their email to find the command for setting up the VNC port forwarding. In terms of managing access to GPUs, it works exactly as it does for CUDA jobs, and it allows multiple users on a single GPU node in just the same way. Finally, it does not require an X server to be running on every GPU continuously - each one only takes a small amount of video RAM, but that's memory which might be needed by a user's CUDA job.
Quite a fun thing to hack together.
Wanting to leave a comment?
Comments and feedback are welcome by email (aaron@nospam-aaronsplace.co.uk).