Distributed Learning in PyTorch
This is very old - it's almost definitely all changed now!
This is just a quick brain dump about getting PyTorch to use multiple nodes. I am assuming that the following is available:
- A cluster with Slurm scheduling.
- Two or more nodes with decent NVIDIA GPUs.
- 10GbE or faster connectivity between nodes.
- Everything is already configured for normal day-to-day training via Slurm.
Let's start then.
Quick thoughts on PyTorch
I'm still using Lua Torch for my research. I think it is a beautifully designed framework. However, it seems to be gradually slipping into something that is no longer maintained. I'm willing to try PyTorch now that it has hit its version 1 release, but I'm also going to look into Deep Learning 4 Java with a Clojure wrapper.
PyTorch Installation
I refuse to use (Ana|Mini)conda, so I installed PyTorch using pip in a Python 3.6 virtualenv. Our nodes have CUDA 8 pre-installed and are running CentOS 7.4.
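Not part of the example, but a quick sanity check along these lines confirms the install can actually see the GPUs before bothering with anything distributed:
import torch

# Confirm the build, that CUDA support is compiled in, and that all of
# the node's GPUs are visible to PyTorch.
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.device_count())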
Running something
I got this running with the ImageNet PyTorch example here. You'll need to download either the full ImageNet or Tiny ImageNet-200 from the website to run it.
I made the following changes to main.py, which allow it to use the environment variables provided by Slurm.
34c34
< parser.add_argument('-j', '--workers', default=8, type=int, metavar='N',
---
> parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
60,61c60
<
< parser.add_argument('--world-size', default=int(os.environ['SLURM_NPROCS']), type=int,
---
> parser.add_argument('--world-size', default=-1, type=int,
63c62
< parser.add_argument('--rank', default=int(os.environ['SLURM_PROCID']), type=int,
---
> parser.add_argument('--rank', default=-1, type=int,
65d63
<
86,87d83
<
<
105,111d100
< import socket
< print(socket.gethostname() + ' .... ' + str(args.rank))
<
< if args.rank > 0:
<     import time
<     time.sleep(5)
<
113c102
<
---
>
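In other words, the defaults for --world-size and --rank are pulled from the variables Slurm sets for each task. Stripped down to just the relevant lines, the idea is roughly this (a sketch, not the example's full argument list):
import os
import argparse

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
# Slurm exports SLURM_NPROCS (total task count) and SLURM_PROCID (this
# task's index), which map directly onto world size and rank.
parser.add_argument('--world-size', default=int(os.environ['SLURM_NPROCS']), type=int)
parser.add_argument('--rank', default=int(os.environ['SLURM_PROCID']), type=int)
args = parser.parse_args()

# Give rank 0 a head start to set up the rendezvous before the other
# ranks try to connect (this mirrors the sleep added in the diff above).
if args.rank > 0:
    import time
    time.sleep(5)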
I have a script called child.sh which detects some network settings and starts the Python script. This script assumes you have your fast network in a 192.168.75.0/24 subnet. Make sure you change this.
#!/bin/bash
SUBNET=192.168.75
# Tell Gloo to use the 10GbE interface (exported so the Python process sees it).
export GLOO_SOCKET_IFNAME=$(ifconfig | grep -B1 $SUBNET | \
                            head -n1 | \
                            awk -F':' '{ print $1 }')
# Figure out which node is going to be king.
MASTER_FILE=$SLURM_JOBID.master
if [ $SLURM_PROCID -eq 0 ] ; then
    IP=$(ifconfig | grep $SUBNET | awk '{ print $2 }')
    echo "Master IP addr $IP"
    MASTER="tcp://$IP:12345"
    echo $MASTER > $MASTER_FILE
else
    # Wait for the master to write its address to the shared (NFS) file.
    while [ ! -f $MASTER_FILE ] ; do
        sleep 1
    done
    MASTER=$(cat $MASTER_FILE)
fi
# Launch the ImageNet example, pointing every task at the master's address.
python main.py --arch resnet50 \
               --batch-size 64 \
               --dist-url $MASTER \
               --dist-backend gloo \
               --multiprocessing-distributed \
               /db/pszaj/imagenet/tiny-imagenet-200
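For what it's worth, the --dist-url and --dist-backend arguments end up in torch.distributed.init_process_group inside main.py. Reduced to the bare minimum it amounts to something like this (a sketch; the address below is just an example on the fast subnet, and the real script also forks one process per GPU):
import os
import torch.distributed as dist

# Every task joins the same process group: the master's tcp:// address is
# the rendezvous point, and rank/world size come from the Slurm variables.
dist.init_process_group(backend='gloo',
                        init_method='tcp://192.168.75.1:12345',  # example master address
                        world_size=int(os.environ['SLURM_NPROCS']),
                        rank=int(os.environ['SLURM_PROCID']))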
Also note that the IP address of the master node is set in --dist-url. The master's URL is distributed to the other nodes by a file in my home directory, which is the same on all machines (via NFS). The following command allocates two nodes, each with four GPUs. The Python script itself will use all four GPUs, hence we start two instances of the script, spread across two nodes.
srun --nodelist=beast,rogue -n2 -N2 --gres=gpu:4 -q unlimited child.sh
This should work but doesn't perform very well. I have found it performs much better if I run two instances per machine (and therefore each GPU is used twice) by specifying -n4. Our cluster is too busy to get away with testing this on a larger number of GPUs.