Distributed Learning in PyTorch
This is very old - it's almost definitely all changed now!
This is just a quick brain dump about getting PyTorch to use multiple nodes. I am assuming that the following is available:
- A cluster with Slurm scheduling.
- Two or more nodes with decent NVIDIA GPUs.
- 10GbE or faster connectivity between nodes.
- Everything is already configured for normal day-to-day training via Slurm.
Let's start then.
Quick thoughts on PyTorch
I'm still using Lua Torch for my research. I think it is a beautifully designed framework. However, it seems to be gradually slipping into something that is no longer maintained. I'm willing to try PyTorch now that it has hit its version 1 release, but I'm also going to look into Deep Learning 4 Java with a Clojure wrapper.
PyTorch Installation
I refuse to use (Ana|Mini)conda, so I installed PyTorch using pip in a Python 3.6 virtualenv. Our nodes have CUDA 8 pre-installed and are running CentOS 7.4.
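Not part of the example, but a quick sanity check along these lines confirms the install can actually see the GPUs before bothering with anything distributed:
import torch

# Confirm the build, that CUDA support is compiled in, and that all of
# the node's GPUs are visible to PyTorch.
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.device_count())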
Running something
I got this running with the ImageNet PyTorch example here. You'll need to download either the full ImageNet or Tiny ImageNet-200 from the website to run it.
I made the following changes to main.py, which allow it to use the environment variables provided by Slurm.
34c34
< parser.add_argument('-j', '--workers', default=8, type=int, metavar='N',
---
> parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
60,61c60
<
< parser.add_argument('--world-size', default=int(os.environ['SLURM_NPROCS']), type=int,
---
> parser.add_argument('--world-size', default=-1, type=int,
63c62
< parser.add_argument('--rank', default=int(os.environ['SLURM_PROCID']), type=int,
---
> parser.add_argument('--rank', default=-1, type=int,
65d63
<
86,87d83
<
<
105,111d100
< import socket
< print(socket.gethostname() + ' .... ' + str(args.rank))
<
< if args.rank > 0:
<     import time
<     time.sleep(5)
<
113c102
<
---
>
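In other words, the defaults for --world-size and --rank are pulled from the variables Slurm sets for each task. Stripped down to just the relevant lines, the idea is roughly this (a sketch, not the example's full argument list):
import os
import argparse

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
# Slurm exports SLURM_NPROCS (total task count) and SLURM_PROCID (this
# task's index), which map directly onto world size and rank.
parser.add_argument('--world-size', default=int(os.environ['SLURM_NPROCS']), type=int)
parser.add_argument('--rank', default=int(os.environ['SLURM_PROCID']), type=int)
args = parser.parse_args()

# Give rank 0 a head start to set up the rendezvous before the other
# ranks try to connect (this mirrors the sleep added in the diff above).
if args.rank > 0:
    import time
    time.sleep(5)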
I have a script called child.sh which detects some network settings and starts the Python script. This script assumes you have your fast network in a 192.168.75.0/24 subnet. Make sure you change this.
#!/bin/bash
SUBNET=192.168.75
# Tell Gloo to use the 10GbE interface (exported so the Python process sees it).
export GLOO_SOCKET_IFNAME=$(ifconfig | grep -B1 $SUBNET | \
                            head -n1 | \
                            awk -F':' '{ print $1 }')
# Figure out which node is going to be king.
MASTER_FILE=$SLURM_JOBID.master
if [ $SLURM_PROCID -eq 0 ] ; then
    IP=$(ifconfig | grep $SUBNET | awk '{ print $2 }')
    echo "Master IP addr $IP"
    MASTER="tcp://$IP:12345"
    echo $MASTER > $MASTER_FILE
else
    # Wait for the master to write its address to the shared (NFS) file.
    while [ ! -f $MASTER_FILE ] ; do
        sleep 1
    done
    MASTER=$(cat $MASTER_FILE)
fi
# Launch the ImageNet example, pointing every task at the master's address.
python main.py --arch resnet50 \
               --batch-size 64 \
               --dist-url $MASTER \
               --dist-backend gloo \
               --multiprocessing-distributed \
               /db/pszaj/imagenet/tiny-imagenet-200
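For what it's worth, the --dist-url and --dist-backend arguments end up in torch.distributed.init_process_group inside main.py. Reduced to the bare minimum it amounts to something like this (a sketch; the address below is just an example on the fast subnet, and the real script also forks one process per GPU):
import os
import torch.distributed as dist

# Every task joins the same process group: the master's tcp:// address is
# the rendezvous point, and rank/world size come from the Slurm variables.
dist.init_process_group(backend='gloo',
                        init_method='tcp://192.168.75.1:12345',  # example master address
                        world_size=int(os.environ['SLURM_NPROCS']),
                        rank=int(os.environ['SLURM_PROCID']))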
Also note that the IP address of the master node is set in --dist-url. The master's URL is distributed to the other nodes by a file in my home directory, which is the same on all machines (via NFS). The following command allocates two nodes, each with four GPUs. The Python script itself will use all four GPUs, hence we start two instances of the script, spread across two nodes.
srun --nodelist=beast,rogue -n2 -N2 --gres=gpu:4 -q unlimited child.sh
This should work but doesn't perform very well. I have found it performs much better if I run two instances per machine (and therefore each GPU is used twice) by specifying -n4. Our cluster is too busy to get away with testing this on a larger number of GPUs.