Monday, July 15, 2013

Enabling CUDA Multi Process Service (MPS) with multiple GPUs.

(Edited 05/09/2016)
CUDA 7 introduced MPS support for multi-GPU nodes.
CUDA_VISIBLE_DEVICES should not be used to handle GPU affinity when a CUDA-aware MPI is used, because of issues with CUDA IPC.

(Edited 10/21/13 to use MPS control daemon instead of MPS server)

CUDA 5.5 introduces an interesting new feature, called CUDA Multi Process Service (MPS), for GPUs with compute capability 3.5.

CUDA MPS, formerly known as CUDA Proxy, is a feature that allows multiple CUDA processes to share a single GPU context. NVIDIA officially supports only configurations with a single GPU, but it is possible to use it on systems with multiple GPUs by creating the MPS control daemons manually.
This post shows how to enable the feature when multiple GPUs are present in a system.
It is an unsupported but working configuration.
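
Before setting anything up, it is worth checking which GPUs are present on the node; nvidia-smi -L lists them (the Tesla K20m entries below are only an example of what the output might look like on a two-GPU node):

nvidia-smi -L
# GPU 0: Tesla K20m (UUID: GPU-...)
# GPU 1: Tesla K20m (UUID: GPU-...)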

The first thing to do is to create an MPS control daemon for each GPU.
We will use CUDA_VISIBLE_DEVICES to select each GPU and create two directories in /tmp for each MPS control daemon: one for the pipe, the other for the log. By default, CUDA MPS tries to create its log directory in /var/log, which requires the control daemon to be executed with root privileges. By selecting a log directory in /tmp (or any other directory accessible to normal users), we don't need root privileges to start the control daemons.

#!/bin/bash

# Number of GPUs with compute capability 3.5 per node
NGPUS=2

# Start the MPS control daemon for each GPU
for ((i=0; i< $NGPUS; i++))
do
    mkdir /tmp/mps_$i
    mkdir /tmp/mps_log_$i
    export CUDA_VISIBLE_DEVICES=$i
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i
    export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$i
    nvidia-cuda-mps-control -d
done
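
A quick way to verify that the control daemons came up is a plain process listing; with NGPUS=2 two nvidia-cuda-mps-control processes should appear:

# The bracket in the pattern keeps grep from matching itself
ps -ef | grep [n]vidia-cuda-mps-control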

Once the control daemons are set up, we need to point each CUDA executable we want to run to the right MPS control daemon. This is done in a non-standard way: instead of selecting the GPU with the CUDA_VISIBLE_DEVICES variable, as normally done with CUDA, we set CUDA_VISIBLE_DEVICES to 0 and select the GPU implicitly by specifying the proper CUDA_MPS_PIPE_DIRECTORY.


To start two instances of a.out on GPU 0 through MPS, we type:

export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
./a.out &
./a.out &
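
The same pattern works for the other GPU; for example, a single instance on GPU 1 goes through the second daemon's pipe:

export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_1
./a.out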

The launch is a little more complex if we are running an MPI application.
In this case, we need a way to detect the local rank of each MPI process on the node.
Open MPI exposes this information through the OMPI_COMM_WORLD_LOCAL_RANK environment variable; other MPI implementations offer similar variables.
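
As a small sketch of how the wrapper below could be made a bit more portable, the local rank can be picked up from whichever variable is defined (MVAPICH2, for instance, provides MV2_COMM_WORLD_LOCAL_RANK):

# Use Open MPI's local rank if set, otherwise fall back to MVAPICH2's
lrank=${OMPI_COMM_WORLD_LOCAL_RANK:-$MV2_COMM_WORLD_LOCAL_RANK}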

This script shows how to run local ranks 0 and 2 on GPU 0, and local ranks 1 and 3 on GPU 1.

#!/bin/bash
# Run script for the MPI application: each local rank selects an MPS pipe
export CUDA_VISIBLE_DEVICES=0
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
case ${lrank} in
0)
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
    ./executable
    ;;
1)
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_1
    ./executable
    ;;
2)
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
    ./executable
    ;;
3)
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_1
    ./executable
    ;;
esac
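
Assuming the wrapper above is saved as, say, mps_run.sh (the name is only for illustration) and made executable, the job can then be launched with Open MPI as:

mpirun -np 4 ./mps_run.sh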

Once the execution has completed, we need to shut down the MPS control daemons and clean up, especially if other users are supposed to run on the system.

#!/bin/bash

# Stop the MPS control daemon for each GPU and clean up /tmp
NGPUS=2

for ((i=0; i< $NGPUS; i++))
do
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i
    echo "quit" | nvidia-cuda-mps-control
    rm -fr /tmp/mps_$i
    rm -fr /tmp/mps_log_$i
done

The creation and clean-up could be combined in a single script.
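
As a sketch, a combined version could look like this (it reuses the same /tmp directories, the illustrative mps_run.sh wrapper name, and a four-rank Open MPI launch purely as an example):

#!/bin/bash
# Start the MPS control daemons, run the MPI job, then shut everything down
NGPUS=2

# Start one MPS control daemon per GPU
for ((i=0; i< $NGPUS; i++))
do
    mkdir /tmp/mps_$i
    mkdir /tmp/mps_log_$i
    export CUDA_VISIBLE_DEVICES=$i
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i
    export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$i
    nvidia-cuda-mps-control -d
done

# Run the MPI application through the per-rank wrapper
# (the wrapper sets CUDA_VISIBLE_DEVICES and the pipe directory itself)
mpirun -np 4 ./mps_run.sh

# Stop the daemons and clean up /tmp
for ((i=0; i< $NGPUS; i++))
do
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i
    echo "quit" | nvidia-cuda-mps-control
    rm -fr /tmp/mps_$i
    rm -fr /tmp/mps_log_$i
done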