Most research computing in the CS department and much of our instruction use GPUs with Nvidia’s Cuda software and applications such as Pytorch and TensorFlow. This page describes these technologies.
GPUs and their allocation
Most of our research and larger instructional systems (servers) have 8 Cuda-capable Nvidia GPUs. Desktop systems generally have one smaller GPU that is still Cuda-capable and thus could be used for courses in GPU programming or preliminary software development.
More than one user can share a GPU. However, GPU memory is limited (typically 12 GB on public systems), so in practice only one or two users can share a GPU at a time. To keep any one user from dominating our limited GPUs, we require that you use the Slurm Job Scheduling software to run GPU jobs on the iLab Servers. No GPUs will be available without Slurm on the iLab Servers. Slurm is not required on desktop machines.
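For reference, an interactive Slurm session requesting a single GPU might look like the sketch below; the time and memory values are illustrative, and the exact limits on our machines may differ:

srun --gres=gpu:1 --time=01:00:00 --mem=8G --pty bash
# once the job starts, you get a shell on the allocated node;
# nvidia-smi should then show exactly one GPU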
Cuda
The Cuda Toolkit is a set of Nvidia APIs designed to make it easy to write programs using GPUs. Among other things, it provides a uniform interface that applies to many different GPU models. There are alternatives, e.g., OpenCL, but Cuda is commonly used here. Cuda has bindings for Python and most other major programming languages. Work in our department uses primarily Python.
We install the latest version of Cuda on all systems with appropriate GPUs. Many users have existing code that requires older versions. You can write to us and ask us to install a previous version on your system; however, this may not be possible. For example, Ubuntu 22 supports Cuda 12, but not older versions. Please see the section on containers below for a way to use older software versions.
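To check what a given machine currently has, two standard commands are useful (nvcc appears only if the Cuda toolkit is on your PATH):

nvidia-smi       # shows the driver version and the highest Cuda version that driver supports
nvcc --version   # shows the version of the installed Cuda toolkit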
Python
In the CS department, GPU-based work is done primarily in Python. We would be happy to support users with other languages, but most tools currently installed are for Python. The most common tool is Pytorch, but we also have some usage of TensorFlow.
Pytorch, TensorFlow, and other tools are installed in Python environments. DO NOT simply type “python”: on many systems, that gives you an older version of Python without access to the GPU-related tools. Instead, use the most recent Python environment. It has most of the significant tools, and when possible we can add additional ones on request.
To run Python, see Using Python on CS Linux Machine. For most purposes, it’s sufficient to activate the environment like this: source /common/system/venv/python312/bin/activate
The general-purpose Python (venv) environments are located in /common/system/venv.
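Once an environment is activated, a quick way to confirm that the GPU tools are working is a one-line check; this assumes Pytorch is installed in the environment, as described above:

python -c "import torch; print(torch.cuda.is_available())"   # prints True if Pytorch can see a GPU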
For users who need a particular Python setup and know how to use Python, we recommend setting up your own Python environment, where you can install your own modules and avoid conflicts with the existing ones.
Adding your software
We have tried to install all the commonly used software in our Python environments. If you need more, there are two options:
- Install individual packages using pip install --user. That causes them to be installed in your home directory, in ~/.local/lib/pythonM.N. This is a reasonable approach if you only need a few packages.
- Install your own Python environment using a venv, as sketched below.
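A minimal sketch of the venv approach (~/my-venv is just an example name, and numpy an example package):

python3 -m venv ~/my-venv       # create the environment in your home directory
source ~/my-venv/bin/activate   # activate it for the current shell
pip install numpy               # install whatever modules you need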
Running Containers with Singularity
As mentioned above, you may need a specific version of Pytorch or other software that we cannot install on the central systems. For this, we recommend using a container.
A container is, in some respects, like a virtual machine: it has its own set of software. But it’s not as isolated from the underlying operating system; it shares the same users, processes, and user file systems. It is a way of delivering a specific set of software different from what is installed on the central system.
Nvidia supplies official containers with Cuda, Pytorch, TensorFlow, and many other tools. They issue new containers once a month but keep the old ones archived, so you can get most reasonable combinations of versions by running the appropriate container.
Because older versions of Cuda won’t install on Ubuntu 20, if you need Cuda 9 or 10, you’ll have to use a container. In the long run, however, we expect to use containers even for current software.
We have downloaded the Nvidia containers you’d most likely want to /common/system/nvidia-containers. In that directory, INDEX files list all available containers and the versions of major software they support. If you need a container that we haven’t provided, we can easily download it. To look at the index files, do
more /common/system/nvidia-containers/INDEX-pytorch
or
more /common/system/nvidia-containers/INDEX-tensorflow
In the table at the end, you’ll see entries like
21.05  1.15.5 or 2.4.0, Ubuntu 20, Cuda 11.3.0, Python 3.8
21.04  1.15.5 or 2.4.0, Ubuntu 20, Cuda 11.3.0, Python 3.8
21.05 is the container version (2021, May). It uses version 1.15.5 or 2.4.0 of TensorFlow, with Ubuntu 20, Cuda 11.3.0, and Python 3.8.
The versions at the left margin are the ones we have. The indented versions are available from Nvidia and can be downloaded if needed.
If you do ls /common/system/nvidia-containers, you’ll see a list of the files we have. The containers all end in .sif, and the names should match the entries in the index files. E.g., tensorflow:21.05-tf2-py3.sif is version 21.05 with TensorFlow 2.4.0 (a TensorFlow 1.15.5 container would have tf1 in its name).
To use a container, simply run it with Singularity, e.g.
singularity run --nv /common/system/nvidia-containers/tensorflow:21.05-tf2-py3.sif
Once it starts, you’ll be in a bash shell within the container, in your standard home directory. You can then develop and run programs as you normally would. For more info on Singularity, see Basic Commands, Documentation and Examples, and the Singularity Tutorial. If you prefer videos, there are also plenty on YouTube to help you familiarize yourself.
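If you want to run a single program rather than work interactively, Singularity’s exec subcommand runs one command inside the container and then exits; my_script.py below is a hypothetical script of your own:

singularity exec --nv /common/system/nvidia-containers/tensorflow:21.05-tf2-py3.sif python my_script.py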
You can install additional Python software for use inside the container in the way described above, i.e., pip install --user. Because your home directory is the same inside and outside the container, this works just as it would outside. Of course, you can also set up your own Python environment. That will also work inside the container, though you’ll have to ensure your software versions match the Cuda version the container supports.
These containers have software intended to run code. They may not have everything you want for development. (In particular, there’s no Emacs text editor.) Thus, you may want to keep a separate window open on the main machine for anything other than running your program. The user files are the same inside and outside the container. Even the processes you see with ps are the same inside and outside the container (though usernames other than yours won’t show inside the container).
Running Containers with Docker
Running a container with Singularity is the preferred method on CS machines. Please see the Computer Science Docker page if you must run your container with Docker.
Running long jobs and GPU jobs
When running a long job, please be aware that we have Limitations Enforced On CS Linux Machines. Please read the instructions there on how to work around the restrictions.
Because we have a limited number of GPUs, we require that you use the Slurm Job Scheduling software to run GPU jobs on the iLab servers.
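As a sketch, a batch submission for a GPU job could look like the script below; the job name, time, and memory values are illustrative, and my_job.py is a hypothetical program of yours:

#!/bin/bash
#SBATCH --job-name=my-gpu-job   # illustrative name
#SBATCH --gres=gpu:1            # request one GPU
#SBATCH --time=04:00:00         # illustrative time limit
#SBATCH --mem=16G               # illustrative memory request
source /common/system/venv/python312/bin/activate
python my_job.py

Save it as, e.g., my-gpu-job.sh, submit it with sbatch my-gpu-job.sh, and check its status with squeue -u $USER.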
For help with our systems or immediate assistance, visit LCSR Operator at CoRE 235 or call 848-445-2443. Otherwise, see CS HelpDesk. Don’t forget to include your NetID along with descriptions of your problem.