Info

As scientific fields such as Machine Learning, Deep Learning, Advanced Data Processing, Data Science and other AI subfields evolve at a fast pace, so must educational practices. The GPU4EDU project aims to provide new generations of students with access to modern facilities that allow them to gain up-to-date knowledge and hands-on experience in the field of AI. Concretely, the project brings the technical equipment at Tilburg University, Tilburg School of Humanities and Digital Sciences (TSHD), to a level that matches the requirements of current and future AI education: it expands the hardware fleet of the School of Humanities and Digital Sciences with high-end, multi-GPU servers that are accessible remotely and securely. These servers are dedicated to the educational needs of students in courses taught within the Department of Cognitive Science and Artificial Intelligence (CSAI).

Deep Learning (DL) is a subfield of artificial intelligence that focuses on modeling non-linear problems using artificial neural networks. DL exploits complex neural network architectures that require substantial compute power. For this workload, a graphics processing unit (GPU) is more suitable than a central processing unit (CPU), largely because a GPU has many more cores than a CPU as well as a substantial volume of dedicated memory. Currently, researchers who work on or with DL models rely heavily on GPU machines; typically these are shared servers rather than personal workstations dedicated solely to DL work. Until now, the hardware available for education did not include GPU servers, and students had to resort to Google Cloud, other limited platforms, or paid solutions. The GPU4EDU project has secured a limited number of machines for student use and is a first step towards a larger educational infrastructure.

Tutorial

  1. Account: you log in with your Tilburg University account (u-number).

  2. Access aurometalsaurus.uvt.nl:

    $ ssh [USERNAME]@aurometalsaurus.uvt.nl

where [USERNAME] should be replaced with your u-number.
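
For example, a hypothetical student with u-number u123456 would connect with:

    $ ssh u123456@aurometalsaurus.uvt.nl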

  3. Access a node:

    $ srun --nodes=1 --pty /bin/bash -l
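
A minimal sketch in case you want a GPU available during interactive testing (the --gres flag is the same one used in the job scripts below; this assumes interactive GPU allocations are permitted):

    $ srun --nodes=1 --gres=gpu:1 --pty /bin/bash -l

Type exit to leave the node and return to the login server.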

  4. Set up an Anaconda environment

    1. Check if you can run anaconda:

      $ which conda

      If conda is not found, do the following:

      $ nano ~/.bashrc

      Scroll down to the end of the file and add the following lines:

      # >>> conda initialize >>>
      # !! Contents within this block are managed by 'conda init' !!
      __conda_setup="$('/usr/local/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
      if [ $? -eq 0 ]; then
          eval "$__conda_setup"
      else
          if [ -f "/usr/local/anaconda3/etc/profile.d/conda.sh" ]; then
              . "/usr/local/anaconda3/etc/profile.d/conda.sh"
          else
              export PATH="/usr/local/anaconda3/bin:$PATH"
          fi
      fi
      unset __conda_setup
      # <<< conda initialize <<<
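
      After saving the file, reload your shell configuration (or log out and back in) so that conda becomes available:

      $ source ~/.bashrc
      $ which conda
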
    2. Create an environment: $ conda create -n [ENVNAME]

    3. Install required software:

      • Navigate to https://anaconda.org/search and search for the tools you require, then copy-paste the installation command

      • You can use conda or pip

    4. Do some small tests to ensure everything works. You can run code directly on the node you are connected to in order to verify that all the software works.
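
      A minimal sanity-check sketch, assuming you created an environment called myenv (a placeholder name), installed PyTorch in it, and are on a GPU node:

      $ conda activate myenv
      $ nvidia-smi                                                   # lists the GPUs visible on this node
      $ python -c "import torch; print(torch.cuda.is_available())"   # prints True if PyTorch can see a GPU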

  5. Create your job

    • Once you are done with testing, you can submit your job.

    • Create a shell script with a name you associate with your project, say `sentiment_analysis_1.sh`. A template is available here; a complete example is also sketched at the end of this list.

    • Add the following lines at the top of your script:

      #!/bin/bash
      #SBATCH -p GPU # partition (queue)
      #SBATCH -N 1 # number of nodes
      #SBATCH -t 0-36:00 # time (D-HH:MM)
      #SBATCH -o slurm.%N.%j.out # STDOUT
      #SBATCH -e slurm.%N.%j.err # STDERR
      #SBATCH --gres=gpu:1

      These are settings for SLURM, the job scheduler that manages the cluster. For more information on SLURM and tutorials, check the SLURM documentation.

    • If the script you create using the above header cannot access your conda environment, add the following lines after this header and before any other commands:

      if [ -f "/usr/local/anaconda3/etc/profile.d/conda.sh" ]; then
          . "/usr/local/anaconda3/etc/profile.d/conda.sh"
      else
          export PATH="/usr/local/anaconda3/bin:$PATH"
      fi
    • After that, add a command to activate the conda environment you created:

      source activate [ENVNAME]

    • Then, add the commands that will invoke your project’s execution. For example, navigate to the directory where your scripts need to be executed.

    • Always double check whether the path is correct.
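
    • Putting the pieces together, a complete sentiment_analysis_1.sh might look like the sketch below; the environment name, project directory and training script are placeholders that you should replace with your own:

      #!/bin/bash
      #SBATCH -p GPU # partition (queue)
      #SBATCH -N 1 # number of nodes
      #SBATCH -t 0-36:00 # time (D-HH:MM)
      #SBATCH -o slurm.%N.%j.out # STDOUT
      #SBATCH -e slurm.%N.%j.err # STDERR
      #SBATCH --gres=gpu:1

      # Make conda available inside the batch job (same snippet as above)
      if [ -f "/usr/local/anaconda3/etc/profile.d/conda.sh" ]; then
          . "/usr/local/anaconda3/etc/profile.d/conda.sh"
      else
          export PATH="/usr/local/anaconda3/bin:$PATH"
      fi

      # Activate the environment created earlier ([ENVNAME] is your environment name)
      source activate [ENVNAME]

      # Navigate to the (hypothetical) project directory and run the (hypothetical) training script
      cd ~/projects/sentiment_analysis
      python train.py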

  6. Submit your job

    • Once you are done with preparing your script, submit it:

      sbatch [MYSCRIPT]

      where [MYSCRIPT] can be sentiment_analysis_1.sh as per the above example.

    • While the script is running, it will generate two files: slurm.[NODE].[JOBNUMBER].err and slurm.[NODE].[JOBNUMBER].out, where [NODE] and [JOBNUMBER] are the name of the node running your script and the number of the submitted job, respectively. For example: slurm.cerulean.118.err and slurm.cerulean.118.out. The .err file contains logging information such as warnings and errors; the .out file contains the actual output that you have asked your script to generate. You can inspect both files while the job is running (see the example below).

    • Watch out! Graphics (e.g. plots) will not be displayed, but they can be saved to files.
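
    • To check on your job and follow its output from the login node, you can use standard SLURM and shell commands; for example, with the job number 118 and node cerulean from the example above:

      $ squeue -u [USERNAME]            # shows whether your job is queued or running
      $ tail -f slurm.cerulean.118.out  # follows the output file as it is written (Ctrl+C to stop)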

  7. Copy data: you can copy large or small files via scp or with FileZilla (download the client for your OS); see the example below.
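
A minimal sketch, run from your own machine, with hypothetical file names:

    $ scp mydata.zip [USERNAME]@aurometalsaurus.uvt.nl:~/         # from your machine to the server
    $ scp [USERNAME]@aurometalsaurus.uvt.nl:~/results.csv .       # from the server to your machine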

  8. Enjoy #GPU4EDU!

Practical AI seminars

Google chat (for Q&A)

Monitoring your energy consumption and carbon footprint

There are many tools to monitor resource consumption. Typically these monitor RAM and disk usage, CPU and GPU utilisation, and power draw, and then convert these measurements into more interpretable figures, e.g. CO2eq emissions. Below are some of the tools that you can use, focusing on monitoring the energy consumption of GPUs. Note that some of them use RAPL to trace the CPU power draw. RAPL requires superuser privileges (which we do not have on the GPU4EDU machines) and therefore will not work, so we advise you to focus on the GPU energy consumption.

  1. Carbon Tracker (RECOMMENDED)

You can use it in two ways:

  • From your shell console; this will both run your script and monitor the power consumption, giving you an estimate for the complete script you are running (whatever happens inside it):

$ carbontracker [SCRIPTNAME]

  • or using it within your Python code:

from carbontracker.tracker import CarbonTracker
tracker = CarbonTracker(epochs=[NUMEPOCHS], components="gpu", log_dir=[DIR_TO_STORE_OUTPUTS])

where [NUMEPOCHS] is the number of epochs or iterations that you plan to run; if you want to run your script for 100 epochs but monitor only the first 10, you can use this option to limit the monitoring. [DIR_TO_STORE_OUTPUTS] should be replaced with a directory where you want to save the outputs of the tracker, i.e. a log file that looks like this:

2025-02-25 14:09:18 - CarbonTracker: The following components were found: GPU with device(s) NVIDIA A40.
2025-02-25 14:09:20 - CarbonTracker: Current carbon intensity is 324.00 gCO2/kWh at detected location: Tilburg, North Brabant, NL.
2025-02-25 14:09:20 - CarbonTracker:
Predicted consumption for 9 epoch(s):
	Time:	0:00:17
	Energy:	0.000170765282 kWh
	CO2eq:	0.055327951403 g
	This is equivalent to:
	0.000514678618 km travelled by car
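
Building on the snippet above, a minimal sketch of how the tracker is typically wrapped around a training loop (the loop body is a placeholder for your own training code):

from carbontracker.tracker import CarbonTracker

tracker = CarbonTracker(epochs=10, components="gpu", log_dir="./carbontracker_logs")
for epoch in range(10):
    tracker.epoch_start()
    # ... one epoch of training goes here ...
    tracker.epoch_end()
tracker.stop()  # ensures the final report is written, e.g. after early termination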

You are encouraged to report this information in your theses or reports.

  2. NVIDIA-SMI. The nvidia-smi monitoring tool can be used to collect information about the power consumption of individual GPUs or all GPUs.

nvidia-smi dmon -i 0 -s mpucv -d 1 -o TD > gpu.log &

Bear in mind that this command launches a process in the background that will not terminate unless you explicitly kill it. So, what you can do is:

echo "STARTING NVIDIA DMON"
nvidia-smi dmon -i 0 -s mpucv -d 1 -o TD > gpu.log &
A="$!"
echo "STARTING EXPERIMENT"
[MY_EXPERIMENT_CALL]
kill $A
echo "DONE"

where [MY_EXPERIMENT_CALL] should be replaced with the call to the script you want to monitor. For example, if you want to call a Python script to train a model, you would have something like python train.py in place of [MY_EXPERIMENT_CALL].

  3. Impact estimator. This is a platform where you can enter the number of hours you ran experiments and the type of hardware you used, and get an estimate of the carbon emissions.

There are many more similar projects that can monitor the impact of your experiments, e.g. Experiment Impact Tracker, Power API, Intel Power Gadget (Windows only) and others.