Running Jobs on GPU Nodes#

This guide covers how to run GPU-accelerated jobs on NMTHPC’s NVIDIA H100 GPU nodes.

GPU Hardware Overview#

NMTHPC features:

2 GPU nodes
NVIDIA H100 GPUs
High-bandwidth GPU memory
NVLink or PCIe connectivity
CUDA-capable architecture

Requesting GPU Resources#

Interactive GPU Session#

Request a single GPU:

$ srun --partition=gpu --gres=gpu:1 --mem=32G --time=02:00:00 --pty bash

After allocation, verify GPU access:

$ nvidia-smi

You should see information about the allocated GPU.
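A job can also check its allocation from inside a script. The sketch below assumes that Slurm on this cluster exports CUDA_VISIBLE_DEVICES for jobs that request --gres=gpu, which is common but not confirmed for NMTHPC. The helper name allocated_gpus is illustrative.

```python
import os


def allocated_gpus(env=None):
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES.

    Assumption: Slurm exports CUDA_VISIBLE_DEVICES for --gres=gpu jobs.
    An unset or empty variable yields an empty list.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [item.strip() for item in raw.split(",") if item.strip()]
```

Calling allocated_gpus() with no arguments reads the real environment; an empty list suggests the session is not running inside a GPU allocation.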

Batch GPU Job#

Basic GPU job script:

#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --output=gpu_job_%j.out

module load cuda/12.1

# Run GPU program
./my_gpu_program

Requesting Multiple GPUs#

Single node, multiple GPUs:

#SBATCH --gres=gpu:2      # Request 2 GPUs on one node
#SBATCH --mem=64G          # More memory for multi-GPU

Multi-node GPU jobs (if supported):

#SBATCH --nodes=2
#SBATCH --gres=gpu:2       # 2 GPUs per node = 4 GPUs total
#SBATCH --ntasks-per-node=2

Note

Check with HPC support for multi-node GPU capabilities and configuration on NMTHPC.

GPU Monitoring#

Check GPU Status#

View GPU information:

$ nvidia-smi

Output includes:

GPU model and driver version
Memory usage (used/total)
GPU utilization percentage
Running processes
Temperature and power

Continuous Monitoring#

Update every 2 seconds:

$ watch -n 2 nvidia-smi

Monitor specific metrics:

$ nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 1
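If the CSV output of the command above is redirected to a file, it can be summarized afterwards. The following is a minimal sketch that assumes the usual nvidia-smi CSV layout (a header row, values with unit suffixes such as "85 %" and "30000 MiB"). The sample text is illustrative only, not a measurement from NMTHPC, and parse_gpu_log is a hypothetical helper name.

```python
import csv
import io

# Illustrative sample only; real values will differ.
SAMPLE = """timestamp, name, utilization.gpu [%], utilization.memory [%], memory.used [MiB], memory.total [MiB]
2024/01/01 12:00:00.000, NVIDIA H100, 85 %, 40 %, 30000 MiB, 81559 MiB
2024/01/01 12:00:01.000, NVIDIA H100, 15 %, 10 %, 30000 MiB, 81559 MiB
"""


def parse_gpu_log(text):
    """Parse nvidia-smi CSV text into dicts with numeric fields."""
    # skipinitialspace drops the blank that follows each comma
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for row in reader:
        rows.append({
            "timestamp": row["timestamp"],
            # Values look like "85 %" or "30000 MiB"; keep the number only
            "util_pct": int(row["utilization.gpu [%]"].split()[0]),
            "mem_used_mib": int(row["memory.used [MiB]"].split()[0]),
            "mem_total_mib": int(row["memory.total [MiB]"].split()[0]),
        })
    return rows


def mean_utilization(rows):
    """Average GPU utilization (percent) over all parsed samples."""
    return sum(r["util_pct"] for r in rows) / len(rows)
```

A low average over a long run is a hint to look at data loading or batch size before requesting more GPU time.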

Process-Level Monitoring#

Show GPU processes:

$ nvidia-smi pmon

GPU utilization over time:

$ nvidia-smi dmon

CUDA Programming#

Loading CUDA#

Load CUDA toolkit:

$ module load cuda/12.1

Verify CUDA installation:

$ nvcc --version
$ which nvcc

Compiling CUDA Code#

Simple CUDA compilation:

$ module load cuda/12.1
$ nvcc -o my_program my_program.cu

With optimization:

$ nvcc -O3 -arch=sm_90 -o my_program my_program.cu

Note

H100 GPUs use compute capability 9.0 (sm_90). Check CUDA documentation for the exact architecture flag.
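The relationship between compute capability and the -arch value is mechanical, as this small sketch shows. The helper name arch_flag is hypothetical; in PyTorch, torch.cuda.get_device_capability() returns a (major, minor) tuple that could be passed to it.

```python
def arch_flag(major, minor):
    """Build the nvcc -arch value for a compute capability.

    Example: compute capability 9.0 maps to "sm_90".
    """
    return f"sm_{major}{minor}"
```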

CUDA Job Script Example#

#!/bin/bash
#SBATCH --job-name=cuda_test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=01:00:00

module load cuda/12.1

# Compile
nvcc -o vector_add vector_add.cu

# Run
./vector_add

# Show GPU state after the run (the finished program no longer appears in the process list)
nvidia-smi

Deep Learning Frameworks#

PyTorch#

Interactive PyTorch session:

$ srun --partition=gpu --gres=gpu:1 --mem=32G --time=02:00:00 --pty bash
$ module load cuda/12.1
$ module load python/3.11
$ python
>>> import torch
>>> print(torch.cuda.is_available())
True
>>> print(torch.cuda.device_count())
1
>>> print(torch.cuda.get_device_name(0))
NVIDIA H100

PyTorch batch job:

#!/bin/bash
#SBATCH --job-name=pytorch_train
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00
#SBATCH --output=pytorch_%j.out

module load cuda/12.1
module load python/3.11

# Or use conda environment
# module load anaconda3
# source activate pytorch_env

python train.py --epochs 100 --batch-size 64

TensorFlow#

TensorFlow job script:

#!/bin/bash
#SBATCH --job-name=tensorflow_train
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH --time=48:00:00

module load cuda/12.1
module load python/3.11

# Verify GPU availability
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# Run training
python train_model.py

Multi-GPU Training#

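PyTorch DataParallel (simple to adopt; the PyTorch documentation recommends DistributedDataParallel for better performance):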

import torch
import torch.nn as nn

# Model
model = MyModel()

# Use all available GPUs
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)

model = model.cuda()

SLURM script for multi-GPU:

#!/bin/bash
#SBATCH --job-name=multi_gpu
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=48:00:00

module load cuda/12.1
module load python/3.11

python train_multi_gpu.py

Optimizing GPU Usage#

Check GPU Utilization#

While your job runs, SSH to the compute node and check:

$ nvidia-smi

Look for:

GPU-Util: Should be high (>80%) for compute-bound tasks
Memory-Usage: Ensure you’re not exceeding GPU memory
Processes: Verify your process is using the GPU

Common Issues and Solutions#

Low GPU utilization (<30%):

Possible causes:

CPU bottleneck (increase --cpus-per-task)
I/O bottleneck (optimize data loading)
Small batch size (increase batch size)
Data transfer overhead (use pinned memory, prefetching)

Out of GPU memory:

Solutions:

Reduce batch size
Use gradient accumulation
Enable mixed precision training
Use gradient checkpointing
Request multiple GPUs and distribute the model

GPU not being used:

Check:

Your code actually uses the GPU (check with nvidia-smi while it runs)
The CUDA module is loaded
A GPU-enabled build of your software is loaded
Your code detects the GPU correctly

GPU Job Arrays#

Run multiple GPU jobs as an array:

#!/bin/bash
#SBATCH --job-name=gpu_array
# The logs/ directory must exist before submission; Slurm does not create it
#SBATCH --output=logs/gpu_%A_%a.out
#SBATCH --array=1-10
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=08:00:00

module load cuda/12.1
module load python/3.11

# Each task trains with different hyperparameters
CONFIG="config_${SLURM_ARRAY_TASK_ID}.yaml"

python train.py --config "$CONFIG" --gpu 0

See Using SLURM Job Arrays for more on job arrays.
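The same mapping from task ID to config file can be done inside the training script instead of the shell. This is a sketch only: config_for_task is a hypothetical helper, and the config_N.yaml naming simply mirrors the job script above.

```python
import os
from pathlib import Path


def config_for_task(env=None, config_dir="."):
    """Return the config path for this array task, e.g. config_3.yaml.

    Raises RuntimeError when SLURM_ARRAY_TASK_ID is missing, which
    usually means the script is not running as part of a job array.
    """
    env = os.environ if env is None else env
    task_id = env.get("SLURM_ARRAY_TASK_ID")
    if task_id is None:
        raise RuntimeError("SLURM_ARRAY_TASK_ID is not set; not an array task?")
    return Path(config_dir) / f"config_{int(task_id)}.yaml"
```

Failing early when the variable is missing avoids silently training every task with the same configuration.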

Best Practices#

Resource Requests#

CPU cores: Request enough CPUs for data preprocessing

#SBATCH --cpus-per-task=8  # For data loading, preprocessing

Typically use 4-8 CPUs per GPU.

Memory: Request sufficient system RAM

#SBATCH --mem=64G  # System RAM, not GPU memory

GPU memory is fixed by hardware and doesn’t need to be requested.

Time limits: GPU time is precious

Test with short time limits first
Request realistic time + 20% buffer
Don’t request maximum time “just in case”
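The "realistic time + 20% buffer" rule is simple arithmetic. The sketch below turns a runtime estimate in whole minutes into a --time value; slurm_time_limit is a hypothetical helper, and the 20% default is the figure suggested above.

```python
def slurm_time_limit(estimated_minutes, buffer_pct=20):
    """Format an estimated runtime plus a buffer as a Slurm --time value.

    Both arguments are integers. The result is rounded up to a whole
    minute and uses the D-HH:MM:SS form once it reaches a full day.
    """
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    # Integer ceiling division avoids floating-point rounding surprises
    total = (estimated_minutes * (100 + buffer_pct) + 99) // 100
    days, rem = divmod(total, 24 * 60)
    hours, minutes = divmod(rem, 60)
    clock = f"{hours:02d}:{minutes:02d}:00"
    return f"{days}-{clock}" if days else clock
```

For example, a run measured at 10 hours (600 minutes) becomes a 12:00:00 request, matching the time limit used in the basic job script.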

Code Optimization#

1. Batch size: Maximize GPU memory usage

# Increase batch size until GPU memory is ~90% full
batch_size = 64  # Tune this

2. Data loading: Don’t bottleneck on CPU

# PyTorch example
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=8,  # Match --cpus-per-task
    pin_memory=True  # Faster GPU transfer
)

3. Minimize CPU-GPU transfers:

# Keep data on GPU when possible
data = data.cuda()
# Reuse GPU buffers

4. Use built-in GPU operations:

# Good: GPU-optimized
torch.matmul(a, b)

# Bad: copies both tensors to the CPU and multiplies them there
numpy.matmul(a.cpu().numpy(), b.cpu().numpy())
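For the data loading tip above (item 2), the worker count does not have to be hard-coded. Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given, so a script can derive num_workers from it. In this sketch the helper name dataloader_workers and the reserve parameter are assumptions for illustration; the default matches the guide's advice to use all requested CPUs.

```python
import os


def dataloader_workers(env=None, reserve=0):
    """Pick a DataLoader num_workers value from the Slurm CPU allocation.

    reserve: cores to leave free for the main training process.
    Returns 0 (load data in the main process) when the variable is
    unset, for example when running outside Slurm.
    """
    env = os.environ if env is None else env
    cpus = int(env.get("SLURM_CPUS_PER_TASK", "0"))
    return max(cpus - reserve, 0)
```

The result can be passed directly as num_workers=dataloader_workers() so the script follows whatever the job requested.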

Monitoring During Development#

Add GPU logging to your code:

import torch

print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")

# During training
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

Troubleshooting#

GPU Not Detected#

Check CUDA is loaded:

$ module list
$ echo $CUDA_HOME

Check GPU allocation:

$ nvidia-smi

If no GPU is shown, you are either not on a GPU node or no GPU was allocated to your job.

CUDA Out of Memory#

Error: RuntimeError: CUDA out of memory

Solutions:

Reduce batch size:
```
batch_size = 32  # Reduce from 64
```
Release cached but unused GPU memory (PyTorch); this does not free memory held by live tensors:
```
torch.cuda.empty_cache()
```

Use gradient accumulation:

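accumulation_steps = 4  # Gradients from 4 batches are combined per update
for i, (data, target) in enumerate(dataloader):
    output = model(data)
    # Scale the loss so the accumulated gradient is an average over the group
    loss = criterion(output, target) / accumulation_steps
    loss.backward()  # Gradients add up across calls until zero_grad()

    # Update the weights once per group of accumulation_steps batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()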

Request multiple GPUs and distribute the model
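One detail of the gradient accumulation loop above: when the number of batches is not a multiple of accumulation_steps, the last few batches never trigger optimizer.step(). The framework-free sketch below shows the update schedule with a final flush added. The function names are hypothetical, and the loss scaling for the trailing partial group is left unadjusted for simplicity.

```python
def optimizer_step_indices(num_batches, accumulation_steps):
    """Return the batch indices after which optimizer.step() should run.

    Steps every accumulation_steps batches, plus once after the final
    batch so gradients from a trailing partial group are not discarded.
    """
    steps = []
    for i in range(num_batches):
        is_group_end = (i + 1) % accumulation_steps == 0
        is_last_batch = i + 1 == num_batches
        if is_group_end or is_last_batch:
            steps.append(i)
    return steps


def effective_batch_size(batch_size, accumulation_steps):
    """Number of samples contributing to each full optimizer update."""
    return batch_size * accumulation_steps
```

With 10 batches and accumulation_steps = 4, updates happen after batches 3, 7, and 9 (zero-based); a per-step batch size of 16 then behaves like a batch of 64 for memory purposes while using the memory of 16.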

Slow Training#

Check GPU utilization:

$ nvidia-smi dmon

If GPU utilization is below 80%:

Increase batch size

Add more data loading workers:

#SBATCH --cpus-per-task=16  # More CPUs for data loading

Profile your code to find bottlenecks, for example with the PyTorch profiler:

import torch

# model and input_batch are assumed to be defined and already on the GPU
with torch.profiler.profile() as prof:
    model(input_batch)
print(prof.key_averages().table())

Job Pending Long Time#

GPU nodes are in high demand, so jobs may wait in the queue for some time.

Check wait reason:

$ squeue -u $USER -o "%.18i %.30j %.20R"

Reduce wait time:

Request fewer GPUs
Request shorter time
Submit during off-peak hours
Use job arrays with the % limiter to cap concurrent tasks (for example, --array=1-10%2 runs at most two tasks at once)

Example: Complete GPU Training Workflow#

#!/bin/bash
#SBATCH --job-name=model_training
# The logs/ directory must exist before submission; Slurm does not create it
#SBATCH --output=logs/train_%j.out
#SBATCH --error=logs/train_%j.err
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=you@nmt.edu

# Exit on error
set -e

# Print job info
echo "Job started on $(hostname) at $(date)"
echo "Job ID: $SLURM_JOB_ID"
nvidia-smi

# Load modules
module purge
module load cuda/12.1
module load python/3.11

# Activate conda environment (assumes the anaconda3 module shown in the PyTorch example provides conda)
module load anaconda3
source activate ml_env

# Verify GPU
python -c "import torch; print(f'GPU available: {torch.cuda.is_available()}')"

# Run training
python train.py \
    --data-dir /path/to/data \
    --output-dir results/$SLURM_JOB_ID \
    --epochs 100 \
    --batch-size 64 \
    --learning-rate 0.001 \
    --workers $SLURM_CPUS_PER_TASK

# Final GPU stats
nvidia-smi

echo "Job completed at $(date)"
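The job script above passes several flags to train.py, which this guide does not show. A minimal sketch of the argument handling such a script would need is given below. It is a hypothetical skeleton, not the actual training code; the defaults are taken from the values in the job script.

```python
import argparse


def build_parser():
    """Argument parser matching the flags passed in the job script."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    parser.add_argument("--data-dir", required=True, help="Input dataset location")
    parser.add_argument("--output-dir", required=True, help="Where to write results")
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    # Intended for DataLoader num_workers; the script passes $SLURM_CPUS_PER_TASK
    parser.add_argument("--workers", type=int, default=0)
    return parser
```

In a real script, args = build_parser().parse_args() would run at startup, and argparse exposes the values as args.data_dir, args.batch_size, and so on.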

Additional Resources#

Anaconda

Questions?#

For questions about GPU computing on NMTHPC, contact hpc@nmthpc.atlassian.net.