Running Jobs on GPU Nodes#
This guide covers how to run GPU-accelerated jobs on NMTHPC’s NVIDIA H100 GPU nodes.
GPU Hardware Overview#
NMTHPC features:
2 GPU nodes
NVIDIA H100 GPUs
High-bandwidth GPU memory
NVLink or PCIe connectivity
CUDA-capable architecture
Requesting GPU Resources#
Interactive GPU Session#
Request a single GPU:
$ srun --partition=gpu --gres=gpu:1 --mem=32G --time=02:00:00 --pty bash
After allocation, verify GPU access:
$ nvidia-smi
You should see information about the allocated GPU.
Batch GPU Job#
Basic GPU job script:
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --output=gpu_job_%j.out
module load cuda/12.1
# Run GPU program
./my_gpu_program
Requesting Multiple GPUs#
Single node, multiple GPUs:
#SBATCH --gres=gpu:2 # Request 2 GPUs on one node
#SBATCH --mem=64G # More memory for multi-GPU
Multi-node GPU jobs (if supported):
#SBATCH --nodes=2
#SBATCH --gres=gpu:2 # 2 GPUs per node = 4 GPUs total
#SBATCH --ntasks-per-node=2
Note
Check with HPC support for multi-node GPU capabilities and configuration on NMTHPC.
GPU Monitoring#
Check GPU Status#
View GPU information:
$ nvidia-smi
Output includes:
GPU model and driver version
Memory usage (used/total)
GPU utilization percentage
Running processes
Temperature and power
Continuous Monitoring#
Update every 2 seconds:
$ watch -n 2 nvidia-smi
Monitor specific metrics:
$ nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 1
Process-Level Monitoring#
Show GPU processes:
$ nvidia-smi pmon
GPU utilization over time:
$ nvidia-smi dmon
CUDA Programming#
Loading CUDA#
Load CUDA toolkit:
$ module load cuda/12.1
Verify CUDA installation:
$ nvcc --version
$ which nvcc
Compiling CUDA Code#
Simple CUDA compilation:
$ module load cuda/12.1
$ nvcc -o my_program my_program.cu
With optimization:
$ nvcc -O3 -arch=sm_90 -o my_program my_program.cu
Note
H100 GPUs use compute capability 9.0 (sm_90). Check CUDA documentation for the exact architecture flag.
CUDA Job Script Example#
#!/bin/bash
#SBATCH --job-name=cuda_test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=01:00:00
module load cuda/12.1
# Compile
nvcc -o vector_add vector_add.cu
# Run
./vector_add
# Check GPU was used
nvidia-smi
Deep Learning Frameworks#
PyTorch#
Interactive PyTorch session:
$ srun --partition=gpu --gres=gpu:1 --mem=32G --time=02:00:00 --pty bash
$ module load cuda/12.1
$ module load python/3.11
$ python
>>> import torch
>>> print(torch.cuda.is_available())
True
>>> print(torch.cuda.device_count())
1
>>> print(torch.cuda.get_device_name(0))
NVIDIA H100
PyTorch batch job:
#!/bin/bash
#SBATCH --job-name=pytorch_train
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00
#SBATCH --output=pytorch_%j.out
module load cuda/12.1
module load python/3.11
# Or use conda environment
# module load anaconda3
# source activate pytorch_env
python train.py --epochs 100 --batch-size 64
TensorFlow#
TensorFlow job script:
#!/bin/bash
#SBATCH --job-name=tensorflow_train
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH --time=48:00:00
module load cuda/12.1
module load python/3.11
# Verify GPU availability
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Run training
python train_model.py
Multi-GPU Training#
PyTorch DataParallel:
import torch
import torch.nn as nn
# Model
model = MyModel()
# Use all available GPUs
if torch.cuda.device_count() > 1:
print(f"Using {torch.cuda.device_count()} GPUs")
model = nn.DataParallel(model)
model = model.cuda()
SLURM script for multi-GPU:
#!/bin/bash
#SBATCH --job-name=multi_gpu
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=48:00:00
module load cuda/12.1
module load python/3.11
python train_multi_gpu.py
Optimizing GPU Usage#
Check GPU Utilization#
While your job runs, SSH to the compute node and check:
$ nvidia-smi
Look for:
GPU-Util: Should be high (>80%) for compute-bound tasks
Memory-Usage: Ensure you’re not exceeding GPU memory
Processes: Verify your process is using the GPU
Common Issues and Solutions#
Low GPU utilization (<30%):
Possible causes:
CPU bottleneck (increase
--cpus-per-task)I/O bottleneck (optimize data loading)
Small batch size (increase batch size)
Data transfer overhead (use pinned memory, prefetching)
Out of GPU memory:
Solutions:
Reduce batch size
Use gradient accumulation
Enable mixed precision training
Use gradient checkpointing
Request multiple GPUs and distribute model
GPU not being used:
Check:
Code actually uses GPU (check with
nvidia-smi)CUDA is loaded
GPU-enabled version of software is loaded
Code detects GPU correctly
GPU Job Arrays#
Run multiple GPU jobs as an array:
#!/bin/bash
#SBATCH --job-name=gpu_array
#SBATCH --output=logs/gpu_%A_%a.out
#SBATCH --array=1-10
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=08:00:00
module load cuda/12.1
module load python/3.11
# Each task trains with different hyperparameters
CONFIG="config_${SLURM_ARRAY_TASK_ID}.yaml"
python train.py --config $CONFIG --gpu 0
See Using SLURM Job Arrays for more on job arrays.
Best Practices#
Resource Requests#
CPU cores: Request enough CPUs for data preprocessing
#SBATCH --cpus-per-task=8 # For data loading, preprocessing
Typically use 4-8 CPUs per GPU.
Memory: Request sufficient system RAM
#SBATCH --mem=64G # System RAM, not GPU memory
GPU memory is fixed by hardware and doesn’t need to be requested.
Time limits: GPU time is precious
Test with short time limits first
Request realistic time + 20% buffer
Don’t request maximum time “just in case”
Code Optimization#
1. Batch size: Maximize GPU memory usage
# Increase batch size until GPU memory is ~90% full
batch_size = 64 # Tune this
2. Data loading: Don’t bottleneck on CPU
# PyTorch example
dataloader = DataLoader(
dataset,
batch_size=batch_size,
num_workers=8, # Match --cpus-per-task
pin_memory=True # Faster GPU transfer
)
3. Minimize CPU-GPU transfers:
# Keep data on GPU when possible
data = data.cuda()
# Reuse GPU buffers
4. Use built-in GPU operations:
# Good: GPU-optimized
torch.matmul(a, b)
# Bad: CPU fallback
numpy.matmul(a.cpu().numpy(), b.cpu().numpy())
Monitoring During Development#
Add GPU logging to your code:
import torch
print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")
# During training
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
Troubleshooting#
GPU Not Detected#
Check CUDA is loaded:
$ module list
$ echo $CUDA_HOME
Check GPU allocation:
$ nvidia-smi
If no GPU shown, you’re not on a GPU node or GPU wasn’t allocated.
CUDA Out of Memory#
Error: RuntimeError: CUDA out of memory
Solutions:
Reduce batch size:
batch_size = 32 # Reduce from 64
Clear GPU cache (PyTorch):
torch.cuda.empty_cache()
Use gradient accumulation:
accumulation_steps = 4 for i, (data, target) in enumerate(dataloader): output = model(data) loss = criterion(output, target) / accumulation_steps loss.backward() if (i + 1) % accumulation_steps == 0: optimizer.step() optimizer.zero_grad()
Request multiple GPUs and distribute model
Slow Training#
Check GPU utilization:
$ nvidia-smi dmon
If GPU util < 80%:
Increase batch size
Add more data loading workers:
#SBATCH --cpus-per-task=16 # More CPUs for data loadingProfile your code to find bottlenecks
Use profilers:
# PyTorch profiler
with torch.profiler.profile() as prof:
model(input_batch)
print(prof.key_averages().table())
Job Pending Long Time#
GPU nodes are in high demand:
Check wait reason:
$ squeue -u $USER -o "%.18i %.30j %.20R"
Reduce wait time:
Request fewer GPUs
Request shorter time
Submit during off-peak hours
Use job arrays with
%limiter
Example: Complete GPU Training Workflow#
#!/bin/bash
#SBATCH --job-name=model_training
#SBATCH --output=logs/train_%j.out
#SBATCH --error=logs/train_%j.err
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=you@nmt.edu
# Exit on error
set -e
# Print job info
echo "Job started on $(hostname) at $(date)"
echo "Job ID: $SLURM_JOB_ID"
nvidia-smi
# Load modules
module purge
module load cuda/12.1
module load python/3.11
# Activate conda environment
source activate ml_env
# Verify GPU
python -c "import torch; print(f'GPU available: {torch.cuda.is_available()}')"
# Run training
python train.py \
--data-dir /path/to/data \
--output-dir results/$SLURM_JOB_ID \
--epochs 100 \
--batch-size 64 \
--learning-rate 0.001 \
--workers $SLURM_CPUS_PER_TASK
# Final GPU stats
nvidia-smi
echo "Job completed at $(date)"
Additional Resources#
Questions?#
For questions about GPU computing on NMTHPC, contact hpc@nmthpc.atlassian.net.