Monitoring Resources#

This guide covers the commands for monitoring your jobs, cluster and node status, storage, and job efficiency on NMTHPC.

Monitoring Your Jobs#

Viewing Job Queue#

See all your jobs:

$ squeue -u $USER

Output columns:

  • JOBID: Unique job identifier

  • PARTITION: Queue/partition where job is running

  • NAME: Job name

  • USER: Your username

  • ST: Job state (R=Running, PD=Pending, CG=Completing)

  • TIME: Time job has been running

  • NODES: Number of nodes allocated

  • NODELIST(REASON): Nodes allocated to the job or, for a pending job, the reason it has not started

See specific job:

$ squeue -j JOBID

See all jobs (all users):

$ squeue

Customize output format:

$ squeue -u $USER -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
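The same format mechanism can feed a script. As a minimal sketch, the helper below tallies jobs by state. The name count_states is illustrative and not part of SLURM. It assumes one state name per line on standard input, which is what squeue prints with -h (no header) and -o "%T" (full state name).

```shell
# count_states: hypothetical helper, not a SLURM command.
# Reads one job state per line on stdin and prints STATE=COUNT pairs.
count_states() {
    sort | uniq -c | awk '{ printf "%s=%s\n", $2, $1 }'
}

# Sample input standing in for: squeue -u "$USER" -h -o "%T"
printf 'RUNNING\nPENDING\nRUNNING\n' | count_states
```

On the cluster, the sample printf line would be replaced by the squeue command shown in the comment.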

Job Status Codes#

Common status codes in the ST column:

| Code | State | Meaning |
|------|-------|---------|
| PD | Pending | Job is waiting for resources |
| R | Running | Job is currently running |
| CG | Completing | Job is in the process of completing |
| CD | Completed | Job has completed successfully |
| F | Failed | Job terminated with non-zero exit code |
| TO | Timeout | Job reached time limit |
| OOM | Out of Memory | Job exceeded memory limit |
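If you post-process squeue output in a script, the codes above can be expanded with a small lookup. This is only a sketch: describe_state is a hypothetical name, and it covers just the codes listed in this section.

```shell
# describe_state: hypothetical helper that expands the short ST codes
# listed in this section. Any other code is reported as unknown.
describe_state() {
    case "$1" in
        PD)  echo "Pending: waiting for resources" ;;
        R)   echo "Running" ;;
        CG)  echo "Completing" ;;
        CD)  echo "Completed successfully" ;;
        F)   echo "Failed: non-zero exit code" ;;
        TO)  echo "Timeout: reached time limit" ;;
        OOM) echo "Out of memory: exceeded memory limit" ;;
        *)   echo "Unknown code: $1" ;;
    esac
}

describe_state TO
```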

Detailed Job Information#

Current job details:

$ scontrol show job JOBID

This shows comprehensive information including:

  • Requested resources

  • Allocated nodes

  • Time limits

  • Working directory

  • Job state reason

To see why a job is pending:

$ squeue -u $USER -o "%.18i %.9P %.30j %.8u %.2t %.10M %.10l %.6D %.20R"

For pending jobs, the last column, NODELIST(REASON), shows why the job has not started:

  • Resources: Waiting for requested resources

  • Priority: Other jobs have higher priority

  • QOSMaxCpuPerUserLimit: You’ve hit your CPU limit

  • QOSMaxJobsPerUserLimit: You’ve hit your job limit

Job History and Accounting#

View your jobs since midnight today (the sacct default time window):

$ sacct

Specific job details:

$ sacct -j JOBID --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize

Useful sacct formats:

$ sacct -j JOBID --format=JobID,JobName,Start,End,Elapsed,State,ExitCode
$ sacct -j JOBID --format=JobID,MaxRSS,MaxVMSize,AveCPU,TotalCPU

View jobs from specific date range:

$ sacct --starttime=2024-01-01 --endtime=2024-01-31 -u $USER

Format fields:

  • MaxRSS: Peak resident memory (RAM) used by any task in a job step

  • MaxVMSize: Peak virtual memory size of any task in a job step

  • Elapsed: Total runtime

  • TotalCPU: Total CPU time used

  • State: Final state of the job
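sacct reports MaxRSS once per job step, usually with a unit suffix such as K, M, or G. The sketch below finds the largest value across steps and prints it in MiB. The helper name peak_rss_mib is hypothetical, and it assumes the suffixes are powers of 1024 and that a value with no suffix is in bytes.

```shell
# peak_rss_mib: hypothetical helper, not part of SLURM.
# Reads MaxRSS values (one per line, e.g. "2048K", "1.5G", or blank)
# and prints the largest one converted to MiB.
# Assumption: K/M/G/T are powers of 1024; no suffix means bytes.
peak_rss_mib() {
    awk '
        /^[0-9.]+[KMGT]?$/ {
            unit = substr($0, length($0))
            val = $0 + 0
            if (unit == "K") val /= 1024
            else if (unit == "G") val *= 1024
            else if (unit == "T") val *= 1024 * 1024
            else if (unit != "M") val /= 1024 * 1024
            if (val > max) max = val
        }
        END { printf "%.1f\n", max }
    '
}

# Sample input standing in for: sacct -j JOBID -n -P --format=MaxRSS
printf '2048K\n\n1.5G\n512M\n' | peak_rss_mib
```

Blank lines, which sacct prints for entries without memory data, are ignored.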

GPU Monitoring#

Check GPU status:

$ nvidia-smi

Continuous monitoring (updates every 2 seconds):

$ watch -n 2 nvidia-smi

GPU utilization details:

$ nvidia-smi --query-gpu=timestamp,gpu_name,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free --format=csv

Monitor per-process GPU usage:

$ nvidia-smi pmon
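The CSV query output is easy to filter in a script, for example to spot GPUs that a job has been allocated but is not using. The following is a sketch only: idle_gpus is a hypothetical helper, the 10% default threshold is an arbitrary assumption, and the input is assumed to be "index, utilization" lines as printed with --format=csv,noheader,nounits.

```shell
# idle_gpus: hypothetical helper.
# Reads "index, utilization" lines and prints the index of every GPU
# whose utilization is below the threshold (default 10, an assumption).
idle_gpus() {
    awk -F', *' -v t="${1:-10}" '$2 + 0 < t { print $1 }'
}

# Sample input standing in for:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
printf '0, 95\n1, 3\n2, 0\n' | idle_gpus 10
```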

System-Wide Monitoring#

Cluster Status#

View partition information:

$ sinfo

Output columns:

  • PARTITION: Queue name

  • AVAIL: Partition availability

  • TIMELIMIT: Maximum job time

  • NODES: Number of nodes

  • STATE: Node states

  • NODELIST: List of nodes

Detailed node information:

$ sinfo -Nel

Show only idle nodes:

$ sinfo -t idle
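To get a quick count of idle nodes per partition, sinfo output can be summed with awk. This is a minimal sketch: idle_by_partition is a hypothetical name, the partition names in the sample are made up, and the input is assumed to be "partition count" lines as printed by sinfo with -h and -o "%P %D".

```shell
# idle_by_partition: hypothetical helper.
# Reads "PARTITION NODES" lines and prints the total node count per
# partition. The trailing "*" that marks the default partition is removed.
idle_by_partition() {
    awk '{ sub(/\*$/, "", $1); n[$1] += $2 }
         END { for (p in n) print p, n[p] }' | sort
}

# Sample input standing in for: sinfo -h -t idle -o "%P %D"
# (partition names "general" and "gpu" are placeholders)
printf 'general* 4\ngpu 2\ngeneral* 1\n' | idle_by_partition
```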

Node Details#

Information about specific node:

$ scontrol show node nodeXXX

Partition configuration, including its node list:

$ scontrol show partition partitionname

Storage Monitoring#

Check Disk Quota#

Your quota usage:

$ quota -s

Detailed filesystem usage:

$ df -h

Disk Usage#

Home directory usage:

$ du -sh ~/

Usage by subdirectory:

$ du -h --max-depth=1 ~/ | sort -h

Find large files:

$ find ~/ -type f -size +1G -exec ls -lh {} \;

Largest files and directories under your home directory:

$ du -ah ~/ | sort -rh | head -20
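A script can also warn you before a filesystem fills up. The sketch below reads POSIX-format df output (df -P) and prints every mount point at or above a usage threshold. The name fs_over_threshold and the 90% default are assumptions, and the mount points in the sample are placeholders.

```shell
# fs_over_threshold: hypothetical helper.
# Reads `df -P` output on stdin and prints "MOUNTPOINT USE%" for each
# filesystem whose usage is at or above the threshold (default 90).
# Assumption: mount points contain no spaces.
fs_over_threshold() {
    awk -v t="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= t) print $6, use "%"
    }'
}

# Sample input standing in for: df -P
printf '%s\n' \
    'Filesystem 1024-blocks Used Available Capacity Mounted on' \
    '/dev/sda1 100 95 5 95% /home' \
    '/dev/sdb1 100 10 90 10% /scratch' | fs_over_threshold 90
```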

Job Output Files#

SLURM Output Files#

By default, SLURM writes one output file per job, in the directory the job was submitted from:

  • slurm-JOBID.out: Combined stdout and stderr

Custom output files (in your job script):

#SBATCH --output=job_%j.out
#SBATCH --error=job_%j.err

View output while job runs:

$ tail -f slurm-JOBID.out

Search output for errors:

$ grep -i error slurm-JOBID.out
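Searching for "error" alone can miss failures that are reported in other words. As a sketch, the helper below searches for several common failure strings at once and prints matching lines with their line numbers. The name scan_log and the pattern list are assumptions, so adjust them to what your application actually prints.

```shell
# scan_log: hypothetical helper.
# Prints lines (with line numbers) that match common failure strings.
# The pattern list is an assumption, not an exhaustive set.
scan_log() {
    grep -inE 'error|killed|traceback|segmentation fault|out of memory' "$1"
}

# Usage: scan_log slurm-JOBID.out
```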

Email Notifications#

To get email alerts about job status, add to your SLURM script:

#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your.email@nmt.edu

Mail types:

  • BEGIN: Job starts

  • END: Job completes

  • FAIL: Job fails

  • ALL: All events

Job Efficiency#

Analyzing Resource Usage#

After a job completes:

$ seff JOBID

This shows:

  • CPU efficiency

  • Memory efficiency

  • Time used vs. requested

Example output:

Job ID: 12345
Cluster: nmthpc
User/Group: username/group
State: COMPLETED
Cores: 4
CPU Utilized: 03:45:22
CPU Efficiency: 93.84%
Memory Utilized: 12.5 GB
Memory Efficiency: 62.50%
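CPU efficiency is generally the CPU time used divided by the core time allocated (cores × wall-clock time). The sketch below reproduces that arithmetic so you can check a figure by hand. The function names are hypothetical, and times are assumed to be in HH:MM:SS or D-HH:MM:SS form.

```shell
# to_seconds: hypothetical helper; converts HH:MM:SS or D-HH:MM:SS to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         print $1 * 3600 + $2 * 60 + $3
    }'
}

# cpu_efficiency CPU_TIME CORES WALLTIME
# Prints CPU_TIME / (CORES * WALLTIME) as a percentage.
cpu_efficiency() {
    used=$(to_seconds "$1")
    total=$(( $2 * $(to_seconds "$3") ))
    awk -v u="$used" -v t="$total" 'BEGIN { printf "%.2f\n", 100 * u / t }'
}

# 2 hours of CPU time on 4 cores over 1 hour of wall time
cpu_efficiency 02:00:00 4 01:00:00
```

In this example only half of the allocated core time was used, so the result is 50.00. Memory efficiency follows the same pattern: memory used divided by memory requested.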

Improving Efficiency#

Based on seff output:

Low CPU efficiency:

  • Your code may not be parallelized properly

  • You requested more cores than your code can use

  • Reduce core count or improve parallelization

Low memory efficiency:

  • You requested too much memory

  • Reduce --mem in future jobs to save resources

High memory usage:

  • Increase --mem to avoid out-of-memory errors

  • Consider using high-memory nodes if needed
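One simple way to pick a new --mem value is to take the peak usage from a previous run and add some headroom. The sketch below does this with integer arithmetic, rounding up. The name suggest_mem_mb and the 20% default headroom are illustrative assumptions, not a site policy.

```shell
# suggest_mem_mb PEAK_MB [HEADROOM_PCT]
# Hypothetical helper: prints PEAK_MB plus headroom, rounded up to a
# whole MB. The 20% default is an assumption, not an NMTHPC rule.
suggest_mem_mb() {
    peak_mb=$1
    headroom_pct=${2:-20}
    echo $(( (peak_mb * (100 + headroom_pct) + 99) / 100 ))
}

# A job that peaked at 12800 MB
suggest_mem_mb 12800
```

For a 12800 MB peak this prints 15360, which could then be used as --mem=15360M in the next submission.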

Custom Monitoring Scripts#

Simple Status Check Script#

monitor_job.sh:

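#!/bin/bash
# Usage: ./monitor_job.sh JOBID
if [ -z "$1" ]; then
    echo "Usage: $0 JOBID" >&2
    exit 1
fi
JOBID=$1

echo "Job Status:"
squeue -j "$JOBID"

echo -e "\nResource Usage:"
sacct -j "$JOBID" --format=JobID,JobName,Elapsed,State,MaxRSS,MaxVMSize

echo -e "\nOutput tail:"
# Assumes the default output file name in the current directory
if [ -f "slurm-${JOBID}.out" ]; then
    tail -n 20 "slurm-${JOBID}.out"
else
    echo "slurm-${JOBID}.out not found in $(pwd)"
fi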

Usage:

$ chmod +x monitor_job.sh
$ ./monitor_job.sh JOBID

Watch Multiple Jobs#

watch_jobs.sh:

#!/bin/bash
watch -n 10 "squeue -u $USER -o '%.10i %.12j %.8T %.10M %.4D %R'"

Troubleshooting#

Job Not Starting#

Check why job is pending:

$ squeue -u $USER -o "%.18i %.30j %.20R"

Common reasons:

  • Insufficient resources available

  • Requested more resources than partition has

  • Hit job or resource limits

  • Partition down for maintenance

Job Killed Unexpectedly#

Check job status:

$ sacct -j JOBID --format=JobID,State,ExitCode,DerivedExitCode

Common causes:

  • OUT_OF_MEMORY: Job used more memory than it requested

  • TIMEOUT: Job exceeded time limit

  • FAILED: Non-zero exit code

  • CANCELLED: You or admin cancelled it
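The ExitCode field printed by sacct has the form exit_status:signal. As a minimal sketch, the hypothetical helper decode_exit splits that pair and says which part is non-zero, which helps distinguish a program that exited with an error from one that was killed by a signal.

```shell
# decode_exit: hypothetical helper.
# Takes an sacct ExitCode value ("exit_status:signal") and describes it.
decode_exit() {
    code=${1%%:*}
    sig=${1##*:}
    if [ "$sig" -ne 0 ]; then
        echo "terminated by signal $sig"
    elif [ "$code" -ne 0 ]; then
        echo "exited with status $code"
    else
        echo "exited normally"
    fi
}

decode_exit 0:9
```

Signal 9 is SIGKILL, which is commonly seen when a job is cancelled or killed for exceeding a limit.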

High Memory Usage#

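Identify memory-intensive jobs. MaxRSS is recorded on job steps such as JOBID.batch rather than on the job line itself, so only the extern steps are filtered out:

$ sacct -S 2024-01-01 -u $USER --format=JobID,JobName,MaxRSS,State | grep -v "extern"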

To monitor memory in real time, SSH to the compute node running your job and run:

$ watch -n 5 'ps aux --sort=-%mem | head -20'

Best Practices#

Before Submitting Large Jobs#

  1. Test with small jobs first

  2. Check available resources: sinfo

  3. Request appropriate resources based on testing

  4. Set realistic time limits with buffer

During Job Execution#

  1. Monitor initial progress: Check job starts correctly

  2. Verify resource usage: Ensure not wasting resources

  3. Watch for errors: Check output files periodically

After Job Completion#

  1. Check efficiency: Use seff JOBID

  2. Review output: Look for errors or warnings

  3. Adjust future jobs: Based on actual usage

  4. Clean up: Remove unnecessary output files

Useful Aliases#

Add to your ~/.bashrc:

# Job monitoring aliases
alias myq='squeue -u $USER'
alias myjobs='sacct --format=JobID,JobName,Partition,State,Elapsed,MaxRSS'
alias nodes='sinfo -Nel'
alias checkquota='quota -s'

Reload:

$ source ~/.bashrc

Summary of Key Commands#

| Task | Command |
|------|---------|
| View your jobs | squeue -u $USER |
| Job details | scontrol show job JOBID |
| Job history | sacct -j JOBID |
| Job efficiency | seff JOBID |
| GPU monitoring | nvidia-smi |
| Disk quota | quota -s |
| Cluster status | sinfo |
| Disk usage | du -sh ~/ |

Questions?#

For questions about monitoring jobs or resource usage, contact hpc@nmthpc.atlassian.net.