Monitoring Resources#
This guide covers tools and commands for monitoring your jobs, system resources, and resource usage on NMTHPC.
Monitoring Your Jobs#
Viewing Job Queue#
See all your jobs:
$ squeue -u $USER
Output columns:
JOBID: Unique job identifierPARTITION: Queue/partition where job is runningNAME: Job nameUSER: Your usernameST: Job state (R=Running, PD=Pending, CG=Completing)TIME: Time job has been runningNODES: Number of nodes allocatedNODELIST: Which nodes are allocated
See specific job:
$ squeue -j JOBID
See all jobs (all users):
$ squeue
Customize output format:
$ squeue -u $USER -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
Job Status Codes#
Common status codes in the ST column:
Code |
State |
Meaning |
|---|---|---|
PD |
Pending |
Job is waiting for resources |
R |
Running |
Job is currently running |
CG |
Completing |
Job is in the process of completing |
CD |
Completed |
Job has completed successfully |
F |
Failed |
Job terminated with non-zero exit code |
TO |
Timeout |
Job reached time limit |
OOM |
Out of Memory |
Job exceeded memory limit |
Detailed Job Information#
Current job details:
$ scontrol show job JOBID
This shows comprehensive information including:
Requested resources
Allocated nodes
Time limits
Working directory
Job state reason
Why is my job pending?:
$ squeue -u $USER -o "%.18i %.9P %.30j %.8u %.2t %.10M %.10l %.6D %.20R"
The REASON column shows why a job is pending:
Resources: Waiting for requested resourcesPriority: Other jobs have higher priorityQOSMaxCpuPerUserLimit: You’ve hit your CPU limitQOSMaxJobsPerUserLimit: You’ve hit your job limit
Job History and Accounting#
View completed jobs (last 24 hours):
$ sacct
Specific job details:
$ sacct -j JOBID --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize
Useful sacct formats:
$ sacct -j JOBID --format=JobID,JobName,Start,End,Elapsed,State,ExitCode
$ sacct -j JOBID --format=JobID,MaxRSS,MaxVMSize,AveCPU,TotalCPU
View jobs from specific date range:
$ sacct --starttime=2024-01-01 --endtime=2024-01-31 -u $USER
Format codes:
MaxRSS: Maximum memory usedMaxVMSize: Maximum virtual memoryElapsed: Total runtimeTotalCPU: Total CPU time usedState: Final state of the job
GPU Monitoring#
Check GPU status:
$ nvidia-smi
Continuous monitoring (updates every 2 seconds):
$ watch -n 2 nvidia-smi
GPU utilization details:
$ nvidia-smi --query-gpu=timestamp,gpu_name,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free --format=csv
Monitor specific process:
$ nvidia-smi pmon
System-Wide Monitoring#
Cluster Status#
View partition information:
$ sinfo
Output columns:
PARTITION: Queue nameAVAIL: Partition availabilityTIMELIMIT: Maximum job timeNODES: Number of nodesSTATE: Node statesNODELIST: List of nodes
Detailed node information:
$ sinfo -Nel
Show only available nodes:
$ sinfo -t idle
Node Details#
Information about specific node:
$ scontrol show node nodeXXX
All nodes in partition:
$ scontrol show partition partitionname
Storage Monitoring#
Check Disk Quota#
Your quota usage:
$ quota -s
Detailed filesystem usage:
$ df -h
Disk Usage#
Home directory usage:
$ du -sh ~/
Usage by subdirectory:
$ du -h --max-depth=1 ~/ | sort -h
Find large files:
$ find ~/ -type f -size +1G -exec ls -lh {} \;
Largest files in directory:
$ du -ah ~/ | sort -rh | head -20
Job Output Files#
SLURM Output Files#
By default, SLURM creates output files:
slurm-JOBID.out: Combined stdout and stderr
Custom output files (in your job script):
#SBATCH --output=job_%j.out
#SBATCH --error=job_%j.err
View output while job runs:
$ tail -f slurm-JOBID.out
Search output for errors:
$ grep -i error slurm-JOBID.out
Email Notifications#
Get email alerts about job status:
Add to your SLURM script:
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your.email@nmt.edu
Mail types:
BEGIN: Job startsEND: Job completesFAIL: Job failsALL: All events
Job Efficiency#
Analyzing Resource Usage#
After job completes:
$ seff JOBID
This shows:
CPU efficiency
Memory efficiency
Time used vs. requested
Example output:
Job ID: 12345
Cluster: nmthpc
User/Group: username/group
State: COMPLETED
Cores: 4
CPU Utilized: 03:45:22
CPU Efficiency: 93.84%
Memory Utilized: 12.5 GB
Memory Efficiency: 62.50%
Improving Efficiency#
Based on seff output:
Low CPU efficiency:
Your code may not be parallelized properly
You requested more cores than your code can use
Reduce core count or improve parallelization
Low memory efficiency:
You requested too much memory
Reduce
--memin future jobs to save resources
High memory usage:
Increase
--memto avoid out-of-memory errorsConsider using high-memory nodes if needed
Custom Monitoring Scripts#
Simple Status Check Script#
monitor_job.sh:
#!/bin/bash
JOBID=$1
echo "Job Status:"
squeue -j $JOBID
echo -e "\nResource Usage:"
sacct -j $JOBID --format=JobID,JobName,Elapsed,State,MaxRSS,MaxVMSize
echo -e "\nOutput tail:"
if [ -f "slurm-${JOBID}.out" ]; then
tail -20 slurm-${JOBID}.out
fi
Usage:
$ chmod +x monitor_job.sh
$ ./monitor_job.sh JOBID
Watch Multiple Jobs#
watch_jobs.sh:
#!/bin/bash
watch -n 10 "squeue -u $USER -o '%.10i %.12j %.8T %.10M %.4D %R'"
Troubleshooting#
Job Not Starting#
Check why job is pending:
$ squeue -u $USER -o "%.18i %.30j %.20R"
Common reasons:
Insufficient resources available
Requested more resources than partition has
Hit job or resource limits
Partition down for maintenance
Job Killed Unexpectedly#
Check job status:
$ sacct -j JOBID --format=JobID,State,ExitCode,DerivedExitCode
Common causes:
OUT_OF_MEMORY: Requested insufficient memoryTIMEOUT: Job exceeded time limitFAILED: Non-zero exit codeCANCELLED: You or admin cancelled it
High Memory Usage#
Identify memory-intensive jobs:
$ sacct -S 2024-01-01 -u $USER --format=JobID,JobName,MaxRSS,State | grep -v "batch\|extern"
Monitor memory in real-time:
SSH to compute node and run:
$ watch -n 5 'ps aux --sort=-%mem | head -20'
Best Practices#
Before Submitting Large Jobs#
Test with small jobs first
Check available resources:
sinfoRequest appropriate resources based on testing
Set realistic time limits with buffer
During Job Execution#
Monitor initial progress: Check job starts correctly
Verify resource usage: Ensure not wasting resources
Watch for errors: Check output files periodically
After Job Completion#
Check efficiency: Use
seff JOBIDReview output: Look for errors or warnings
Adjust future jobs: Based on actual usage
Clean up: Remove unnecessary output files
Useful Aliases#
Add to your ~/.bashrc:
# Job monitoring aliases
alias myq='squeue -u $USER'
alias myjobs='sacct --format=JobID,JobName,Partition,State,Elapsed,MaxRSS'
alias nodes='sinfo -Nel'
alias checkquota='quota -s'
Reload:
$ source ~/.bashrc
Summary of Key Commands#
Task |
Command |
|---|---|
View your jobs |
|
Job details |
|
Job history |
|
Job efficiency |
|
GPU monitoring |
|
Disk quota |
|
Cluster status |
|
Disk usage |
|
Questions?#
For questions about monitoring jobs or resource usage, contact hpc@nmthpc.atlassian.net.