Running Batch Jobs and SLURM Basics#

Batch jobs are the primary way to run computational work on NMTHPC. This guide covers SLURM batch job basics and best practices.

What are Batch Jobs?#

Batch jobs:

  • Run without user interaction

  • Are queued and run when resources are available

  • Can run overnight, over weekends, or for extended periods

  • Are defined by shell scripts with SLURM directives

  • Are ideal for production computational work

Basic SLURM Workflow#

  1. Write a job script with resource requests and commands

  2. Submit the job to the queue with sbatch

  3. Monitor the job with squeue and sacct

  4. Review the output when the job completes

Your First Batch Job#

Simple Job Script#

Create a file named simple_job.sh:

#!/bin/bash
#SBATCH --job-name=my_first_job
#SBATCH --output=output_%j.txt
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mem=1G

# Print some information
echo "Job started on $(hostname) at $(date)"
echo "Job ID: $SLURM_JOB_ID"
echo "Running on $SLURM_NNODES node(s)"

# Do some work
sleep 30
echo "Hello from NMTHPC!"

# Finish
echo "Job finished at $(date)"

Submit the Job#

$ sbatch simple_job.sh
Submitted batch job 12345
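
When a later step needs the job ID, for example to build the name of the output file, it can be taken from the message sbatch prints. The following is a minimal sketch that assumes the message has the form shown above; job_id_from_message is a hypothetical helper name, not a SLURM command.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from sbatch's
# "Submitted batch job <id>" message.
job_id_from_message() {
    local msg=$1
    # Remove everything up to and including the last space.
    echo "${msg##* }"
}

# Example using the message shown above (no cluster access needed):
job_id=$(job_id_from_message "Submitted batch job 12345")
echo "output_${job_id}.txt"
```

With the message above, this prints output_12345.txt, which matches the --output=output_%j.txt pattern used in simple_job.sh.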

Check Job Status#

$ squeue -u $USER

View Output#

After the job completes:

$ cat output_12345.txt

SLURM Script Components#

The Shebang#

#!/bin/bash

The shebang must be the first line of the script. It specifies the shell interpreter that runs the script.

SLURM Directives#

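Lines starting with #SBATCH are SLURM directives. They must appear before the first executable command in the script; sbatch ignores any directive after that point: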

#SBATCH --option=value

Common directives:

#SBATCH --job-name=my_job           # Job name
#SBATCH --output=output_%j.txt      # Output file (%j = job ID)
#SBATCH --error=error_%j.txt        # Error file (separate from output)
#SBATCH --ntasks=4                  # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1           # CPUs per task (for threading)
#SBATCH --nodes=1                   # Number of nodes
#SBATCH --mem=16G                   # Memory per node
#SBATCH --time=04:00:00             # Time limit (HH:MM:SS)
#SBATCH --partition=cpu.std         # Partition name
#SBATCH --mail-type=END,FAIL        # Email notifications
#SBATCH --mail-user=you@nmt.edu     # Your email
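
sbatch stops reading #SBATCH lines at the first executable command, so a directive placed lower in the script is silently ignored. The sketch below flags such lines before submission. The function name late_directives is hypothetical, and the check is deliberately simple: it treats any line that is not blank and not a comment as a command.

```shell
#!/bin/bash
# Hypothetical helper: print every #SBATCH line that appears after the
# first executable command in a job script.
late_directives() {
    local script=$1 line seen_command=0
    local blank_or_comment='^[[:space:]]*(#.*)?$'
    while IFS= read -r line || [ -n "$line" ]; do
        if [[ $line == '#SBATCH'* ]]; then
            if [ "$seen_command" -eq 1 ]; then
                echo "ignored: $line"
            fi
        elif [[ $line =~ $blank_or_comment ]]; then
            :   # blank lines and comments (including the shebang)
        else
            seen_command=1
        fi
    done < "$script"
}

# Demo on a throwaway script with one misplaced directive.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
#SBATCH --ntasks=1
module load python/3.11
#SBATCH --mem=16G
EOF
late_directives "$demo"
rm -f "$demo"
```

The demo prints "ignored: #SBATCH --mem=16G" because that line follows the module load command.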

Environment and Commands#

After directives, add your actual work:

# Load modules
module load python/3.11

# Set environment variables
# (SLURM_CPUS_PER_TASK is only set when --cpus-per-task is requested, so default to 1)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# Change to working directory (usually already there)
cd "$SLURM_SUBMIT_DIR"

# Run your program
python my_script.py

Resource Requests#

CPU Resources#

Single task:

#SBATCH --ntasks=1

Multiple tasks (for MPI):

#SBATCH --ntasks=16          # 16 MPI processes

Multithreaded application:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8    # 8 threads

Hybrid MPI + OpenMP:

#SBATCH --ntasks=4           # 4 MPI processes
#SBATCH --cpus-per-task=4    # 4 threads each = 16 total CPUs

Memory Requests#

Total memory per node:

#SBATCH --mem=32G            # 32 GB on each allocated node

Memory per CPU:

#SBATCH --mem-per-cpu=4G     # 4 GB per CPU

Tip

Use --mem for most cases. Use --mem-per-cpu when memory needs scale with CPU count. The two options are mutually exclusive, so set only one of them.
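
To see what a --mem-per-cpu request adds up to, multiply it by the number of CPUs the job asks for. The sketch below does that arithmetic; total_mem_gb is a hypothetical helper, and it assumes whole-gigabyte values.

```shell
#!/bin/bash
# Hypothetical helper: total memory implied by --mem-per-cpu,
# computed as tasks x CPUs per task x GB per CPU.
total_mem_gb() {
    local ntasks=$1 cpus_per_task=$2 mem_per_cpu_gb=$3
    echo $(( ntasks * cpus_per_task * mem_per_cpu_gb ))
}

total_mem_gb 16 1 2   # 16 MPI tasks at 2G per CPU            -> prints 32
total_mem_gb 4 4 4    # 4 tasks x 4 threads at 4G per CPU     -> prints 64
```

The total is for the whole job. If the tasks are spread over several nodes, each node holds only its share.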

Time Limits#

Format: Days-Hours:Minutes:Seconds (D-HH:MM:SS). The days field is optional.

#SBATCH --time=01:00:00      # 1 hour
#SBATCH --time=04:30:00      # 4.5 hours
#SBATCH --time=2-00:00:00    # 2 days
#SBATCH --time=7-12:00:00    # 7.5 days

Warning

Always specify a realistic time limit. Jobs are killed when time expires. Add ~20% buffer to your estimate.
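
The buffer can be computed rather than guessed. The sketch below adds a percentage to a runtime estimate given in seconds and prints the result in the D-HH:MM:SS form that --time accepts. The function name slurm_time_with_buffer and the 20% default are illustrative choices, not part of SLURM.

```shell
#!/bin/bash
# Hypothetical helper: add a percentage buffer to a runtime estimate
# (in seconds) and print it as [D-]HH:MM:SS for use with --time.
slurm_time_with_buffer() {
    local estimate_s=$1 buffer_pct=${2:-20}
    local total=$(( estimate_s * (100 + buffer_pct) / 100 ))
    local days=$(( total / 86400 ))
    local hours=$(( total % 86400 / 3600 ))
    local mins=$(( total % 3600 / 60 ))
    local secs=$(( total % 60 ))
    if [ "$days" -gt 0 ]; then
        printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$mins" "$secs"
    else
        printf '%02d:%02d:%02d\n' "$hours" "$mins" "$secs"
    fi
}

slurm_time_with_buffer 10800      # 3-hour estimate -> 03:36:00
slurm_time_with_buffer 172800     # 2-day estimate  -> 2-09:36:00
```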

Partition Selection#

#SBATCH --partition=cpu.std   # Use standard partition
#SBATCH --partition=gpu       # Use GPU partition

See Partitions and QOS for available partitions.

Complete Job Script Examples#

Serial Job (Single CPU)#

#!/bin/bash
#SBATCH --job-name=serial_job
#SBATCH --output=serial_%j.out
#SBATCH --ntasks=1
#SBATCH --mem=8G
#SBATCH --time=02:00:00

module load python/3.11

python my_script.py input.txt output.txt

Parallel Job (MPI)#

#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --output=mpi_%j.out
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=2G
#SBATCH --time=04:00:00

module load gcc/11.2.0
module load openmpi/4.1.4

mpirun ./my_mpi_program

Multithreaded Job (OpenMP)#

#!/bin/bash
#SBATCH --job-name=openmp_job
#SBATCH --output=openmp_%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=03:00:00

module load gcc/11.2.0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./my_openmp_program

Python Job with Anaconda#

#!/bin/bash
#SBATCH --job-name=python_analysis
#SBATCH --output=analysis_%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=06:00:00

module load anaconda3

source activate myenv

python analysis.py --input data.csv --output results.txt

See Anaconda for more Python examples.

Job Submission and Management#

Submitting Jobs#

Submit a script:

$ sbatch myjob.sh
Submitted batch job 12345

Submit with command-line options (these override the matching #SBATCH directives in the script):

$ sbatch --time=01:00:00 --mem=8G myjob.sh

Submit from a specific directory (the job starts in the directory where sbatch was run):

$ cd /path/to/workdir
$ sbatch myjob.sh

Monitoring Jobs#

View your jobs:

$ squeue -u $USER

Detailed job information:

$ scontrol show job 12345

Job history:

$ sacct -j 12345

See Monitoring Resources for a comprehensive monitoring guide.

Canceling Jobs#

Cancel a specific job:

$ scancel 12345

Cancel all your jobs:

$ scancel -u $USER

Cancel jobs by name:

$ scancel --name=myjob

Job Dependencies#

Running Jobs in Sequence#

Job 2 starts only after Job 1 completes successfully:

$ JOB1=$(sbatch --parsable job1.sh)
$ sbatch --dependency=afterok:$JOB1 job2.sh

Dependency types:

  • after:jobid: Start after jobid starts

  • afterok:jobid: Start after jobid completes successfully

  • afternotok:jobid: Start after jobid ends in a failed state (for example a non-zero exit code or a timeout)

  • afterany:jobid: Start after jobid completes (any exit status)

Example workflow:

$ JOB1=$(sbatch --parsable preprocessing.sh)
$ JOB2=$(sbatch --parsable --dependency=afterok:$JOB1 analysis.sh)
$ JOB3=$(sbatch --parsable --dependency=afterok:$JOB2 postprocessing.sh)
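
For longer pipelines the same pattern can be written as a loop. The sketch below submits each script with an afterok dependency on the previous one. submit_chain is a hypothetical helper name; it relies only on sbatch --parsable and --dependency as used above.

```shell
#!/bin/bash
# Hypothetical helper: submit the given job scripts in order, each one
# depending (afterok) on the job submitted just before it.
submit_chain() {
    local prev="" script jobid
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency="afterok:$prev" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the ID.
        prev=${jobid%%;*}
        echo "$script -> $prev"
    done
}

# Usage on the cluster:
#   submit_chain preprocessing.sh analysis.sh postprocessing.sh
```

If one submission fails, the function stops and returns a non-zero status. Jobs that were already queued stay queued.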

Job Environment Variables#

SLURM sets useful environment variables in your job:

$SLURM_JOB_ID              # Job ID
$SLURM_JOB_NAME            # Job name
$SLURM_SUBMIT_DIR          # Directory where sbatch was run
$SLURM_NTASKS              # Number of tasks
$SLURM_CPUS_PER_TASK       # CPUs per task
$SLURM_NNODES              # Number of nodes
$SLURM_NODELIST            # List of allocated nodes
$SLURM_ARRAY_TASK_ID       # Array task ID (for job arrays)

Using in scripts:

echo "Running on $SLURM_NNODES nodes"
echo "Output directory: $SLURM_SUBMIT_DIR/output_$SLURM_JOB_ID"
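
These variables exist only inside a job. A script that should also run on a workstation for a quick dry run can fall back to local defaults, as in the sketch below. The placeholder value "local" and the lowercase variable names are arbitrary choices for illustration.

```shell
#!/bin/bash
# Fall back to local values when the SLURM variables are unset, so the
# same script can be dry-run outside a job.
job_id=${SLURM_JOB_ID:-local}
submit_dir=${SLURM_SUBMIT_DIR:-$PWD}
ntasks=${SLURM_NTASKS:-1}

# Create a per-job output directory next to the submission directory.
out_dir="$submit_dir/output_$job_id"
mkdir -p "$out_dir"
echo "Job $job_id: $ntasks task(s), writing results to $out_dir"
```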

Output and Error Files#

Default Behavior#

By default, SLURM creates:

slurm-JOBID.out  # Combined stdout and stderr

Custom Output Files#

Separate output and error:

#SBATCH --output=output_%j.txt
#SBATCH --error=error_%j.txt

Include job name and ID:

#SBATCH --output=%x_%j.out    # %x = job name, %j = job ID

Output to subdirectory:

#SBATCH --output=logs/job_%j.out

SLURM does not create missing directories for output files, so create the directory before submitting:

$ mkdir -p logs
$ sbatch myjob.sh
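
Creating the directory can also be automated by reading the paths from the job script itself. The sketch below is one way to do that under simple assumptions: directives use the --output=PATH or --error=PATH form, paths are relative to the current directory, and the directory part contains no % patterns. ensure_output_dirs is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: create the directories named in a job script's
# --output= and --error= directives.
ensure_output_dirs() {
    local script=$1 path dir
    # Extract the path from each matching directive, one per line.
    grep -E '^#SBATCH[[:space:]]+--(output|error)=' "$script" |
    sed -E 's/^#SBATCH[[:space:]]+--(output|error)=([^[:space:]]+).*/\2/' |
    while read -r path; do
        dir=$(dirname "$path")
        mkdir -p "$dir"
    done
}

# Usage before submitting:
#   ensure_output_dirs myjob.sh && sbatch myjob.sh
```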

Viewing Output While Job Runs#

Follow output in real-time:

$ tail -f slurm-12345.out

Last 50 lines:

$ tail -n 50 slurm-12345.out

Best Practices#

Resource Requests#

1. Test first with small jobs:

# Test job
#SBATCH --time=00:30:00
#SBATCH --mem=4G

2. Request what you need + buffer:

# If test used 12 GB and 3 hours:
#SBATCH --mem=16G          # 33% buffer
#SBATCH --time=04:00:00    # 33% buffer

3. Don’t over-request. Asking for more than the job needs:

  • Wastes resources that other jobs could use

  • Lowers your priority

  • Lengthens your queue times

Job Organization#

1. Use descriptive names:

#SBATCH --job-name=protein_fold_1a2b

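2. Organize output files. Create the directories from the shell before submitting (a command placed above the #SBATCH lines in the script would cause the directives to be ignored), then point the directive at them:

$ mkdir -p logs results

#SBATCH --output=logs/%x_%j.out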

3. Document your scripts:

#!/bin/bash
# Purpose: Analyze RNA-seq data from experiment XYZ
# Author: Your Name
# Date: 2024-01-15

Error Handling#

1. Check for errors in the script:

#!/bin/bash
#SBATCH directives...

# Exit on any error
set -e

# Check if input file exists
if [ ! -f input.dat ]; then
    echo "Error: input.dat not found" >&2
    exit 1
fi

# Run program
./my_program input.dat

2. Validate output:

# Check if output was created
if [ ! -f output.dat ]; then
    echo "Error: output.dat not created" >&2
    exit 1
fi
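
The input and output checks above can share one small function. The sketch below also treats an empty file as a failure, which catches programs that create the output file and then crash. require_file is a hypothetical helper name, and the file names in the usage comment are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file exists and is non-empty;
# otherwise report the problem on stderr.
require_file() {
    local path=$1
    if [ ! -s "$path" ]; then
        echo "Error: $path is missing or empty" >&2
        return 1
    fi
}

# Typical use inside a job script:
#   require_file input.dat  || exit 1
#   ./my_program input.dat
#   require_file output.dat || exit 1
```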

Email Notifications#

Get notified of job events:

#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=yourname@nmt.edu

Notification types:

  • BEGIN: Job starts

  • END: Job finishes

  • FAIL: Job fails

  • ALL: All events

  • TIME_LIMIT: Job reaches time limit

Troubleshooting#

Job Fails Immediately#

Check output file:

$ cat slurm-12345.out

Common causes:

  • Module not loaded

  • Input file not found

  • Wrong path to executable

  • Typo in script

Job Killed - Out of Memory#

Check with sacct:

$ sacct -j 12345 --format=JobID,State,MaxRSS,ReqMem

If MaxRSS is close to ReqMem, or State shows OUT_OF_MEMORY, the job ran out of memory.

Solution: Increase the memory request

#SBATCH --mem=32G  # Increased from 16G
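
The new request can be derived from the MaxRSS figure. The sketch below assumes MaxRSS is reported in kilobytes with a trailing K (for example 12582912K) and rounds up to whole gigabytes after adding a buffer. suggest_mem_gb and its 33% default are illustrative, not part of sacct.

```shell
#!/bin/bash
# Hypothetical helper: convert a MaxRSS value in kilobytes (such as
# "12582912K") into a --mem request in whole GB, rounded up, with a
# percentage buffer added.
suggest_mem_gb() {
    local maxrss_kb=${1%K} buffer_pct=${2:-33}
    local kb_per_gb=1048576
    local scaled=$(( maxrss_kb * (100 + buffer_pct) ))
    local divisor=$(( 100 * kb_per_gb ))
    # Integer ceiling division.
    echo $(( (scaled + divisor - 1) / divisor ))
}

suggest_mem_gb 12582912K    # 12 GB peak plus 33% -> prints 16
```

A result of 16 corresponds to #SBATCH --mem=16G. If sacct reports MaxRSS with a different unit suffix, convert it to kilobytes first.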

Job Killed - Time Limit#

Check time used:

$ sacct -j 12345 --format=JobID,Elapsed,Timelimit,State

Solution: If State shows TIMEOUT, increase the time limit

#SBATCH --time=08:00:00  # Increased from 4 hours

Job Pending Forever#

Check reason:

$ squeue -u $USER -o "%.18i %.30j %.20R"

Common reasons and solutions:

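  • Resources: Not enough free resources yet; wait or reduce the request

  • Priority: Other jobs are ahead of yours, often because your fair-share is low; wait

  • QOSMaxCpuPerUserLimit: You have reached the per-user CPU limit; wait for your running jobs to finish or cancel some of them

  • PartitionNodeLimit: The job requests more nodes than the partition allows; reduce the node count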

See Partitions and QOS for more information.

Advanced Topics#

Job Arrays#

For running many similar jobs, see Using SLURM Job Arrays.

GPU Jobs#

For GPU computing, see Running Jobs on GPU Nodes.

Parallel Programming#

For MPI and parallel programming:

Job Script Template#

Save this as template.sh:

#!/bin/bash
#SBATCH --job-name=CHANGEME
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00
#SBATCH --partition=cpu.std
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=you@nmt.edu

# Exit on error
set -e

# Print job information
echo "Job started on $(hostname) at $(date)"
echo "Job ID: $SLURM_JOB_ID"
echo "Working directory: $(pwd)"

# Load modules
module purge
module load python/3.11

# Your commands here
python my_script.py

# Finish
echo "Job completed at $(date)"

Create the logs directory before submitting:

$ mkdir -p logs

Summary#

Key SLURM Commands:

Task              Command
Submit job        sbatch script.sh
View queue        squeue -u $USER
Job details       scontrol show job JOBID
Job history       sacct -j JOBID
Cancel job        scancel JOBID
Job efficiency    seff JOBID

Questions?#

For questions about batch jobs or SLURM, contact hpc@nmthpc.atlassian.net.