Using SLURM Job Arrays

Using SLURM Job Arrays#

Job arrays allow you to submit and manage large numbers of similar jobs efficiently. This guide covers how to use SLURM job arrays effectively.

What are Job Arrays?#

Job arrays let you submit many similar jobs with a single script:

Benefits:

Submit hundreds or thousands of jobs with one command
Each job gets a unique task ID
Easier to manage than individual jobs
More efficient than submitting jobs one-by-one

Use cases:

Parameter sweeps
Processing multiple input files
Monte Carlo simulations
Batch processing datasets
Sensitivity analyses

Basic Job Array#

Simple Example#

Create a file named array_job.sh:

#!/bin/bash
#SBATCH --job-name=my_array
#SBATCH --output=output_%A_%a.txt
#SBATCH --array=1-10
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00

echo "This is array task $SLURM_ARRAY_TASK_ID"
echo "Array job ID: $SLURM_ARRAY_JOB_ID"

# Your work here
sleep 10
echo "Task $SLURM_ARRAY_TASK_ID completed"

Submit the Array#

$ sbatch array_job.sh
Submitted batch job 12345

This creates 10 individual jobs (tasks 1-10), each with a unique task ID.

Check Array Jobs#

$ squeue -u $USER

Output shows:

JOBID    PARTITION  NAME       USER   ST  TIME  NODES NODELIST
12345_1  standard   my_array   user   R   0:05  1     node01
12345_2  standard   my_array   user   R   0:05  1     node02
12345_3  standard   my_array   user   PD  0:00  1     (Resources)
...

Array Specification#

Array Ranges#

Sequential range:

#SBATCH --array=1-100        # Tasks 1 through 100

With step size:

#SBATCH --array=1-100:2      # Tasks 1,3,5,...,99
#SBATCH --array=0-50:5       # Tasks 0,5,10,...,50

Specific values:

#SBATCH --array=1,5,10,15    # Only these specific tasks

Combined:

#SBATCH --array=1-10,15,20,25-30  # Multiple ranges and values

Limiting Concurrent Tasks#

Limit how many tasks run simultaneously:

#SBATCH --array=1-1000%20    # Run max 20 tasks at a time

This submits 1000 tasks but only runs 20 concurrently.

Tip

Use % to limit concurrent tasks when submitting very large arrays. This prevents overwhelming the scheduler and shares resources fairly.

Array Environment Variables#

SLURM provides special variables for array jobs:

$SLURM_ARRAY_JOB_ID       # Main job ID (same for all tasks)
$SLURM_ARRAY_TASK_ID      # Unique task ID within array
$SLURM_ARRAY_TASK_COUNT   # Total number of tasks
$SLURM_ARRAY_TASK_MIN     # First task ID
$SLURM_ARRAY_TASK_MAX     # Last task ID

Output Files#

Using Array Variables in Filenames#

In SLURM directives:

%A = array job ID
%a = array task ID
%j = job ID (includes task ID for arrays)

Example:

#SBATCH --output=logs/job_%A_task_%a.out
#SBATCH --error=logs/job_%A_task_%a.err

For array job 12345, this creates:

logs/job_12345_task_1.out
logs/job_12345_task_2.out
...

Practical Examples#

Processing Multiple Files#

Scenario: Process 100 data files named data_001.txt through data_100.txt

#!/bin/bash
#SBATCH --job-name=process_files
#SBATCH --output=logs/process_%A_%a.out
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --mem=8G
#SBATCH --time=02:00:00

module load python/3.11

# Create input filename with zero-padding
INPUT_FILE=$(printf "data_%03d.txt" $SLURM_ARRAY_TASK_ID)
OUTPUT_FILE=$(printf "results_%03d.txt" $SLURM_ARRAY_TASK_ID)

# Process the file
python process.py --input $INPUT_FILE --output $OUTPUT_FILE

echo "Processed $INPUT_FILE"

Using a File List#

Scenario: Process files listed in a text file

File list (files.txt):

/data/sample_A.dat
/data/sample_B.dat
/data/sample_C.dat
...

Job script:

#!/bin/bash
#SBATCH --job-name=process_list
#SBATCH --output=logs/job_%A_%a.out
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00

module load python/3.11

# Get the filename from line number equal to task ID
INPUT_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" files.txt)

# Process the file
python analyze.py $INPUT_FILE

echo "Processed $INPUT_FILE"

Parameter Sweep#

Scenario: Test different parameter combinations

#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --output=logs/sweep_%A_%a.out
#SBATCH --array=1-27
#SBATCH --ntasks=1
#SBATCH --mem=8G
#SBATCH --time=04:00:00

module load python/3.11

# Define parameter arrays
ALPHAS=(0.1 0.5 1.0)
BETAS=(1.0 10.0 100.0)
GAMMAS=(0.01 0.1 1.0)

# Calculate indices (3x3x3 = 27 combinations)
NUM_BETA=3
NUM_GAMMA=3

IDX=$((SLURM_ARRAY_TASK_ID - 1))
ALPHA_IDX=$((IDX / (NUM_BETA * NUM_GAMMA)))
BETA_IDX=$(((IDX / NUM_GAMMA) % NUM_BETA))
GAMMA_IDX=$((IDX % NUM_GAMMA))

ALPHA=${ALPHAS[$ALPHA_IDX]}
BETA=${BETAS[$BETA_IDX]}
GAMMA=${GAMMAS[$GAMMA_IDX]}

echo "Running with alpha=$ALPHA, beta=$BETA, gamma=$GAMMA"

# Run simulation with these parameters
python simulate.py --alpha $ALPHA --beta $BETA --gamma $GAMMA \
    --output results_${ALPHA}_${BETA}_${GAMMA}.dat

Monte Carlo Simulations#

Scenario: Run 1000 independent simulations with different random seeds

#!/bin/bash
#SBATCH --job-name=monte_carlo
#SBATCH --output=logs/mc_%A_%a.out
#SBATCH --array=1-1000%50      # Max 50 concurrent
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=02:00:00

module load python/3.11

# Use task ID as random seed
SEED=$SLURM_ARRAY_TASK_ID

# Run simulation
python monte_carlo.py --seed $SEED --output mc_$SEED.dat

echo "Simulation $SEED completed"

GPU Array Jobs#

Scenario: Train multiple models on GPUs

#!/bin/bash
#SBATCH --job-name=train_models
#SBATCH --output=logs/train_%A_%a.out
#SBATCH --array=1-10
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=12:00:00

module load cuda/12.1
module load python/3.11

# Each task trains a different model configuration
CONFIG_FILE="config_${SLURM_ARRAY_TASK_ID}.yaml"

# Train model
python train.py --config $CONFIG_FILE --gpu 0 \
    --output models/model_$SLURM_ARRAY_TASK_ID.pth

echo "Model $SLURM_ARRAY_TASK_ID trained"

See Running Jobs on GPU Nodes for more GPU information.

Managing Array Jobs#

Viewing Array Jobs#

All your array tasks:

$ squeue -u $USER

Specific array job:

$ squeue -j 12345

Summary view:

$ squeue -u $USER -r  # -r shows array tasks as ranges

Canceling Array Jobs#

Cancel entire array:

$ scancel 12345

Cancel specific task:

$ scancel 12345_5  # Cancel only task 5

Cancel range of tasks:

$ scancel 12345_[10-20]  # Cancel tasks 10-20

Array Job Status#

Check completion:

$ sacct -j 12345

Summary of task states:

$ sacct -j 12345 --format=JobID,State | grep -c COMPLETED
$ sacct -j 12345 --format=JobID,State | grep -c FAILED

Post-Processing Array Results#

Combining Results#

Merge all output files:

#!/bin/bash
# Combine results from array job
for i in {1..100}; do
    cat results_$i.txt >> combined_results.txt
done

Using Python:

import glob
import pandas as pd

# Read all result files
all_files = glob.glob("results_*.csv")
df_list = [pd.read_csv(f) for f in sorted(all_files)]

# Combine into single DataFrame
combined = pd.concat(df_list, ignore_index=True)
combined.to_csv("combined_results.csv", index=False)

Checking for Missing Tasks#

Script to check completion:

#!/bin/bash
ARRAY_ID=12345
NUM_TASKS=100

for i in $(seq 1 $NUM_TASKS); do
    if [ ! -f "results_${i}.txt" ]; then
        echo "Missing task $i"
    fi
done

Troubleshooting#

Some Tasks Failed#

Find failed tasks:

$ sacct -j 12345 --format=JobID,State | grep FAILED

Rerun specific failed tasks:

#SBATCH --array=5,12,27,33  # Only failed task IDs

Out of Memory on Some Tasks#

Check memory usage:

$ sacct -j 12345 --format=JobID,MaxRSS,ReqMem,State

Solutions:

Increase memory for all tasks (wastes resources)
Identify high-memory tasks and run separately
Modify code to use less memory

Tasks Taking Too Long#

Check task times:

$ sacct -j 12345 --format=JobID,Elapsed,State | sort -k2 -h