Using SLURM Job Arrays#

Job arrays allow you to submit and manage large numbers of similar jobs efficiently. This guide covers how to use SLURM job arrays effectively.

What are Job Arrays?#

Job arrays let you submit many similar jobs with a single script.

Benefits:

  • Submit hundreds or thousands of jobs with one command

  • Each job gets a unique task ID

  • Easier to manage than individual jobs

  • More efficient than submitting jobs one-by-one

Use cases:

  • Parameter sweeps

  • Processing multiple input files

  • Monte Carlo simulations

  • Batch processing datasets

  • Sensitivity analyses

Basic Job Array#

Simple Example#

Create a file named array_job.sh:

#!/bin/bash
#SBATCH --job-name=my_array
#SBATCH --output=output_%A_%a.txt
#SBATCH --array=1-10
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00

echo "This is array task $SLURM_ARRAY_TASK_ID"
echo "Array job ID: $SLURM_ARRAY_JOB_ID"

# Your work here
sleep 10
echo "Task $SLURM_ARRAY_TASK_ID completed"

Submit the Array#

$ sbatch array_job.sh
Submitted batch job 12345

This creates 10 individual jobs (tasks 1-10), each with a unique task ID.

Check Array Jobs#

$ squeue -u $USER

Output shows:

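JOBID         PARTITION  NAME       USER   ST  TIME  NODES NODELIST
12345_1       standard   my_array   user   R   0:05  1     node01
12345_2       standard   my_array   user   R   0:05  1     node02
12345_[3-10]  standard   my_array   user   PD  0:00  1     (Resources)

Running tasks are listed individually; by default, pending tasks are collapsed into a single line.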

Array Specification#

Array Ranges#

Sequential range:

#SBATCH --array=1-100        # Tasks 1 through 100

With step size:

#SBATCH --array=1-100:2      # Tasks 1,3,5,...,99
#SBATCH --array=0-50:5       # Tasks 0,5,10,...,50

Specific values:

#SBATCH --array=1,5,10,15    # Only these specific tasks

Combined:

#SBATCH --array=1-10,15,20,25-30  # Multiple ranges and values
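
To see which task IDs a specification selects before submitting, the range syntax can be expanded by hand. The sketch below is an illustration only: expand_array_spec is a hypothetical helper, not a SLURM command, and it handles just the forms shown above (ranges, steps, comma lists, and a trailing % limit, which it ignores).

```shell
#!/bin/bash
# Hypothetical helper: print the task IDs selected by an --array specification.
expand_array_spec() {
    local spec=${1%%%*}              # drop a trailing %N concurrency limit, if any
    printf '%s\n' "$spec" | tr ',' '\n' | while IFS= read -r part; do
        range=${part%%:*}            # "1-100" from "1-100:2"
        step=1
        case $part in *:*) step=${part##*:} ;; esac
        # A single value such as "15" becomes the range 15..15
        seq "${range%%-*}" "$step" "${range##*-}"
    done | paste -sd' ' -
}

expand_array_spec "1-10,15,20,25-30"
```

Running it on a small specification is a quick way to confirm that a step size starts and ends where you expect.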

Limiting Concurrent Tasks#

Limit how many tasks run simultaneously:

#SBATCH --array=1-1000%20    # Run max 20 tasks at a time

This submits 1000 tasks but only runs 20 concurrently.

Tip

Use % to limit concurrent tasks when submitting very large arrays. This keeps a single array from occupying every free node and leaves resources for other users.

Array Environment Variables#

SLURM provides special variables for array jobs:

$SLURM_ARRAY_JOB_ID       # Main job ID (same for all tasks)
$SLURM_ARRAY_TASK_ID      # Unique task ID within array
$SLURM_ARRAY_TASK_COUNT   # Total number of tasks
$SLURM_ARRAY_TASK_MIN     # First task ID
$SLURM_ARRAY_TASK_MAX     # Last task ID
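
These variables are only set inside an array job, so a script that relies on them cannot be dry-run on a login node without some fallback. A minimal sketch, assuming placeholder defaults of 1 are acceptable for local testing (task_summary is a hypothetical helper name):

```shell
#!/bin/bash
# task_summary is a hypothetical helper. The :-1 defaults are placeholders
# that only apply when the script runs outside a SLURM array job.
task_summary() {
    local id=${SLURM_ARRAY_TASK_ID:-1}
    local count=${SLURM_ARRAY_TASK_COUNT:-1}
    local min=${SLURM_ARRAY_TASK_MIN:-1}
    local max=${SLURM_ARRAY_TASK_MAX:-1}
    echo "task $id of $count (IDs $min-$max)"
}

task_summary
```

Inside a real array job the defaults are never used, because SLURM exports all four variables to every task.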

Output Files#

Using Array Variables in Filenames#

In SLURM directives:

  • %A = array job ID

  • %a = array task ID

  • %j = job ID of the individual task (each array task also has its own unique job ID, which does not contain the task index)

Example:

#SBATCH --output=logs/job_%A_task_%a.out
#SBATCH --error=logs/job_%A_task_%a.err

For array job 12345, this creates:

logs/job_12345_task_1.out
logs/job_12345_task_2.out
...
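
SLURM does not create missing directories for --output or --error paths, so logs/ must exist before the array is submitted. The sketch below creates it and previews the resulting filename; preview_log_name is a hypothetical helper that imitates only the %A and %a substitutions, purely for illustration.

```shell
#!/bin/bash
# Create the log directory before running sbatch.
mkdir -p logs

# Hypothetical helper: show the filename a pattern expands to for a given
# array job ID and task ID. Only %A and %a are handled.
preview_log_name() {
    local pattern=$1 array_job_id=$2 task_id=$3
    printf '%s\n' "$pattern" | sed -e "s/%A/$array_job_id/g" -e "s/%a/$task_id/g"
}

preview_log_name "logs/job_%A_task_%a.out" 12345 1   # prints logs/job_12345_task_1.out
```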

Practical Examples#

Processing Multiple Files#

Scenario: Process 100 data files named data_001.txt through data_100.txt

#!/bin/bash
#SBATCH --job-name=process_files
#SBATCH --output=logs/process_%A_%a.out
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --mem=8G
#SBATCH --time=02:00:00

module load python/3.11

# Create input filename with zero-padding
INPUT_FILE=$(printf "data_%03d.txt" $SLURM_ARRAY_TASK_ID)
OUTPUT_FILE=$(printf "results_%03d.txt" $SLURM_ARRAY_TASK_ID)

# Process the file
python process.py --input $INPUT_FILE --output $OUTPUT_FILE

echo "Processed $INPUT_FILE"

Using a File List#

Scenario: Process files listed in a text file

File list (files.txt):

/data/sample_A.dat
/data/sample_B.dat
/data/sample_C.dat
...

Job script:

#!/bin/bash
#SBATCH --job-name=process_list
#SBATCH --output=logs/job_%A_%a.out
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00

module load python/3.11

# Get the filename from line number equal to task ID
INPUT_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" files.txt)

# Process the file
python analyze.py "$INPUT_FILE"   # quoted in case a listed path contains spaces

echo "Processed $INPUT_FILE"
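
If the array range is larger than the file list, or the list contains a blank line, sed returns an empty string and the task runs with no input. One way to guard against that is sketched below; nth_line is a hypothetical helper, not part of SLURM.

```shell
#!/bin/bash
# Hypothetical helper: print line N of a list file, or fail with a message
# if N is out of range or the line is blank.
nth_line() {
    local list=$1 n=$2 total line
    total=$(grep -c '' "$list")          # number of lines in the list
    if [ "$n" -lt 1 ] || [ "$n" -gt "$total" ]; then
        echo "Task $n has no entry in $list ($total lines)" >&2
        return 1
    fi
    line=$(sed -n "${n}p" "$list")
    if [ -z "$line" ]; then
        echo "Line $n of $list is empty" >&2
        return 1
    fi
    printf '%s\n' "$line"
}

# In the job script:
# INPUT_FILE=$(nth_line files.txt "$SLURM_ARRAY_TASK_ID") || exit 1
```

Exiting with a non-zero status makes the task show up as FAILED in the accounting records instead of silently succeeding.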

Parameter Sweep#

Scenario: Test different parameter combinations

#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --output=logs/sweep_%A_%a.out
#SBATCH --array=1-27
#SBATCH --ntasks=1
#SBATCH --mem=8G
#SBATCH --time=04:00:00

module load python/3.11

# Define parameter arrays
ALPHAS=(0.1 0.5 1.0)
BETAS=(1.0 10.0 100.0)
GAMMAS=(0.01 0.1 1.0)

# Calculate indices (3x3x3 = 27 combinations)
NUM_BETA=3
NUM_GAMMA=3

IDX=$((SLURM_ARRAY_TASK_ID - 1))
ALPHA_IDX=$((IDX / (NUM_BETA * NUM_GAMMA)))
BETA_IDX=$(((IDX / NUM_GAMMA) % NUM_BETA))
GAMMA_IDX=$((IDX % NUM_GAMMA))

ALPHA=${ALPHAS[$ALPHA_IDX]}
BETA=${BETAS[$BETA_IDX]}
GAMMA=${GAMMAS[$GAMMA_IDX]}

echo "Running with alpha=$ALPHA, beta=$BETA, gamma=$GAMMA"

# Run simulation with these parameters
python simulate.py --alpha $ALPHA --beta $BETA --gamma $GAMMA \
    --output results_${ALPHA}_${BETA}_${GAMMA}.dat
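
The index arithmetic is easy to get wrong, so it can help to print the whole task-to-parameter table on a login node before submitting. A minimal sketch using the same 3x3x3 grid; decode_task is a hypothetical helper name:

```shell
#!/bin/bash
# Same parameter grid as the sweep script.
ALPHAS=(0.1 0.5 1.0)
BETAS=(1.0 10.0 100.0)
GAMMAS=(0.01 0.1 1.0)

# Hypothetical helper: map a 1-based task ID to one (alpha, beta, gamma)
# combination. Gamma varies fastest, alpha slowest.
decode_task() {
    local idx=$(( $1 - 1 ))
    local nb=${#BETAS[@]} ng=${#GAMMAS[@]}
    local a=$(( idx / (nb * ng) ))
    local b=$(( (idx / ng) % nb ))
    local g=$(( idx % ng ))
    echo "${ALPHAS[$a]} ${BETAS[$b]} ${GAMMAS[$g]}"
}

# Print the full table for inspection.
for task in $(seq 1 27); do
    echo "task $task: $(decode_task "$task")"
done
```

Each of the 27 lines should show a different combination; a repeated line means the array range and the grid size are out of step.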

Monte Carlo Simulations#

Scenario: Run 1000 independent simulations with different random seeds

#!/bin/bash
#SBATCH --job-name=monte_carlo
#SBATCH --output=logs/mc_%A_%a.out
#SBATCH --array=1-1000%50      # Max 50 concurrent
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=02:00:00

module load python/3.11

# Use task ID as random seed
SEED=$SLURM_ARRAY_TASK_ID

# Run simulation
python monte_carlo.py --seed $SEED --output mc_$SEED.dat

echo "Simulation $SEED completed"

GPU Array Jobs#

Scenario: Train multiple models on GPUs

#!/bin/bash
#SBATCH --job-name=train_models
#SBATCH --output=logs/train_%A_%a.out
#SBATCH --array=1-10
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=12:00:00

module load cuda/12.1
module load python/3.11

# Each task trains a different model configuration
CONFIG_FILE="config_${SLURM_ARRAY_TASK_ID}.yaml"

# Train model
python train.py --config $CONFIG_FILE --gpu 0 \
    --output models/model_$SLURM_ARRAY_TASK_ID.pth

echo "Model $SLURM_ARRAY_TASK_ID trained"

See Running Jobs on GPU Nodes for more GPU information.

Managing Array Jobs#

Viewing Array Jobs#

All your array tasks:

$ squeue -u $USER

Specific array job:

$ squeue -j 12345

One line per task:

$ squeue -u $USER -r  # -r expands pending array tasks to one line per task

Canceling Array Jobs#

Cancel entire array:

$ scancel 12345

Cancel specific task:

$ scancel 12345_5  # Cancel only task 5

Cancel range of tasks:

$ scancel '12345_[10-20]'  # Cancel tasks 10-20 (quoted so the shell does not glob the brackets)

Array Job Status#

Check completion:

$ sacct -j 12345

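Summary of task states (-X lists one line per task and skips the .batch and .extern job steps, which would otherwise be counted too):

$ sacct -X -j 12345 --format=JobID,State | grep -c COMPLETED
$ sacct -X -j 12345 --format=JobID,State | grep -c FAILED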
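
Counting one state at a time gets tedious for large arrays. The sketch below tallies every state in one pass; tally_states is a hypothetical helper that expects pipe-delimited JobID|State lines, which sacct prints when given -X -n -P. The sample input is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count array tasks per state from "JobID|State" lines,
# such as those printed by
#   sacct -X -n -P -j 12345 --format=JobID,State
tally_states() {
    awk -F'|' 'NF >= 2 { split($2, s, " "); count[s[1]]++ }
               END { for (state in count) print state, count[state] }' | sort
}

# Made-up sample input, for illustration only.
tally_states <<'EOF'
12345_1|COMPLETED
12345_2|COMPLETED
12345_3|FAILED
12345_4|CANCELLED by 1001
EOF
```

In practice the helper would sit at the end of a pipeline: sacct -X -n -P -j 12345 --format=JobID,State | tally_states. Only the first word of the state is kept, so "CANCELLED by 1001" is counted as CANCELLED. Pending tasks may appear collapsed into a single line and are then counted once.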

Post-Processing Array Results#

Combining Results#

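Merge all output files (this loop assumes each task wrote results_<task ID>.txt with an unpadded ID, such as results_7.txt):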

#!/bin/bash
# Combine results from array job
for i in {1..100}; do
    cat results_$i.txt >> combined_results.txt
done

Using Python:

import glob
import pandas as pd

# Read all result files
all_files = glob.glob("results_*.csv")
df_list = [pd.read_csv(f) for f in sorted(all_files)]

# Combine into single DataFrame
combined = pd.concat(df_list, ignore_index=True)
combined.to_csv("combined_results.csv", index=False)

Checking for Missing Tasks#

Script to check completion (assumes unpadded result names such as results_7.txt):

#!/bin/bash
ARRAY_ID=12345
NUM_TASKS=100

for i in $(seq 1 $NUM_TASKS); do
    if [ ! -f "results_${i}.txt" ]; then
        echo "Missing task $i"
    fi
done
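
The list of missing tasks can be turned directly into an --array value for resubmission. A minimal sketch, assuming the same results_<ID>.txt naming as the check above; missing_tasks is a hypothetical helper, and the sbatch command is printed rather than run:

```shell
#!/bin/bash
# Hypothetical helper: list the task IDs in 1..N whose result file is missing,
# joined with commas so the list can be passed to --array.
missing_tasks() {
    local num_tasks=$1 dir=${2:-.} i
    for i in $(seq 1 "$num_tasks"); do
        [ -f "$dir/results_${i}.txt" ] || echo "$i"
    done | paste -sd, -
}

# Print the resubmission command for review instead of running it.
MISSING=$(missing_tasks 100)
if [ -n "$MISSING" ]; then
    echo "sbatch --array=$MISSING array_job.sh"
fi
```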

Troubleshooting#

Some Tasks Failed#

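Find failed tasks (timeouts and out-of-memory kills have their own states, so match those too):

$ sacct -X -j 12345 --format=JobID,State | grep -E 'FAILED|TIMEOUT|OUT_OF_MEMORY'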

Rerun specific failed tasks:

#SBATCH --array=5,12,27,33  # Only failed task IDs
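
Typing failed IDs by hand does not scale past a handful of tasks. The sketch below builds the list from accounting output; failed_task_ids is a hypothetical helper that reads pipe-delimited JobID|State lines (as printed by sacct with -X -n -P) and treats every state other than COMPLETED, RUNNING, or PENDING as a failure, which is an assumption you may want to adjust.

```shell
#!/bin/bash
# Hypothetical helper: print a comma-separated list of array task IDs whose
# state is anything other than COMPLETED, RUNNING, or PENDING.
failed_task_ids() {
    awk -F'|' '$2 !~ /^(COMPLETED|RUNNING|PENDING)/ {
                   n = split($1, id, "_")       # "12345_7" -> id[2] = "7"
                   if (n == 2) print id[2]
               }' | sort -n | paste -sd, -
}

# Intended use (not run here):
#   sbatch --array="$(sacct -X -n -P -j 12345 --format=JobID,State | failed_task_ids)" array_job.sh
```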

Out of Memory on Some Tasks#

Check memory usage:

$ sacct -j 12345 --format=JobID,MaxRSS,ReqMem,State

Solutions:

  1. Increase memory for all tasks (wastes resources)

  2. Identify high-memory tasks and run separately

  3. Modify code to use less memory

Tasks Taking Too Long#

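Check task times, sorted by elapsed time (plain sort is used because -h is meant for sizes such as 4K or 2G, not for HH:MM:SS values):

$ sacct -X -n -j 12345 --format=JobID,Elapsed,State | sort -k2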

Solutions:

  • Increase time limit if tasks timeout

  • Investigate slow tasks

  • Consider splitting into multiple arrays by estimated runtime

Too Many Array Tasks#

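Most HPC systems limit array sizes (e.g., 1000-10000 tasks). The limit is the scheduler's MaxArraySize setting, and the largest allowed task index is MaxArraySize - 1. To check it:

$ scontrol show config | grep MaxArraySize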

If you need more:

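Option 1: Use multiple array submissions. The limit applies to the task index as well as the task count, so indices above the limit (such as --array=1001-2000) may be rejected. Reuse the same index range and pass an offset instead (OFFSET is a variable name chosen for this example):

$ sbatch --array=1-1000 --export=ALL,OFFSET=0 script.sh
$ sbatch --array=1-1000 --export=ALL,OFFSET=1000 script.sh
$ sbatch --array=1-1000 --export=ALL,OFFSET=2000 script.sh

Inside script.sh, use $((SLURM_ARRAY_TASK_ID + OFFSET)) as the item number.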

Option 2: Process multiple items per task

#!/bin/bash
#SBATCH --array=1-100

# Each task processes 100 files
START=$(((SLURM_ARRAY_TASK_ID - 1) * 100 + 1))
END=$((SLURM_ARRAY_TASK_ID * 100))

for i in $(seq $START $END); do
    python process.py file_$i.txt
done
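
The arithmetic above assumes the number of files is an exact multiple of the chunk size. When it is not, the last task needs a shorter range. A minimal sketch, where chunk_bounds is a hypothetical helper and the totals are example values:

```shell
#!/bin/bash
# Hypothetical helper: print "START END" for one task when TOTAL items are
# split into chunks of CHUNK items. The last chunk may be shorter.
chunk_bounds() {
    local task_id=$1 total=$2 chunk=$3
    local start=$(( (task_id - 1) * chunk + 1 ))
    local end=$(( task_id * chunk ))
    if [ "$end" -gt "$total" ]; then end=$total; fi
    echo "$start $end"
}

# Number of array tasks needed is ceil(TOTAL / CHUNK).
TOTAL=9950   # example value
CHUNK=100    # example value
echo "--array=1-$(( (TOTAL + CHUNK - 1) / CHUNK ))"
```

In the job script, read the pair with read START END <<< "$(chunk_bounds "$SLURM_ARRAY_TASK_ID" "$TOTAL" "$CHUNK")" and loop from START to END as before.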

Advanced Array Patterns#

Nested Arrays#

Process a matrix of conditions:

#!/bin/bash
#SBATCH --array=1-100  # 10x10 matrix

# Define 10 values for each parameter
PARAM1=($(seq 0.1 0.1 1.0))
PARAM2=($(seq 1 1 10))

# Calculate indices
IDX=$((SLURM_ARRAY_TASK_ID - 1))
I=$((IDX / 10))
J=$((IDX % 10))

# Run with specific parameters
./simulation ${PARAM1[$I]} ${PARAM2[$J]}

Dynamic Task Generation#

Generate task list dynamically:

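#!/bin/bash
# sbatch does not run shell expansions inside #SBATCH lines, so
# "#SBATCH --array=1-$(wc -l < task_list.txt)" is not evaluated.
# Set the array size on the command line instead:
#   sbatch --array=1-$(wc -l < task_list.txt) script.sh

# Read task from list
TASK=$(sed -n "${SLURM_ARRAY_TASK_ID}p" task_list.txt)

# Execute task (eval runs the line as shell code, so task_list.txt must be trusted)
eval "$TASK"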
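
Because sbatch treats #SBATCH lines as plain comments and performs no shell expansion in them, the array size for a generated task list is normally computed at submission time. The sketch below builds the command from the list length; build_submit_command is a hypothetical wrapper that prints the command instead of running it, so the range can be reviewed first.

```shell
#!/bin/bash
# Hypothetical wrapper: build the sbatch command for a task list.
# Arguments: list file, job script, optional concurrency limit.
build_submit_command() {
    local list=$1 script=$2 limit=${3:-}
    local n
    n=$(grep -c '' "$list")          # number of lines, i.e. tasks
    if [ "$n" -eq 0 ]; then
        echo "No tasks in $list" >&2
        return 1
    fi
    # Append %LIMIT only when a limit was given.
    echo "sbatch --array=1-${n}${limit:+%$limit} $script"
}

# Intended use (not run here):
#   build_submit_command task_list.txt script.sh 20
```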

Example: Complete Data Processing Pipeline#

#!/bin/bash
#SBATCH --job-name=data_pipeline
#SBATCH --output=logs/pipeline_%A_%a.out
#SBATCH --error=logs/pipeline_%A_%a.err
#SBATCH --array=1-100%20
#SBATCH --ntasks=1
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --mail-type=FAIL,ARRAY_TASKS   # ARRAY_TASKS is a per-task modifier and needs an event type such as FAIL
#SBATCH --mail-user=you@nmt.edu

# Exit on error
set -e

# Create output directories
mkdir -p results/$SLURM_ARRAY_TASK_ID

# Load modules
module purge
module load python/3.11

# Define input
INPUT_FILE=$(printf "data/input_%03d.txt" $SLURM_ARRAY_TASK_ID)
OUTPUT_DIR="results/$SLURM_ARRAY_TASK_ID"

# Check input exists
if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: $INPUT_FILE not found"
    exit 1
fi

# Process
echo "Processing $INPUT_FILE"
python preprocess.py --input $INPUT_FILE --output $OUTPUT_DIR/preprocessed.dat
python analyze.py --input $OUTPUT_DIR/preprocessed.dat --output $OUTPUT_DIR/results.txt
python visualize.py --input $OUTPUT_DIR/results.txt --output $OUTPUT_DIR/plot.png

echo "Task $SLURM_ARRAY_TASK_ID completed successfully"

Questions?#

For questions about job arrays, contact hpc@nmthpc.atlassian.net.