HPC / SLURM Setup

This guide covers deploying PoggioAI MSc (the msc CLI) on SLURM-managed HPC clusters.


Prerequisites

  • Access to a SLURM cluster with outbound internet access, at least on the partition that will run the orchestrator
  • Conda installed (Miniconda or Miniforge)
  • API keys for at least one LLM provider

Quick Start

git clone https://github.com/PoggioAI/PoggioAI_MSc.git && cd PoggioAI_MSc
git checkout MSc_Prod
conda create -n msc python=3.12 -y
conda activate msc
pip install -e ".[all]"
msc setup
msc doctor
msc run --mode hpc "Analyze convergence properties of adaptive optimizers"

Two-Tier Execution Model

Tier 1: Orchestrator (CPU)

  • Runs on a CPU partition
  • Makes outbound HTTPS calls to LLM APIs
  • Coordinates 22+ specialist agents via LangGraph

Tier 2: Experiment Jobs (GPU)

  • Submitted by the orchestrator via sbatch
  • Runs experiment execution workloads
  • Partition/resources configured in engaging_config.yaml
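As a rough illustration, a Tier 2 submission resembles an sbatch script like the one below. The directives mirror the engaging_config.yaml example; the real script and its experiment payload are generated by the orchestrator, so everything here is a placeholder sketch:

```shell
#!/bin/bash
# GPU partition and resources: placeholders matching engaging_config.yaml
#SBATCH --partition=your_gpu_partition
#SBATCH --time=7-00:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:a100:1

# Placeholder: the orchestrator injects the real experiment payload here.
echo "experiment payload would run on $(hostname)"
```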

Set CONSORTIUM_SLURM_ENABLED=1 to enable automatic GPU job submission.
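For example, in the shell (or batch script) that launches the orchestrator:

```shell
# Allow the orchestrator to submit GPU experiment jobs via sbatch.
export CONSORTIUM_SLURM_ENABLED=1
```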

Cluster Configuration

Edit engaging_config.yaml; the partition names and GRES value below are placeholders for your site:

cluster:
  name: engaging
  orchestrator:
    partition: your_partition
    time: "7-00:00:00"
    cpus: 4
    mem: "32G"
  experiment_gpu:
    partition: your_gpu_partition
    time: "7-00:00:00"
    cpus: 8
    mem: "64G"
    gres: "gpu:a100:1"
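To check that the configured partitions actually exist on your cluster, and to see their availability, time limits, and GPU resources, you can query SLURM directly. The partition names here are the same placeholders as in the config:

```shell
# Placeholders: substitute your real partition names from engaging_config.yaml.
PARTITIONS="your_partition,your_gpu_partition"
if command -v sinfo >/dev/null 2>&1; then
  # %P partition, %a availability, %l time limit, %G generic resources (GPUs)
  sinfo -p "$PARTITIONS" -o "%P %a %l %G"
fi
```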

Running on SLURM

msc run --mode hpc "Your research question"

Campaigns:

msc campaign init --name "my_project" --task "Your research directive"
msc campaign start my_project_campaign.yaml
msc campaign status my_project_campaign.yaml

OpenClaw oversight:

msc openclaw setup
msc openclaw start
msc openclaw status

Monitoring

squeue -u $USER
msc status
msc runs
msc budget
msc campaign status campaign.yaml
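Beyond squeue, SLURM's accounting database can summarize finished jobs as well as running ones. A small sketch, assuming sacct is available on your cluster:

```shell
# Jobs started today for the current user, with state and elapsed time.
START=$(date +%F)
if command -v sacct >/dev/null 2>&1; then
  sacct -S "$START" -u "$USER" --format=JobID,JobName,Partition,State,Elapsed
fi
```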

Troubleshooting

conda not found

export CONDA_INIT_SCRIPT=/path/to/miniforge3/etc/profile.d/conda.sh
source "$CONDA_INIT_SCRIPT"
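To make this survive new shells, the same lines can be guarded and appended to your shell startup file (e.g. ~/.bashrc). The install path is a placeholder, as above:

```shell
# Placeholder path: point at your actual Miniconda/Miniforge install.
export CONDA_INIT_SCRIPT=/path/to/miniforge3/etc/profile.d/conda.sh
if [ -f "$CONDA_INIT_SCRIPT" ]; then
  source "$CONDA_INIT_SCRIPT"
  conda activate msc
fi
```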

GPU experiment job fails

Check the SLURM stdout/stderr logs written to your run output directory.
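In addition to the run output directory, SLURM's own accounting records the exit code and final state of the failed job. A sketch, with a placeholder job ID:

```shell
# Placeholder job ID: use the ID from squeue or the orchestrator's logs.
JOBID=123456
if command -v sacct >/dev/null 2>&1; then
  sacct -j "$JOBID" --format=JobID,JobName,State,ExitCode,Elapsed
  # For a job still known to the controller, show full submission details.
  scontrol show job "$JOBID" || true
fi
```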

API calls fail from compute node

Use a node/partition with outbound internet for the orchestrator.
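A quick way to test outbound HTTPS from a candidate node; the endpoint below is only an example provider host, not a requirement:

```shell
# Example endpoint only; any LLM provider API host works for this check.
API_URL="https://api.openai.com/v1/models"
if command -v curl >/dev/null 2>&1; then
  curl -sS -o /dev/null --max-time 10 -w "HTTP %{http_code}\n" "$API_URL" \
    || echo "outbound HTTPS appears blocked from this node"
fi
```

Run it in an interactive job on the partition you plan to use for the orchestrator; a non-2xx/4xx result or a timeout suggests that partition cannot reach LLM APIs.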