Run on HPC (Slurm + Apptainer)¶
TrainCraft dispatches the pipeline to any Slurm cluster as dependency-chained jobs, each running in an Apptainer image. Nothing here is specific to one machine: the account, partitions, modules, image paths, and bind mounts are all config. CINECA Leonardo and LUMI appear only as worked examples at the end.
Architecture and rationale: DESIGN.md §20. Definition files and build notes: containers/.
The images¶
TrainCraft ships as a small set of images so each concern (hardware, rebuild cadence, licensing) is isolated:
| Image | Target | Contents |
|---|---|---|
| traincraft-core | CPU | traincraft + CPU science stack. The orchestrator. |
| traincraft-mlip | GPU | traincraft + PyTorch + CUDA + MACE. Sampling + training. |
| traincraft-qe | CPU | Quantum ESPRESSO (open source). DFT labeling. |
| traincraft-dft | CPU | FHI-aims (licensed). DFT labeling, needed for polarizability. |
You only build the DFT image you actually use. QE is fully open source and needs no license, so the whole workflow can run end-to-end with open tooling only.
Run model — the orchestrator dispatches Slurm steps¶
traincraft-core does not nest container execs. Each enabled stage becomes a
Slurm job that runs apptainer exec on the right image and, inside it, traincraft stage <name>:
# GPU sampling/training — traincraft runs inside the mlip image
apptainer exec --nv --bind "$SCRATCH" traincraft-mlip.sif \
traincraft stage sample config.toml
# DFT label — traincraft runs in core; the engine binary runs in its DFT image
# under srun via the injected command (QE shown; FHI-aims is identical):
export TRAINCRAFT_PW_COMMAND="srun --mpi=pmix apptainer exec --bind $SCRATCH traincraft-qe.sif pw.x"
apptainer exec --bind "$SCRATCH" traincraft-core.sif traincraft stage label config.toml
traincraft submit generates and chains these for you (--dependency=afterok).
The DFT command is injected from the environment, never hard-coded in a plugin
(DESIGN §20.3), so the same config runs locally, with QE, with FHI-aims, on any
cluster.
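The chaining described above can be sketched in a few lines. This is an illustration only, not TrainCraft's implementation: the function names and the script file names are hypothetical. It relies on two standard sbatch options, `--parsable` (print only the job ID) and `--dependency=afterok:<jobid>`.

```python
from typing import Callable, List, Optional, Sequence


def sbatch_argv(script: str, depends_on: Optional[str] = None) -> List[str]:
    """Build the sbatch command line for one stage script."""
    argv = ["sbatch", "--parsable"]  # --parsable prints just the job ID
    if depends_on is not None:
        # Start only if the previous job finished with exit code 0.
        argv.append(f"--dependency=afterok:{depends_on}")
    argv.append(script)
    return argv


def submit_chain(scripts: Sequence[str], run: Callable[[List[str]], str]) -> List[str]:
    """Submit scripts in order; each job waits for the one before it.

    `run` executes an argv list and returns its stdout, so the real
    submission (or a fake one for testing) is injected by the caller.
    """
    job_ids: List[str] = []
    previous: Optional[str] = None
    for script in scripts:
        out = run(sbatch_argv(script, previous))
        # --parsable output is "<jobid>" or "<jobid>;<cluster>".
        previous = out.strip().split(";")[0]
        job_ids.append(previous)
    return job_ids
```

On a real cluster, `run` could wrap `subprocess.run(argv, check=True, capture_output=True, text=True).stdout`. Because each link uses `afterok`, a failed stage leaves the later jobs unstarted rather than running them on incomplete data.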
Building images¶
cd containers
apptainer build --fakeroot traincraft-core.sif traincraft-core.def
apptainer build --fakeroot traincraft-mlip.sif traincraft-mlip.def
apptainer build --fakeroot traincraft-qe.sif traincraft-qe.def # open source
# FHI-aims is licensed: supply the source at build time; never publish the .sif
apptainer build --fakeroot --build-arg AIMS_SRC=/path/to/fhi-aims.tar.gz \
traincraft-dft.sif traincraft-dft.def
If your login nodes do not allow root builds, use --fakeroot where it is permitted,
or build off-cluster and copy the .sif over.
Configuring dispatch¶
[orchestration]
engine = "slurm"
[orchestration.slurm]
account = "<your-account>" # your scheduler account/project
runtime = "apptainer" # "apptainer" (our images) or "native" (host binaries)
mpi = "pmix" # Slurm MPI plugin — see "Picking the MPI plugin"
sif_dir = "$WORK/sif" # where the .sif images live
modules = ["apptainer"] # `module load` lines for your site
binds = ["$SCRATCH", "$WORK"] # filesystems to bind into the containers
[orchestration.slurm.stages.sample]
image = "traincraft-mlip.sif"
partition = "<gpu-partition>"
gpus = 1
[orchestration.slurm.stages.label]
partition = "<cpu-partition>"
nodes = 2
ntasks = 224
# pw_command / aims_command default to the qe / dft images; override if needed.
Submit:
traincraft submit config.toml --dry-run # render + inspect the sbatch scripts
traincraft submit config.toml # sbatch, dependency-chained
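To make the mapping from a stage table to a batch script concrete, here is a rough sketch of how such keys could translate into `#SBATCH` directives. It is an assumption about the general shape, not the actual renderer: `sbatch_header` is a hypothetical helper, and the authoritative output is whatever `--dry-run` prints. The options used (`--account`, `--partition`, `--nodes`, `--ntasks`, `--gpus`) are standard sbatch options; some sites expect `--gres=gpu:N` instead of `--gpus`.

```python
from typing import Dict, List, Union

Value = Union[str, int]


def sbatch_header(job_name: str, account: str, stage: Dict[str, Value]) -> List[str]:
    """Map one stage table to #SBATCH directive lines (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={stage['partition']}",
    ]
    # Optional keys map one-to-one onto standard sbatch options.
    for key, flag in (("nodes", "--nodes"), ("ntasks", "--ntasks"), ("gpus", "--gpus")):
        if key in stage:
            lines.append(f"#SBATCH {flag}={stage[key]}")
    return lines


label_stage = {"partition": "<cpu-partition>", "nodes": 2, "ntasks": 224}
print("\n".join(sbatch_header("label", "<your-account>", label_stage)))
```

With nodes = 2 and ntasks = 224, Slurm places 112 tasks per node when the tasks are spread evenly, so it is worth checking that this matches the core count of the target partition.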
Cluster profiles — switch HPC by changing one line¶
The [orchestration.slurm] block above is per cluster, not per workflow.
Rather than copy it into every config, save it once as a named profile and
reference it. Profiles live in ~/.traincraft/clusters/<name>.toml (override the
directory with TRAINCRAFT_CLUSTERS_DIR) and contain exactly the keys that would
go under [orchestration.slurm]:
# ~/.traincraft/clusters/leonardo.toml
account = "EUHPC_xxxxxxx"
runtime = "apptainer"
mpi = "pmix"
sif_dir = "$WORK/sif"
modules = ["apptainer"]
binds = ["$SCRATCH", "$WORK"]
[stages.sample]
partition = "boost_usr_prod"
gpus = 1
[stages.label]
partition = "dcgp_usr_prod"
nodes = 2
ntasks = 224
A workflow then targets that cluster with one line:
[orchestration]
engine = "slurm"
[orchestration.slurm]
profile = "leonardo" # ← change to "lumi" to run the same workflow elsewhere
Any inline [orchestration.slurm] keys you add override the profile (handy
for testing — e.g. drop ntasks while debugging), and inline per-stage tables
deep-merge onto the profile's. The profile is the base; inline wins.
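The "profile is the base; inline wins" rule is an ordinary recursive dictionary merge. The sketch below shows the idea under stated assumptions: `profile_path` and `deep_merge` are hypothetical names, not TrainCraft functions, and the dictionaries stand in for parsed TOML tables.

```python
import os
from pathlib import Path
from typing import Any, Dict, Mapping


def profile_path(name: str, env: Mapping[str, str]) -> Path:
    """Locate <name>.toml, honouring TRAINCRAFT_CLUSTERS_DIR if it is set."""
    root = env.get("TRAINCRAFT_CLUSTERS_DIR") or os.path.expanduser("~/.traincraft/clusters")
    return Path(root) / f"{name}.toml"


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict: nested tables merge key by key, other values are replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Stand-in for the parsed leonardo.toml profile.
profile = {
    "account": "EUHPC_xxxxxxx",
    "mpi": "pmix",
    "stages": {"label": {"partition": "dcgp_usr_prod", "nodes": 2, "ntasks": 224}},
}
# Stand-in for inline [orchestration.slurm] keys used while debugging.
inline = {"stages": {"label": {"nodes": 1, "ntasks": 8}}}

effective = deep_merge(profile, inline)
print(effective["stages"]["label"])
```

Here the label stage keeps its partition from the profile while nodes and ntasks come from the inline table, and the profile dictionary itself is left unmodified.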
Two knobs that make it portable: runtime and mpi¶
There is no universal MPI setup — the cluster decides. TrainCraft exposes this as two independent config switches instead of baking in assumptions:
| Switch | Values | What it controls |
|---|---|---|
| runtime | apptainer, native | Reach binaries via our images, or via host binaries already installed (site modules / conda / EasyBuild). native drops the container wrapper entirely. |
| mpi | pmix, cray_shasta, pmi2, none | The Slurm MPI plugin used to launch the multi-node DFT step (srun --mpi=<plugin>). |
Both can be overridden per stage in [orchestration.slurm.stages.*].
Picking the MPI plugin¶
Run this on the target cluster; its output is the ground truth:
srun --mpi=list
- pmix present (InfiniBand + Slurm, e.g. Leonardo) → mpi = "pmix". Our images carry a self-contained OpenMPI+UCX+PMIx, so Slurm does the wire-up and no host MPI is needed.
- No pmix, Cray/Slingshot (e.g. LUMI shows only cray_shasta / pmi2) → mpi = "cray_shasta". On Cray the path of least resistance is runtime = "native" with the site's cray-mpich and its FHI-aims/QE module, rather than fighting ABI translation inside a container.
- Anything else → mpi = "pmi2" is the portable fallback.
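The decision rule above can be written down as a small helper that reads the text printed by `srun --mpi=list`. This is a sketch under assumptions: the function names are hypothetical, and the exact layout of the listing differs between Slurm versions, so the parser only looks for known plugin names as whole tokens.

```python
import re
from typing import Set

KNOWN_PLUGINS = {"pmix", "cray_shasta", "pmi2", "none"}


def available_plugins(mpi_list_output: str) -> Set[str]:
    """Extract plugin names from the text printed by `srun --mpi=list`."""
    found: Set[str] = set()
    for token in re.findall(r"[A-Za-z0-9_]+", mpi_list_output):
        if token in KNOWN_PLUGINS:
            found.add(token)
        elif token.startswith("pmix_v"):
            # Versioned entries such as pmix_v4 also mean PMIx is available.
            found.add("pmix")
    return found


def pick_mpi(mpi_list_output: str) -> str:
    """Apply the rule: pmix if present, else cray_shasta, else pmi2."""
    plugins = available_plugins(mpi_list_output)
    if "pmix" in plugins:
        return "pmix"
    if "cray_shasta" in plugins:
        return "cray_shasta"
    return "pmi2"  # portable fallback
```

When capturing the listing from a script, collect both stdout and stderr, since depending on the Slurm version the list may be written to either stream.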
When to use runtime = "native"¶
Use it when the cluster already provides tuned binaries (a site/EasyBuild FHI-aims
or QE, or your own conda/venv), or when bind-mounting containers is awkward (Cray).
TrainCraft then renders bare srun --mpi=<plugin> aims.x and runs traincraft
stage … directly — put the needed module load / source activate lines in
modules and pre_commands. See the LUMI example below.
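The effect of the two switches on the DFT launch line can be illustrated as follows. This is a minimal sketch, not the actual renderer: `dft_launch` is a hypothetical function, and the command shapes are taken from the examples on this page.

```python
from typing import List, Sequence


def dft_launch(
    runtime: str,
    mpi: str,
    binary: str,
    image: str = "",
    binds: Sequence[str] = (),
) -> List[str]:
    """Build the srun command for the DFT engine under either runtime."""
    argv = ["srun", f"--mpi={mpi}"]
    if runtime == "apptainer":
        if not image:
            raise ValueError("runtime 'apptainer' needs an image")
        # Wrap the engine binary in the container, binding each filesystem.
        argv += ["apptainer", "exec"]
        for path in binds:
            argv += ["--bind", path]
        argv.append(image)
    elif runtime != "native":
        raise ValueError(f"unknown runtime: {runtime!r}")
    # With runtime 'native' the host binary is launched directly.
    argv.append(binary)
    return argv


print(" ".join(dft_launch("native", "cray_shasta", "aims.x")))
print(" ".join(dft_launch("apptainer", "pmix", "pw.x",
                          image="traincraft-qe.sif", binds=["$SCRATCH"])))
```

Keeping the container wrapper and the MPI plugin as separate inputs is what lets the same stage definition produce either form without touching the rest of the config.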
Worked example: CINECA Leonardo (Apptainer + PMIx)¶
Leonardo has a GPU Booster (A100) and a CPU DCGP partition, and
srun --mpi=list shows pmix. So: our images, PMIx launch. Full file:
examples/19_hpc_leonardo_label.toml.
[orchestration.slurm]
account = "EUHPC_xxxxxxx"
runtime = "apptainer"
mpi = "pmix"
sif_dir = "$WORK/sif"
modules = ["apptainer"]
binds = ["$SCRATCH", "$WORK"]
[orchestration.slurm.stages.sample] # MACE on the Booster
image = "traincraft-mlip.sif"
partition = "boost_usr_prod"
gpus = 1
[orchestration.slurm.stages.label] # DFT on DCGP
partition = "dcgp_usr_prod"
nodes = 2
ntasks = 224
Fill in the TODO(site) markers in containers/traincraft-dft.def (target arch, MKL
link line) when building the FHI-aims image.
Worked example: LUMI (native + cray_shasta)¶
LUMI is a Cray EX: Slingshot interconnect, and srun --mpi=list shows no
pmix (only cray_shasta / pmi2). The clean path is runtime = "native" using
the site's cray-mpich and FHI-aims/QE modules. Full file:
examples/20_hpc_lumi_native.toml.
[orchestration.slurm]
account = "project_465xxxxxx"
runtime = "native" # use host binaries, not our .sif images
mpi = "cray_shasta" # no pmix on Cray; this drives Slingshot
binds = []
modules = ["LUMI/24.03", "partition/C", "cray-mpich"]
pre_commands = ["source $HOME/traincraft-venv/bin/activate"] # traincraft on PATH
[orchestration.slurm.stages.sample]
partition = "standard-g"
gpus = 1
[orchestration.slurm.stages.label]
partition = "standard"
nodes = 2
ntasks = 256
pre_commands = ["module load fhi-aims/240507"] # the site's FHI-aims build
This renders a bare srun --mpi=cray_shasta aims.x with no container. Apart from
site details (account, partitions, modules), the two knobs (runtime, mpi) are the
only things that change between Leonardo and LUMI; the science config is identical.