TrainCraft
Build training data for machine-learned interatomic potentials — systematically.
TrainCraft is a modular Python toolkit for generating, selecting, and labeling atomic structures to train MACE and other MLIPs. It covers the full pipeline: geometry → sampling → selection → (DFT labeling) → training — with a clean config file as the only glue.
Why TrainCraft?
-
Every structure type covered
Bulk crystals, point defects, bare slabs, molecules on surfaces, 2D bilayers, moiré twists, SMILES-derived conformers — all from one TOML key.
-
Principled dataset selection
A composable funnel — physicality → dedup → diversity (FPS) — removes unphysical frames and redundant near-duplicates before any expensive DFT.
-
MACE-first, model-agnostic
Foundation models (mace-mp0, mace-off23) work out of the box. Swap in a local fine-tuned checkpoint with one config line.
-
Plugin architecture
Every builder, calculator, sampler, and selector is a decorated function in the registry. Adding a new capability is one new file — no dispatcher to edit.
-
Provenance everywhere
Every frame records exactly how it was made: source, builder, transforms, calculator, seed. The origin tag keeps cheap and expensive data separable.
-
Config is data
One TOML file drives the whole workflow. Validated by pydantic v2 — typos fail loudly. The same format a future workflow editor would emit.
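The plugin idea above (a registry of decorated functions, looked up by the `type` key of a config section) can be pictured with a minimal sketch. This is not TrainCraft's actual API: `register`, `get`, the `"dimer"` builder, and the tuple-based atom format are hypothetical names chosen only to illustrate the pattern.

```python
from typing import Callable, Dict

# Hypothetical registry: maps (kind, name) -> plugin function.
_REGISTRY: Dict[tuple, Callable] = {}


def register(kind: str, name: str) -> Callable:
    """Return a decorator that stores the function under (kind, name)."""
    def decorator(func: Callable) -> Callable:
        key = (kind, name)
        if key in _REGISTRY:
            raise ValueError(f"duplicate plugin: {kind}/{name}")
        _REGISTRY[key] = func
        return func
    return decorator


def get(kind: str, name: str) -> Callable:
    """Look up a plugin by kind and name; unknown names fail loudly."""
    try:
        return _REGISTRY[(kind, name)]
    except KeyError:
        known = sorted(n for k, n in _REGISTRY if k == kind)
        raise KeyError(f"unknown {kind} '{name}'; known: {known}") from None


# Adding a capability is one decorated function (illustrative toy builder).
@register("builder", "dimer")
def build_dimer(symbol: str = "H", distance: float = 0.74) -> list:
    """Toy builder: two atoms along x, as (symbol, x, y, z) tuples."""
    return [(symbol, 0.0, 0.0, 0.0), (symbol, distance, 0.0, 0.0)]


# A parsed config section resolves to a function call by its "type" key.
section = {"type": "dimer", "distance": 1.1}
params = {k: v for k, v in section.items() if k != "type"}
atoms = get("builder", section["type"])(**params)
```

Because lookup goes through the registry, a new builder needs no change to any central dispatcher; a mistyped `type` raises an error that lists the known names.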
30-second install
# Recommended: pixi (manages conda-forge + PyPI in one lockfile)
curl -fsSL https://pixi.sh/install.sh | sh
git clone https://github.com/basillicus/traincraft && cd traincraft
pixi install # core dependencies
pixi install -e dev # + pytest / ruff / mypy
# Alternative: pip / uv
pip install traincraft # or: uv pip install traincraft
pip install "traincraft[geometry]" # + rdkit, pymatgen, packmol
30-second example
Create my_run.toml:
[run]
name = "hello_traincraft"
[geometry.builder]
type = "nanotube"
n = 5
m = 0
length = 1
[calculator]
type = "emt"
[sampling]
type = "md"
temperature = 300.0
steps = 200
interval = 20
[selection]
steps = ["physicality", "dedup", "diversity"]
budget = 5
[dataset]
path = "dataset"
Run it:
Done:
workspace: runs/hello_traincraft
n_candidates: 11
n_selected: 5
dataset: runs/hello_traincraft/dataset.extxyz
Five diverse, physically valid frames — ready to label with DFT. That's the whole loop.
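To give a feel for the diversity step, here is a minimal sketch of greedy farthest point sampling (FPS), the technique named in the selection funnel. It is an illustration under assumptions, not TrainCraft's implementation: the function name, the choice of starting frame, and the one-dimensional toy descriptors are all hypothetical, and a real pipeline would use per-frame structural descriptors.

```python
import math
from typing import List, Sequence


def farthest_point_sampling(
    points: Sequence[Sequence[float]], budget: int, start: int = 0
) -> List[int]:
    """Greedily pick up to `budget` indices, each as far as possible
    from the points already picked (Euclidean distance)."""
    n = len(points)
    if n == 0 or budget <= 0:
        return []
    budget = min(budget, n)
    selected = [start]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < budget:
        nxt = max(range(n), key=min_dist.__getitem__)
        if min_dist[nxt] == 0.0:
            break  # only exact duplicates of selected points remain
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy descriptors: three clusters near 0, 5, and 10 (hypothetical data).
descriptors = [[0.0], [0.1], [0.2], [5.0], [5.1], [10.0]]
picked = farthest_point_sampling(descriptors, budget=3)
print(picked)  # → [0, 5, 3]
```

With a budget of 3, the sketch picks one frame from each cluster rather than three near-identical frames, which is the behavior the `budget` key in the example config relies on.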
What's next?
Start with Tutorial 1: Your First Dataset — it walks through every section of the config in detail, explaining why each choice exists.
Jump to the tutorial for your system:
- Molecules & SMILES — organic molecules, RDKit conformers
- Molecules on Surfaces — adsorbate coverage, MC sampling
- Crystals & Defects — bulk, vacancies, substitutions
- Slabs & Strain — surface models, mechanical deformation
- 2D Materials — graphene, hBN, MoS₂, moiré stacks
See the Config Schema for every TOML field, the CLI reference for all commands, or the Python API for library usage.