Tutorial 1 · Your First Dataset
What you'll learn: how every section of a TrainCraft TOML config works, and what happens at each stage of the pipeline.
Prerequisites: TrainCraft installed (core only — no heavy deps).
Time: ~10 minutes.
The complete config
Create examples/01_cnt_emt_md.toml (or use the one already in the repo):
[run]
name = "01_cnt_emt_md"
outdir = "runs"
seed = 42
[geometry.builder]
type = "nanotube"
n = 5
m = 0
length = 1
vacuum = 6.0
[calculator]
type = "emt"
[sampling]
type = "md"
temperature = 300.0
steps = 50
interval = 10
timestep = 1.0
[selection]
steps = ["physicality", "dedup", "diversity"]
budget = 3
min_distance = 0.7
[dataset]
path = "dataset"
Run it:
Now let's understand every line.
[run]: bookkeeping
- name: the subdirectory created inside outdir. All outputs land in runs/01_cnt_emt_md/. Changing the name starts a fresh workspace and leaves the old one untouched, so reruns are safe.
- outdir: where workspaces live, relative to the working directory when you call traincraft run.
- seed: seeds NumPy's global RNG so every run is reproducible. Set to null to use a random seed (TOML itself has no null literal, so in a TOML file this most likely means leaving the key out).
[geometry]: building the structure
[geometry.builder]
type = "nanotube" # (1)
n = 5 # (2)
m = 0 # (3)
length = 1 # (4)
vacuum = 6.0 # (5)
- type: selects a registered builder. Available: nanotube, molecule, crystal, slab, layered, surface_adsorbate, surface_packing.
- n, m: the chiral indices of the carbon nanotube. (5,0) is a zigzag CNT; (5,5) would be armchair.
- length: number of unit cells along the tube axis.
- vacuum: adds vacuum/2 Å of vacuum on each side in the radial directions. The cell is non-periodic in x and y and periodic in z (the tube axis).
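The chiral indices fix the tube's type and size. The sketch below uses the standard textbook formulas for an ideal nanotube, assuming a graphene lattice constant of 2.46 Å; the helper names are illustrative and are not part of TrainCraft.

```python
import math

A_GRAPHENE = 2.46  # graphene lattice constant in Å (about 1.42 Å bond × √3)

def classify(n: int, m: int) -> str:
    """Name the nanotube family for chiral indices (n, m)."""
    if n == 0 or m == 0:
        return "zigzag"
    if n == m:
        return "armchair"
    return "chiral"

def diameter(n: int, m: int, a: float = A_GRAPHENE) -> float:
    """Ideal tube diameter in Å: d = a * sqrt(n² + nm + m²) / π."""
    return a * math.sqrt(n * n + n * m + m * m) / math.pi

def atoms_per_unit_cell(n: int, m: int) -> int:
    """Atoms in one translational unit cell: 4(n² + nm + m²) / gcd(2n+m, 2m+n)."""
    d_r = math.gcd(2 * n + m, 2 * m + n)
    return 4 * (n * n + n * m + m * m) // d_r

for n, m in [(5, 0), (5, 5), (6, 4)]:
    print((n, m), classify(n, m), round(diameter(n, m), 2), atoms_per_unit_cell(n, m))
```

For the (5,0) tube in this tutorial the formulas give a diameter of roughly 3.9 Å and 20 atoms per unit cell, so the 6 Å of total vacuum is larger than the tube itself.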
Instead of a builder, you can use a source to load an existing file:
Or download from a URL:
Transforms
After the geometry is built, you can chain transforms. For example, to triple the unit cell along the tube axis and then randomly perturb the atoms:
[[geometry.transforms]]
type = "supercell"
repeat = [1, 1, 3]
[[geometry.transforms]]
type = "perturb"
stddev = 0.05
Transforms are applied in order. Available transforms: supercell, vacuum,
perturb, strain, rotate, set_pbc.
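Because transforms run in order, the sequence matters. The toy sketch below mimics a supercell followed by a perturb on plain coordinate tuples. The function names are hypothetical, and it assumes that stddev is the standard deviation of Gaussian noise in Å, which this tutorial does not state explicitly.

```python
import random

def supercell_z(positions, cell_z, repeat_z):
    """Replicate atoms repeat_z times along z; return new positions and cell length."""
    replicated = [
        (x, y, z + k * cell_z)
        for k in range(repeat_z)
        for (x, y, z) in positions
    ]
    return replicated, cell_z * repeat_z

def perturb(positions, stddev, rng):
    """Add independent Gaussian noise to every coordinate."""
    return [tuple(c + rng.gauss(0.0, stddev) for c in p) for p in positions]

# Toy two-atom cell, not a real nanotube.
positions = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.42)]
cell_z = 4.26
rng = random.Random(42)  # fixed seed for reproducibility

tripled, new_cell_z = supercell_z(positions, cell_z, 3)
noisy = perturb(tripled, 0.05, rng)

print(len(noisy))  # → 6
```

Applying the steps in the opposite order would copy the same displaced atoms into every periodic image, so the three images would share identical noise. Perturbing after the supercell gives each image its own displacements.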
[calculator]: the energy/force engine
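EMT (Effective Medium Theory) is a simple potential built into ASE. It requires no extra dependencies. Its parameters are fitted for a handful of fcc metals; ASE also ships rough parameters for H, C, N, and O that are intended for testing only, so the energies and forces for this carbon nanotube are not physically meaningful. That makes EMT well suited to tests and demonstrations, not to production data.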
For real science, switch to:
pixi install -e science.
pixi install -e mace.
See Tutorial 7.
[sampling]: exploring configuration space
[sampling]
type = "md" # (1)
temperature = 300.0 # (2)
steps = 50 # (3)
interval = 10 # (4)
timestep = 1.0 # (5)
- type = "md": Langevin NVT molecular dynamics via ASE. Other options: rattle (random displacements, good for periodic solids) and monte_carlo (rigid-body moves plus conformer swaps, ideal for molecules on surfaces).
- temperature: in kelvin. Higher temperatures explore further from equilibrium and generate more diverse frames.
- steps: total MD steps. 50 steps × 1 fs timestep = 50 fs of dynamics.
- interval: save a frame every interval steps. Here: 50 / 10 = 5 frames, plus the initial frame, for 6 in total.
- timestep: in femtoseconds. 1–2 fs is typical.
How many frames to generate?
Generate roughly 5–10× more candidates than your final budget. The selection funnel will pick the most diverse, physically valid subset.
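The frame arithmetic above is easy to check before launching a run. This is a small sketch with hypothetical helper names. It assumes the initial frame is saved, as described for this example, and that a trailing partial interval produces no frame.

```python
def md_frame_count(steps: int, interval: int, include_initial: bool = True) -> int:
    """Frames saved when writing one frame every `interval` steps."""
    return steps // interval + (1 if include_initial else 0)

def simulated_time_fs(steps: int, timestep_fs: float) -> float:
    """Total simulated time in femtoseconds."""
    return steps * timestep_fs

def steps_for_candidates(n_candidates: int, interval: int) -> int:
    """MD steps needed to collect n_candidates frames, counting the initial one."""
    return (n_candidates - 1) * interval

print(md_frame_count(50, 10))       # → 6
print(simulated_time_fs(50, 1.0))   # → 50.0

# For budget = 3, the 5x to 10x rule of thumb means 15 to 30 candidates.
budget = 3
for factor in (5, 10):
    print(factor, steps_for_candidates(budget * factor, interval=10))
```

With interval = 10, the rule of thumb translates to roughly 140 to 290 MD steps for a budget of 3, compared with the 50 steps used in this deliberately short example.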
[selection]: the quality filter
This is where TrainCraft earns its name. Before any expensive DFT labeling, the funnel removes bad and redundant frames:
[selection]
steps = ["physicality", "dedup", "diversity"] # (1)
budget = 3 # (2)
min_distance = 0.7 # (3)
- steps: the ordered list of filter stages:
  - physicality: drops frames where any two atoms are closer than min_distance Å (catches MD crashes and overlapping atoms).
  - dedup: removes exact duplicates by content hash.
  - diversity: farthest-point sampling over a histogram descriptor; keeps the most structurally diverse subset.
- budget: the maximum number of frames to keep after all filters. Here we keep 3 of the 6 candidates.
- min_distance: the physicality threshold in Å. 0.7 Å is conservative (almost any pair of atoms closer than this is unphysical).
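The two cheap stages can be sketched in a few lines. This is an illustration of the idea, not TrainCraft's implementation: the distance check ignores periodic images, and the hash is taken over element symbols and coordinates rounded to six decimals, both of which are assumptions made for this example.

```python
import hashlib
import math
from itertools import combinations

def is_physical(positions, min_distance):
    """True if no pair of atoms is closer than min_distance (no periodic images)."""
    closest = min(
        (math.dist(a, b) for a, b in combinations(positions, 2)),
        default=math.inf,  # a single atom has no pairs and always passes
    )
    return closest >= min_distance

def content_hash(symbols, positions, decimals=6):
    """Hash of element symbols plus rounded coordinates."""
    text = ";".join(
        f"{s}:{x:.{decimals}f},{y:.{decimals}f},{z:.{decimals}f}"
        for s, (x, y, z) in zip(symbols, positions)
    )
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def cheap_filters(frames, min_distance):
    """Apply physicality, then dedup, preserving input order."""
    seen, kept = set(), []
    for symbols, positions in frames:
        if not is_physical(positions, min_distance):
            continue
        digest = content_hash(symbols, positions)
        if digest in seen:
            continue
        seen.add(digest)
        kept.append((symbols, positions))
    return kept

good = (["C", "C"], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.42)])
crashed = (["C", "C"], [(0.0, 0.0, 0.0), (0.0, 0.0, 0.30)])

survivors = cheap_filters([good, good, crashed], min_distance=0.7)
print(len(survivors))  # → 1
```

The repeated frame is removed by the hash check and the frame with atoms 0.30 Å apart fails the distance check, so one frame survives.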
Reorder the funnel for speed
Run physicality and dedup first — they're cheap. Only then run
diversity (FPS), which is O(n²) in the number of surviving frames.
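A minimal sketch of farthest-point sampling shows why the stage is the expensive one. It assumes a normalized pair-distance histogram as the descriptor and Euclidean distance between descriptors; the descriptor TrainCraft actually uses is not specified in this tutorial. This greedy variant performs about n × budget descriptor comparisons, which approaches n² when the budget is comparable to the number of frames.

```python
import math
from itertools import combinations

def distance_histogram(positions, r_max=6.0, bins=12):
    """Normalized histogram of pair distances below r_max (assumed descriptor)."""
    hist = [0.0] * bins
    for a, b in combinations(positions, 2):
        d = math.dist(a, b)
        if d < r_max:
            hist[min(bins - 1, int(d / r_max * bins))] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def farthest_point_sampling(descriptors, budget):
    """Greedily pick indices that maximize the distance to the chosen set."""
    n = len(descriptors)
    if n == 0 or budget <= 0:
        return []
    chosen = [0]  # start from the first frame
    nearest = [math.dist(d, descriptors[0]) for d in descriptors]
    while len(chosen) < min(budget, n):
        remaining = [i for i in range(n) if i not in chosen]
        nxt = max(remaining, key=lambda i: nearest[i])
        chosen.append(nxt)
        for i, d in enumerate(descriptors):
            nearest[i] = min(nearest[i], math.dist(d, descriptors[nxt]))
    return chosen

# Three toy dimers with bond lengths 1.0, 1.1, and 3.0 Å.
frames = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.1)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 3.0)],
]
descriptors = [distance_histogram(p) for p in frames]
selected = farthest_point_sampling(descriptors, budget=2)
print(selected)  # → [0, 2]
```

The first two dimers fall into the same histogram bin, so the sampler skips the near-duplicate and picks the stretched dimer instead.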
[dataset]: persistent, hash-deduped storage
TrainCraft writes runs/01_cnt_emt_md/dataset.extxyz. If you run the pipeline
multiple times (e.g., after updating the geometry), frames are appended and
deduplicated by content hash — you never get exact duplicates across runs.
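The append-and-deduplicate behaviour can be illustrated with a toy store. This sketch keeps one tab-separated line per frame, hash first and payload second. That layout is an assumption made for the example and is not the extxyz format that TrainCraft writes.

```python
import hashlib
import tempfile
from pathlib import Path

def frame_hash(frame_text: str) -> str:
    """Content hash of a serialized frame."""
    return hashlib.sha256(frame_text.encode("utf-8")).hexdigest()

def append_unique(dataset_path, frames) -> int:
    """Append frames whose hash is not yet stored; return how many were written."""
    path = Path(dataset_path)
    known = set()
    if path.exists():
        lines = path.read_text(encoding="utf-8").splitlines()
        known = {line.split("\t", 1)[0] for line in lines if line}
    written = 0
    with path.open("a", encoding="utf-8") as fh:
        for text in frames:
            digest = frame_hash(text)
            if digest in known:
                continue  # already stored by this run or an earlier one
            fh.write(f"{digest}\t{text}\n")
            known.add(digest)
            written += 1
    return written

with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset.tsv"
    first = append_unique(dataset, ["frame-A", "frame-B"])   # first run
    second = append_unique(dataset, ["frame-B", "frame-C"])  # rerun with overlap
    total = len(dataset.read_text(encoding="utf-8").splitlines())

print(first, second, total)  # → 2 1 3
```

The second call skips frame-B because its hash is already on disk, which mirrors how repeated pipeline runs grow the dataset without storing exact duplicates.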
Inspecting the results
from traincraft import read_frames

frames = read_frames("runs/01_cnt_emt_md/dataset.extxyz")
print(f"Got {len(frames)} frames")
for f in frames:
    print(f.provenance.source, f.provenance.transforms)
    # → "builder:nanotube:5-0-l1" []
The extxyz file is also readable by any ASE-compatible tool (OVITO, ASE, etc.):
Summary
| Stage | What happened |
|---|---|
| geometry | Built a (5,0) CNT with 6 Å vacuum in x/y |
| sampling | 50 steps of Langevin MD at 300 K → 6 candidate frames |
| selection | physicality → dedup → diversity kept 3 frames |
| dataset | Written to dataset.extxyz with provenance |
Next: Tutorial 2 — building molecules from chemical names and SMILES strings.