Provenance
Every Structure in TrainCraft carries a Provenance record that answers
the question: "where did this frame come from, and how was it modified?"
Provenance is the foundation of reproducibility and dataset management.
The Provenance dataclass
from dataclasses import dataclass

@dataclass
class Provenance:
    origin: str                # (1)
    source: str | None         # (2)
    transforms: list[str]      # (3)
    calculator: str | None     # (4)
    level_of_theory: dict      # (5)
    seed: int | None           # (6)
    parents: list[str]         # (7)
    extra: dict                # (8)
(1) origin: the most important field. One of four tags, in order of increasing cost:
    - "generated": built from scratch (geometry only, no DFT)
    - "ml_sampled": frame produced by an MLIP-driven MD/MC run
    - "ml_labeled": frame labeled by an MLIP (not ground truth)
    - "dft_labeled": frame labeled by DFT (ground truth)
(2) source: a colon-separated string identifying what produced this frame. Examples: "builder:crystal:Cu", "source:smiles:CCO", "source:url:...".
(3) transforms: ordered list of transform names applied after the source/builder. Example: ["supercell:(2,2,2)", "strain:hydro=0.02"].
(4) calculator: which calculator produced the properties dict. Set by the sampling engine after running a calculation.
(5) level_of_theory: DFT-specific metadata (functional, basis set, …). Reserved for Phase 2 (DFT labeling).
(6) seed: the RNG seed used for reproducible sampling.
(7) parents: hashes of parent structures. Enables lineage tracking across active-learning iterations.
(8) extra: builder-specific metadata. SMILES builders store {"smiles": "canonical_smiles", "fragment_smiles": {"0": "..."}}. Defect builders store {"defects": [...]}.
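The field list above can be illustrated with a small stand-in class. This is a minimal sketch, not TrainCraft's implementation: the name `ProvenanceSketch`, the default values, the `ORIGINS` tuple, and the validation in `__post_init__` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# The four origin tags listed above, ordered from cheapest to most expensive.
ORIGINS = ("generated", "ml_sampled", "ml_labeled", "dft_labeled")


@dataclass
class ProvenanceSketch:
    """Hypothetical stand-in for Provenance; the defaults are assumptions."""

    origin: str
    source: Optional[str] = None
    transforms: list = field(default_factory=list)
    calculator: Optional[str] = None
    level_of_theory: dict = field(default_factory=dict)
    seed: Optional[int] = None
    parents: list = field(default_factory=list)
    extra: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumed behaviour: reject tags outside the documented set.
        if self.origin not in ORIGINS:
            raise ValueError(f"unknown origin tag: {self.origin!r}")


def origin_cost_rank(origin: str) -> int:
    """Position of a tag in the cheap-to-expensive ordering (0 = cheapest)."""
    return ORIGINS.index(origin)


# A record such as a SMILES builder might produce.
prov = ProvenanceSketch(
    origin="generated",
    source="source:smiles:CCO",
    extra={"smiles": "CCO"},
)
```

Using `default_factory` for the list and dict fields gives each record its own container, so appending a transform to one record cannot leak into another.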
How provenance flows through the pipeline
build_geometry()
  → Structure(provenance=Provenance(
        origin="generated",
        source="builder:crystal:Cu",
        transforms=[],
    ))
apply_transform("supercell")
  → provenance.transforms.append("supercell:(2,2,2)")
apply_transform("strain")
  → provenance.transforms.append("strain:hydro=0.02")
run_sampling()
  → new Structure per frame, provenance.calculator = "emt"
    provenance.parents = [initial_structure.hash]
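The flow above can be traced in a few lines of plain Python. This is a sketch under stated assumptions: `Prov`, `content_hash`, and the function bodies are hypothetical stand-ins, the hash is computed over the provenance record only (a real structure hash would presumably cover the geometry), and carrying the transform list over to sampled frames is assumed rather than documented.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field, replace
from typing import Optional


@dataclass
class Prov:
    """Hypothetical record, reduced to the fields the flow touches."""

    origin: str
    source: str
    transforms: list = field(default_factory=list)
    calculator: Optional[str] = None
    parents: list = field(default_factory=list)


def apply_transform(prov: Prov, label: str) -> Prov:
    """Record a transform by appending its label, preserving order."""
    prov.transforms.append(label)
    return prov


def content_hash(prov: Prov) -> str:
    """Stand-in for a structure hash: a digest of the provenance record."""
    payload = json.dumps(asdict(prov), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


def run_sampling(initial: Prov, calculator: str, n_frames: int) -> list:
    """Create one record per frame, pointing back at the initial structure."""
    parent = content_hash(initial)
    return [
        replace(
            initial,
            transforms=list(initial.transforms),  # copy, do not share the list
            calculator=calculator,
            parents=[parent],
        )
        for _ in range(n_frames)
    ]


initial = Prov(origin="generated", source="builder:crystal:Cu")
apply_transform(initial, "supercell:(2,2,2)")
apply_transform(initial, "strain:hydro=0.02")
frames = run_sampling(initial, calculator="emt", n_frames=3)
```

Each sampled record gets a new `parents` list holding the initial structure's hash, which is what makes lineage traceable across iterations.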
The origin tag and dataset management
The origin tag is the key to separating cheap and expensive data:
from traincraft import Dataset
ds = Dataset("runs/my_run/dataset")
all_frames = ds.frames()
# Only DFT-labeled frames — use for training
dft_frames = ds.filter(origin="dft_labeled")
# Generated + sampled frames — candidates for labeling
unlabeled = ds.filter(origin=["generated", "ml_sampled"])
This separation is critical in an active-learning loop: you never want to mix DFT ground truth with ML-predicted labels in the same training batch.
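To make the filtering semantics concrete, here is a minimal sketch over plain dicts. The helper `filter_by_origin` and the frame layout are hypothetical; the only behaviour taken from the example above is that the origin argument may be a single tag or a list of tags.

```python
def filter_by_origin(frames, origin):
    """Keep frames whose origin tag matches a single tag or any tag in a list."""
    wanted = {origin} if isinstance(origin, str) else set(origin)
    return [frame for frame in frames if frame["origin"] in wanted]


# Toy frames: only the origin tag matters for this illustration.
frames = [
    {"id": 0, "origin": "generated"},
    {"id": 1, "origin": "ml_sampled"},
    {"id": 2, "origin": "ml_labeled"},
    {"id": 3, "origin": "dft_labeled"},
    {"id": 4, "origin": "dft_labeled"},
]

training = filter_by_origin(frames, "dft_labeled")
candidates = filter_by_origin(frames, ["generated", "ml_sampled"])

# The two subsets are disjoint, so ground truth never mixes with candidates.
training_ids = {frame["id"] for frame in training}
candidate_ids = {frame["id"] for frame in candidates}
```

Frames tagged "ml_labeled" fall into neither subset here, which matches the intent of keeping ML-predicted labels out of the training set.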
Provenance in extxyz files
Provenance is stored in the tc_provenance key of each frame's info dict:
2
Lattice="..." Properties=... tc_provenance={"origin": "generated", "source": "builder:crystal:Cu", ...}
Cu 0.000 0.000 0.000
Cu 1.805 1.805 0.000
You can read it with ASE:
import json

from ase.io import read

# read() returns only the last frame by default; pass index=":" for a list of all frames.
frame = read("dataset.extxyz")
prov = json.loads(frame.info["tc_provenance"])
print(prov["source"])      # "builder:crystal:Cu"
print(prov["transforms"])  # ["supercell:(2,2,2)"]
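When scanning a whole file, it can help to tally frames by origin. The sketch below works on plain info dicts so it runs without ASE. The helpers `load_provenance` and `origin_counts` are hypothetical, and accepting an already-parsed dict as well as a JSON string is a defensive assumption, not documented behaviour.

```python
import json
from collections import Counter


def load_provenance(info):
    """Return the provenance dict from a frame's info dict, or None if absent."""
    raw = info.get("tc_provenance")
    if raw is None:
        return None
    # Assumption: the value is usually a JSON string, but may already be a dict.
    return json.loads(raw) if isinstance(raw, str) else dict(raw)


def origin_counts(infos):
    """Count frames per origin tag; frames without provenance count as 'missing'."""
    counts = Counter()
    for info in infos:
        prov = load_provenance(info)
        counts[prov["origin"] if prov else "missing"] += 1
    return counts


# Toy info dicts standing in for frame.info of each frame in a file.
infos = [
    {"tc_provenance": json.dumps({"origin": "generated", "source": "builder:crystal:Cu"})},
    {"tc_provenance": {"origin": "dft_labeled", "source": "builder:crystal:Cu"}},
    {"tc_provenance": json.dumps({"origin": "generated", "source": "source:smiles:CCO"})},
    {},
]
counts = origin_counts(infos)
```

With ASE, the same function could be fed `(frame.info for frame in read("dataset.extxyz", index=":"))`, assuming every frame was written with a tc_provenance key.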
Serialisation
d = prov.to_dict() # → plain Python dict (JSON-serializable)
prov2 = Provenance.from_dict(d) # → reconstruct from dict
Unknown keys in from_dict are silently ignored, so older provenance records
remain readable as the schema evolves.
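One way such a round trip could work is sketched below. This is not TrainCraft's code: `ProvenanceSketch` is a hypothetical, reduced stand-in, and filtering the input against the dataclass fields is just one plausible way to ignore unknown keys.

```python
import json
from dataclasses import asdict, dataclass, field, fields
from typing import Optional


@dataclass
class ProvenanceSketch:
    """Hypothetical reduced record used to illustrate the round trip."""

    origin: str
    source: Optional[str] = None
    transforms: list = field(default_factory=list)
    extra: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        """Convert to a plain dict of built-in types."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "ProvenanceSketch":
        """Rebuild a record, dropping keys the current schema does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{key: value for key, value in data.items() if key in known})


# A stored record carrying a key that this schema no longer defines.
stored = {
    "origin": "generated",
    "source": "builder:crystal:Cu",
    "transforms": ["supercell:(2,2,2)"],
    "legacy_field": 1,
}
prov = ProvenanceSketch.from_dict(stored)
round_tripped = ProvenanceSketch.from_dict(json.loads(json.dumps(prov.to_dict())))
```

Fields missing from a stored record fall back to the dataclass defaults in this sketch, which covers the opposite case of a schema that has gained fields.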