Architecture decisions, component design, trade-offs, and lessons learned. Written to the level of detail where you could rebuild any part from scratch.
The idea for Forge came from a pattern I kept seeing in training data projects: people generate synthetic data, glance at a few examples, and ship it. Nobody checks if the data is actually good. Nobody trains a model on it to prove it works. Nobody looks for benchmark contamination. The result is a pipeline that produces output but offers no evidence that the output is useful.
I wanted to build something different: not just a data generator, but a system that evaluates its own output and closes the loop by fine-tuning a model and measuring the result.
Forge is a Python library (`src/training_data_robo/`) with thin CLI wrappers in `scripts/`. The library handles all the logic; the scripts handle argument parsing and call into the library. This separation matters for testability: you can unit test the library code without touching CLI arguments or file I/O.
The pipeline orchestrator (`pipeline.py`) wires everything together as a directed acyclic graph (DAG). Each step declares its dependencies, and the runner resolves execution order with a topological sort. Steps cache their outputs, so if you re-run after a failure, only the failed and downstream steps execute.
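The step graph and cache check can be sketched in a few lines. This is a simplified model, not Forge's actual implementation; `Step`, `topo_order`, and `needs_run` are illustrative names:

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Step:
    name: str
    deps: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

def topo_order(steps):
    """Resolve execution order with a depth-first topological sort."""
    by_name = {s.name: s for s in steps}
    order, seen = [], set()

    def visit(name, stack=()):
        if name in stack:
            raise ValueError(f"dependency cycle through {name}")
        if name in seen:
            return
        for dep in by_name[name].deps:
            visit(dep, stack + (name,))
        seen.add(name)
        order.append(by_name[name])

    for step in steps:
        visit(step.name)
    return order

def needs_run(step):
    """A step is cached only when every declared output file already exists."""
    return not step.outputs or not all(Path(p).exists() for p in step.outputs)
```

Running `topo_order` once and filtering by `needs_run` gives most of the re-run behavior: a failed step has missing outputs, so it executes again; a fuller runner would also invalidate everything downstream of it.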
The unified loader (`sources/unified.py`) accepts a directory path and dispatches to format-specific readers for text files, PDFs, and web pages. Each reader returns a list of `Document` objects with a `text` field and a `metadata` dict containing the source path and format. This abstraction lets you add new formats without touching the pipeline.
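A minimal sketch of that dispatch pattern, with illustrative reader names (the real loader also registers PDF and web readers):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Document:
    text: str
    metadata: dict

def read_text(path):
    # One reader per format; each returns a list of Document objects.
    return [Document(Path(path).read_text(), {"source": str(path), "format": "text"})]

# Adding a format means adding one entry here; the pipeline never changes.
READERS = {".txt": read_text, ".md": read_text}

def load_directory(root):
    docs = []
    for path in sorted(Path(root).rglob("*")):
        reader = READERS.get(path.suffix.lower())
        if reader:
            docs.extend(reader(path))
    return docs
```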
Naive chunking splits text at fixed character boundaries or paragraph breaks. This produces chunks that start mid-sentence or split a numbered list in half. Forge's chunker (`chunking.py`) does something better: it scans for markdown headers, list markers, and horizontal rules, and uses those as preferred split points. Each chunk carries metadata about its structure: `section_title`, `section_level`, and `chunk_type` (prose, list, table, or mixed). This metadata feeds into the task selector downstream.
The chunker takes two parameters: `max_chars` (the target chunk size, default 900) and `overlap` (the number of characters to repeat between consecutive chunks, default 100). The overlap ensures that content near chunk boundaries is not lost. I chose 900 characters as the default because it produces chunks of roughly 200-250 tokens, a comfortable input length for generation without overwhelming the context.
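A sketch of the structure-aware strategy, under the simplifying assumption that split decisions happen line by line (the real chunker also records `section_title` and friends):

```python
import re

# Preferred split points: markdown headers, list markers, horizontal rules.
SPLIT_POINT = re.compile(r"^(#{1,6}\s|[-*+]\s|\d+\.\s|---\s*$)")

def chunk(text, max_chars=900, overlap=100):
    """Greedy chunker that prefers to break at structural boundaries."""
    chunks, buf = [], ""
    for line in text.splitlines(keepends=True):
        over_budget = len(buf) + len(line) > max_chars
        # Take a structural marker as a split point once the buffer is half full.
        at_marker = SPLIT_POINT.match(line) and len(buf) > max_chars // 2
        if buf and (over_budget or at_marker):
            chunks.append(buf)
            buf = buf[-overlap:] if overlap else ""  # repeat the boundary region
        buf += line
    if buf.strip():
        chunks.append(buf)
    return chunks
```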
Not every chunk is suitable for every task type. A 50-word chunk is too short for a meaningful summary but fine for a title generation task. A chunk containing a bulleted list maps naturally to key-point extraction but poorly to chain-of-thought reasoning.
The task selector (`task_selector.py`) maps chunk metadata to task types.
Prose chunks with at least 150 characters get all four task types (QA, summary,
instruction, chain-of-thought). List chunks get key-point extraction and classification.
Short chunks (under 150 characters) get only title generation. This produces a more
diverse and naturally balanced dataset than assigning all task types to all chunks.
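The mapping is small enough to write down directly. A sketch, with assumed metadata field names (`length`, `chunk_type`) and task labels:

```python
def select_tasks(chunk_meta):
    """Map chunk metadata to the task types it can support."""
    length = chunk_meta.get("length", 0)
    kind = chunk_meta.get("chunk_type", "prose")
    if length < 150:
        return ["title"]                     # too short for anything richer
    if kind == "list":
        return ["key_points", "classification"]
    return ["qa", "summary", "instruction", "chain_of_thought"]
```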
Forge generates four types of training examples, each using a different prompt template
defined in `cli.py`:
| Task | What it produces | Why it matters |
|---|---|---|
| Q&A | A question about the passage and its answer | Tests factual comprehension, the most common training data format |
| Summary | A condensed version of the passage | Teaches the model to distill information, useful for many downstream tasks |
| Instruction | A realistic user request and the ideal response | Trains instruction-following behavior, closer to real usage patterns |
| Chain-of-thought | A reasoning question with step-by-step solution | Develops multi-step reasoning, the hardest skill to train |
The generation model (GPT-4.1-mini by default) receives the chunk text and a structured
prompt template. The template specifies the expected output format so that the response
can be parsed reliably. For example, the instruction template asks for output in the format `INSTRUCTION: ... RESPONSE: ...`.
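Parsing that format reduces to a single anchored regex. A sketch (`parse_instruction_output` is an illustrative name, not Forge's API):

```python
import re

def parse_instruction_output(raw):
    """Split 'INSTRUCTION: ... RESPONSE: ...' into its two parts."""
    m = re.search(r"INSTRUCTION:\s*(.*?)\s*RESPONSE:\s*(.*)", raw, re.DOTALL)
    if m is None:
        return None  # malformed output; the caller can retry or drop the example
    return {"instruction": m.group(1).strip(), "response": m.group(2).strip()}
```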
The `ai_client.py` module defines two implementations of the same interface: `OpenAILLMClient` for real API calls and `DummyLLMClient` for testing. The dummy client returns deterministic, syntactically valid responses that pass downstream processing without hitting any API. This lets the entire pipeline run in "dry-run" mode for development and CI.
The bot (`bot.py`) automatically selects the client based on whether `OPENAI_API_KEY` is set, or you can pass `--fake-model` to force the dummy client.
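The pattern is easy to reproduce. A sketch with an assumed single-method interface (`complete`); the real clients may expose more:

```python
import hashlib
import os
from typing import Protocol

class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...

class DummyLLMClient:
    """Deterministic stand-in: the same prompt always yields the same
    syntactically valid output, so downstream parsing works offline."""
    def complete(self, prompt: str) -> str:
        tag = hashlib.sha256(prompt.encode()).hexdigest()[:8]
        return f"INSTRUCTION: dummy task {tag} RESPONSE: dummy answer {tag}"

class OpenAILLMClient:
    """Placeholder for the API-backed client."""
    def complete(self, prompt: str) -> str:
        raise RuntimeError("requires OPENAI_API_KEY and network access")

def pick_client(fake_model=False):
    if fake_model or not os.environ.get("OPENAI_API_KEY"):
        return DummyLLMClient()
    return OpenAILLMClient()
```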
Quality evaluation happens in multiple layers. Each layer catches different types of problems, and they are ordered from cheapest to most expensive.
The first pass (`postprocess_quality.py` calling into `quality.py`) runs zero-cost checks on every example:

- `short_output` (score -0.2).
- `repetitive_output` (score -0.2). This catches degenerate outputs like "the the the the...".
- `weak_grounding` (score -0.2). This catches hallucinated answers that ignore the source material.

Each example gets a quality score between 0.0 and 1.0, and a list of flags describing what (if anything) went wrong. These heuristics are fast and free. They catch the obvious problems before the expensive LLM judge runs.
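A sketch of how such flags compose into a score; the thresholds here (40 characters, 30% token repetition, 20% vocabulary overlap) are illustrative, not Forge's actual values:

```python
def heuristic_score(output, source_text, min_chars=40,
                    max_repeat=0.3, min_overlap=0.2):
    """Start at 1.0 and subtract 0.2 per triggered flag."""
    flags, score = [], 1.0
    tokens = output.lower().split()
    if len(output) < min_chars:
        flags.append("short_output"); score -= 0.2
    # Degenerate repetition: one token dominating the output.
    if tokens and max(tokens.count(t) for t in set(tokens)) / len(tokens) > max_repeat:
        flags.append("repetitive_output"); score -= 0.2
    # Weak grounding: output vocabulary barely overlaps the source.
    src_vocab, out_vocab = set(source_text.lower().split()), set(tokens)
    if out_vocab and len(src_vocab & out_vocab) / len(out_vocab) < min_overlap:
        flags.append("weak_grounding"); score -= 0.2
    return max(score, 0.0), flags
```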
The second pass (`judge.py`) sends each example to GPT-4.1-mini for rubric-based evaluation, scoring four dimensions on a 1 to 5 scale.
For each dimension, the judge returns both a score and a one-sentence explanation. The explanations are stored in the output JSONL so you can inspect why any example got a low score.
Why GPT-4.1-mini instead of GPT-4? Cost. Judging 200 examples across 4 dimensions is 800 LLM calls. With GPT-4.1-mini this costs roughly $1-2. With GPT-4 it would be $15-20 for the same task. The quality of judgments from 4.1-mini is good enough for scoring. Where you need the strongest model is generation, not evaluation.
The judge processes examples in batches of 10 with sequential API calls. I initially considered async concurrency, but the rate limits on the API made sequential processing more predictable and easier to debug. At 10 examples per batch with 4 dimensions each, the 200-example run took about 17 minutes.
Duplicate examples bias the model toward certain patterns and waste the training budget. Forge supports two deduplication methods:
- Hash-based (`compute_dedupe.py`): converts each output to a bag-of-words representation, sorts the tokens, and hashes the result. Identical hashes mean identical content (up to word order). This is fast and deterministic.

The default is hash-based because it is free and catches exact duplicates. In the demo run, zero duplicates were found because the generation model produces sufficiently diverse outputs from different chunks.
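The bag-of-words hash takes only a few lines. A sketch (function names are illustrative):

```python
import hashlib

def content_hash(text):
    """Normalize, sort tokens, hash: word order is deliberately ignored."""
    tokens = sorted(text.lower().split())
    return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

def dedupe(examples):
    """Keep the first example for each distinct content hash."""
    seen, kept = set(), []
    for ex in examples:
        h = content_hash(ex)
        if h not in seen:
            seen.add(h)
            kept.append(ex)
    return kept
```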
Benchmark contamination is when training data overlaps with evaluation benchmarks. If you train on MMLU questions and then evaluate on MMLU, the results are meaningless. This problem is surprisingly common in practice and rarely checked for.
The contamination detector (`contamination.py`) works by n-gram matching.
It downloads benchmark datasets (MMLU, ARC, HellaSwag) on first run and caches them
locally. It builds indexes of 8-grams and 13-grams from the benchmark text, then
checks every training example for matching n-grams.
Why n-grams instead of embedding similarity? Precision. N-gram matching produces zero false positives: if an 8-gram from your training data appears verbatim in MMLU, that is a real overlap. Embedding similarity would flag semantically similar but textually different content, leading to over-flagging. The trade-off is that n-gram matching does not catch paraphrased contamination, but for the purpose of building a training data pipeline, the precision of exact matching is more useful than the recall of fuzzy matching.
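The core of the detector is set intersection over token n-grams. A sketch using a single n (the real detector indexes both 8-grams and 13-grams):

```python
def ngrams(text, n):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_index(benchmark_texts, n=8):
    """Union of all n-grams across the benchmark corpus."""
    index = set()
    for t in benchmark_texts:
        index |= ngrams(t, n)
    return index

def is_contaminated(example_text, index, n=8):
    """Flag the example if any of its n-grams appears verbatim in a benchmark."""
    return bool(ngrams(example_text, n) & index)
```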
The difficulty module (`difficulty.py`) tags each example as easy, medium, or hard based on a heuristic scoring function.
This is a heuristic, not a ground truth. It exists to enable curriculum-based training strategies and to give a rough breakdown of dataset composition. In the demo run, the distribution was 1% easy, 38% medium, 61% hard, which makes sense given that the source material was technical ML content.
The selector (`selector.py`) implements four strategies for choosing which examples to include in the final training set.
The train/test split (`split_dataset.py`) uses stratified sampling to
ensure the same distribution of task types in both sets. The default is 80% train /
20% test. The test set is never used for training, selection, or any other pipeline
step except the final benchmark.
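Stratified splitting is straightforward to sketch, assuming each example is a dict with a task-type field (the field name here is illustrative):

```python
import random
from collections import defaultdict

def stratified_split(examples, key="task", test_frac=0.2, seed=0):
    """Split so every task type keeps the same train/test proportions."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex[key]].append(ex)
    train, test = [], []
    for group in by_task.values():
        rng.shuffle(group)
        cut = int(len(group) * test_frac)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test
```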
The goal of the fine-tuning step is not to produce a production model. It is to prove that the generated training data causes measurable improvement on a real model. For this purpose, a small model (Qwen 2.5 0.5B, 494 million parameters) fine-tuned with LoRA is ideal.
I chose MLX (Apple's machine learning framework) because it runs natively on Apple Silicon with unified memory, eliminating the need for a GPU server. The `finetune_mlx.py` script handles the full workflow:

1. Converts each example to chat format: `{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}`
2. Writes `train.jsonl`, `valid.jsonl`, and `test.jsonl`
3. Runs `mlx_lm.lora` with the configured hyperparameters
The default configuration uses rank 8, 16 layers, batch size 4, learning rate 1e-4,
and 3 epochs. The number of iterations is calculated as `epochs * (train_examples / batch_size)`.
Checkpoints are saved every 40 iterations.
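With the demo's numbers (200 examples split 80/20, so 160 training examples), that formula lands exactly on 120 iterations, assuming integer division:

```python
def num_iterations(epochs, train_examples, batch_size):
    # iterations = epochs * (train_examples / batch_size), floored
    return epochs * (train_examples // batch_size)
```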
In the demo run, training loss dropped from 1.9 to 0.14 over 120 iterations. Validation loss went from 2.3 to 0.73. The gap between training and validation loss (0.14 vs 0.73) suggests some overfitting, which is expected with a small dataset and 3 epochs. For a proof-of-concept, this is fine. In production, you would use more data, fewer epochs, or stronger regularization.
The benchmark script (`benchmark.py`) computes three standard metrics.
Exact match is also computed but was 0% for both models, which is expected. Open-ended generation tasks rarely produce exact matches even when the content is correct.
Showing that ROUGE-L went from 0.289 to 0.417 is not enough. You need to know if the difference is statistically significant. Forge uses a paired bootstrap test:
A p-value below 0.05 means the improvement is statistically significant. In the demo run, p = 0.0 (none of the 1000 shuffled deltas exceeded the observed one), indicating the improvement is real with very high confidence.
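A paired bootstrap can be sketched in a few lines: resample the per-example score deltas with replacement and count how often the improvement vanishes (this is one common formulation; the details of Forge's exact procedure may differ):

```python
import random

def paired_bootstrap_p(base_scores, tuned_scores, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples where the tuned model fails to win."""
    rng = random.Random(seed)
    deltas = [t - b for b, t in zip(base_scores, tuned_scores)]
    losses = 0
    for _ in range(n_resamples):
        sample = [rng.choice(deltas) for _ in deltas]
        if sum(sample) <= 0:
            losses += 1
    return losses / n_resamples
```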
The pipeline runner (`pipeline.py`) models the pipeline as a directed acyclic graph. Each step is a `PipelineStep` with a name, command, list of dependencies, and expected output files. The runner resolves execution order with a topological sort, skips steps whose declared outputs already exist, and re-runs failed steps and their dependents.
The orchestrator script (`run_forge.py`) defines the step graph and passes
it to the runner. If you re-run after a mid-pipeline failure, only the failed step and
its dependents re-execute. The cache check is based on the existence of output files,
not content hashing. This is a simplification: it means that if you change the
generation prompt but not the output path, the cache will serve stale results. For a
portfolio project, this trade-off is acceptable. For production, you would hash the
step configuration as part of the cache key.
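Hashing the configuration into the cache key is a small change. A sketch of what that could look like (`cache_key` is an illustrative name):

```python
import hashlib
import json

def cache_key(step_name, config):
    """Any change to the step's config (e.g. a prompt edit) invalidates the cache."""
    blob = json.dumps(config, sort_keys=True)
    return f"{step_name}-{hashlib.sha256(blob.encode()).hexdigest()[:12]}"
```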
The tracker (`tracker.py`) is a lightweight alternative to MLflow or W&B. Each run creates a directory under `runs/` with:

- `config.json`: full run configuration (model, hyperparameters, paths)
- `pipeline_log.json`: step-by-step execution log with timings
- `benchmark.json`: before/after comparison with significance test
The Streamlit dashboard (`app.py`) reads these directories and renders
comparison views. You can select any two runs and compare their metrics side by side.
The test suite has 184 tests across 24 files. The approach varies by component:
- LLM-dependent code is tested with `DummyLLMClient` or `unittest.mock.patch` to avoid real API calls.

Coverage is 72% overall. The gaps are in the source loaders (PDF, web) and the CLI module, which are harder to unit test and less critical than the core pipeline logic. The core modules (io, quality, difficulty, diversity, selector, pipeline, tracker, task_selector, chunking) are all above 80%.
| Component | Cost | Notes |
|---|---|---|
| Generation (200 examples, 4 tasks) | ~$3-5 | GPT-4.1-mini, 162 chunks |
| LLM-as-Judge (200 x 4 dimensions) | ~$1-2 | GPT-4.1-mini, 800 calls |
| Heuristic quality, dedup, difficulty | $0 | No API calls |
| MLX fine-tuning | $0 | Local Apple Silicon |
| Benchmark inference | $0 | Local MLX |
| Total | ~$5 | |
The pipeline is designed to be cheap. The most expensive step is generation ($3-5), followed by judging ($1-2). Everything else is local computation. This makes it feasible to iterate quickly: change a prompt template, re-run, compare results.
For a plain-English explanation, read the overview. For live results and sample data, see the demo page. Source code is on GitHub.