How Forge Is Built
Architecture decisions, component design, trade-offs, and lessons learned. Written to the level of detail where you could rebuild any part from scratch.
Motivation and Design Goals
The idea for Forge came from a pattern I kept seeing in training data projects: people generate synthetic data, glance at a few examples, and ship it. Nobody checks if the data is actually good. Nobody trains a model on it to prove it works. Nobody looks for benchmark contamination. The result is a pipeline that produces output but offers no evidence that the output is useful.
I wanted to build something different. Not just a data generator, but a system that evaluates its own output and closes the loop by fine-tuning a model and measuring the result. The design goals were:
- Multi-strategy generation (not just Q&A)
- Automated quality evaluation with rubric-based scoring
- Benchmark contamination detection
- Local model fine-tuning (zero cloud GPU cost)
- Statistical significance testing on benchmark results
- Full reproducibility with experiment tracking
System Architecture
Forge is a Python library (src/training_data_robo/) with thin CLI
wrappers in scripts/. The library handles all the logic; the scripts
handle argument parsing and call into the library. This separation matters for
testability. You can unit test the library code without touching CLI arguments or
file I/O.
The pipeline orchestrator (pipeline.py) wires everything together as a
directed acyclic graph (DAG). Each step declares its dependencies, and the runner
resolves execution order with topological sort. Steps cache their outputs, so if you
re-run after a failure, only the failed and downstream steps execute.
Document Processing
Source loading
The unified loader (sources/unified.py) accepts a directory path and
dispatches to format-specific readers for text files, PDFs, and web pages. Each reader
returns a list of Document objects with a text field and a
metadata dict containing the source path and format. This abstraction
lets you add new formats without touching the pipeline.
Structure-aware chunking
Naive chunking splits text at fixed character boundaries or paragraph breaks. This
produces chunks that start mid-sentence or split a numbered list in half. Forge's
chunker (chunking.py) does something better: it scans for markdown
headers, list markers, and horizontal rules, and uses those as preferred split points.
Each chunk carries metadata about its structure: section_title,
section_level, and chunk_type (prose, list, table, or mixed).
This metadata feeds into the task selector downstream.
The chunker takes two parameters: max_chars (the target chunk size,
default 900) and overlap (the number of characters to repeat between
consecutive chunks, default 100). The overlap ensures that content near chunk boundaries
is not lost. I chose 900 characters as the default because it produces chunks of roughly
200-250 tokens, which is a comfortable input length for generation without overwhelming
the context.
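The preferred-split-point idea can be sketched in a few lines. This is an illustrative reimplementation, not the actual chunking.py code; the regex and the fallback behavior are assumptions.

```python
import re

# Structural markers treated as preferred split points: markdown headers,
# list markers, numbered items, and horizontal rules (an assumption about
# what chunking.py scans for).
STRUCTURE = re.compile(r"^(#{1,6} |[-*] |\d+\. |---)", re.MULTILINE)

def chunk(text: str, max_chars: int = 900, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer the last structural marker inside the window over
            # a hard cut mid-sentence.
            window = text[start:end]
            marks = [m.start() for m in STRUCTURE.finditer(window)]
            if marks and marks[-1] > 0:
                end = start + marks[-1]
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by `overlap` characters so boundary content repeats.
        start = max(end - overlap, start + 1)
    return chunks
```

The fallback matters: a long run of prose with no markers still gets split at the hard limit, so no chunk ever exceeds max_chars.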
Adaptive task selection
Not every chunk is suitable for every task type. A 50-word chunk is too short for a meaningful summary but fine for a title generation task. A chunk containing a bulleted list maps naturally to key-point extraction but poorly to chain-of-thought reasoning.
The task selector (task_selector.py) maps chunk metadata to task types.
Prose chunks with at least 150 characters get all four task types (QA, summary,
instruction, chain-of-thought). List chunks get key-point extraction and classification.
Short chunks (under 150 characters) get only title generation. This produces a more
diverse and naturally balanced dataset than assigning all task types to all chunks.
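The mapping rules above amount to a small decision function. A minimal sketch, with task names and the handling of mixed/table chunks assumed (the real task_selector.py may differ):

```python
def select_tasks(chunk_type: str, char_count: int) -> list[str]:
    """Map chunk metadata to task types, per the rules described above."""
    if char_count < 150:
        return ["title"]                      # too short for anything else
    if chunk_type == "list":
        return ["keypoints", "classification"]
    # Prose (and, by assumption, mixed/table) chunks get all four types.
    return ["qa", "summary", "instruction", "cot"]
```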
Multi-Strategy Generation
Forge generates four types of training examples, each using a different prompt template
defined in cli.py:
| Task | What it produces | Why it matters |
|---|---|---|
| Q&A | A question about the passage and its answer | Tests factual comprehension, the most common training data format |
| Summary | A condensed version of the passage | Teaches the model to distill information, useful for many downstream tasks |
| Instruction | A realistic user request and the ideal response | Trains instruction-following behavior, closer to real usage patterns |
| Chain-of-thought | A reasoning question with step-by-step solution | Develops multi-step reasoning, the hardest skill to train |
The generation model (GPT-4.1-mini by default) receives the chunk text and a structured
prompt template. The template specifies the expected output format so that the response
can be parsed reliably. For example, the instruction template asks for output in the
format INSTRUCTION: ... RESPONSE: ....
LLM client abstraction
The ai_client.py module defines two implementations of the same interface:
OpenAILLMClient for real API calls and DummyLLMClient for
testing. The dummy client returns deterministic, syntactically valid responses that
pass downstream processing without hitting any API. This lets the entire pipeline run
in "dry-run" mode for development and CI.
The bot (bot.py) automatically selects the client based on whether
OPENAI_API_KEY is set, or you can pass --fake-model to
force the dummy client.
Quality Assurance Pipeline
Quality evaluation happens in multiple layers. Each layer catches different types of problems, and they are ordered from cheapest to most expensive.
Layer 1: Heuristic quality scoring
The first pass (postprocess_quality.py calling into quality.py)
runs zero-cost checks on every example:
- Empty output detection. Penalizes blank or whitespace-only outputs (score -0.5).
- Length validation. Each task type has a minimum output length. Summaries require at least 80 characters, Q&A at least 10. Outputs below the threshold get a short_output flag (score -0.2).
- Refusal detection. Matches against a list of common refusal phrases ("as an AI language model", "I cannot", "I'm unable to"). Refusals indicate the generation model refused the task, producing useless training data (score -0.3).
- Repetition detection. Counts token frequencies. If any single token accounts for more than 50% of a 20+ token output, it is flagged as repetitive_output (score -0.2). This catches degenerate outputs like "the the the the...".
- Grounding check. For Q&A examples with both an answer and a context, it measures token overlap between the answer and context. If fewer than 20% of answer tokens appear in the context, it is flagged as weak_grounding (score -0.2). This catches hallucinated answers that ignore the source material.
Each example gets a quality score between 0.0 and 1.0, and a list of flags describing what (if anything) went wrong. These heuristics are fast and free. They catch the obvious problems before the expensive LLM judge runs.
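The checks compose into a single scoring function. This sketch uses the thresholds and penalties from the text, but the function shape and flag names are assumptions about quality.py:

```python
REFUSALS = ("as an ai language model", "i cannot", "i'm unable to")
MIN_LEN = {"summary": 80, "qa": 10}  # per-task minimum output lengths

def heuristic_score(task: str, output: str, context: str = "") -> tuple[float, list[str]]:
    score, flags = 1.0, []
    if not output.strip():
        return max(0.0, score - 0.5), ["empty_output"]
    if len(output) < MIN_LEN.get(task, 0):
        score -= 0.2; flags.append("short_output")
    if any(p in output.lower() for p in REFUSALS):
        score -= 0.3; flags.append("refusal")
    toks = output.lower().split()
    # Repetition: one token dominating a 20+ token output.
    if len(toks) >= 20 and max(toks.count(t) for t in set(toks)) / len(toks) > 0.5:
        score -= 0.2; flags.append("repetitive_output")
    # Grounding: answer tokens should overlap the source context.
    if task == "qa" and context:
        ctx = set(context.lower().split())
        overlap = sum(t in ctx for t in toks) / max(len(toks), 1)
        if overlap < 0.2:
            score -= 0.2; flags.append("weak_grounding")
    return max(0.0, score), flags
```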
Layer 2: LLM-as-Judge
The second pass (judge.py) sends each example to GPT-4.1-mini for rubric-based
evaluation. The judge scores four dimensions on a 1 to 5 scale:
- Faithfulness (1-5): Does the output accurately reflect the source? No hallucination?
- Helpfulness (1-5): Would a person find this useful?
- Complexity (1-5): Is the example substantive and non-trivial?
- Coherence (1-5): Is the output well-structured, clear, logical?
For each dimension, the judge returns both a score and a one-sentence explanation. The explanations are stored in the output JSONL so you can inspect why any example got a low score.
Why GPT-4.1-mini instead of GPT-4? Cost. Judging 200 examples across 4 dimensions is 800 LLM calls. With GPT-4.1-mini this costs roughly $1-2. With GPT-4 it would be $15-20 for the same task. The quality of judgments from 4.1-mini is good enough for scoring. Where you need the strongest model is generation, not evaluation.
The judge processes examples in batches of 10 with sequential API calls. I initially considered async concurrency, but the rate limits on the API made sequential processing more predictable and easier to debug. At 10 examples per batch with 4 dimensions each, the 200-example run took about 17 minutes.
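Parsing the judge's structured response is the fiddly part. The line format below ("dimension: score - explanation") is an assumption for illustration; judge.py's actual prompt and response format may differ:

```python
import re

DIMENSIONS = ("faithfulness", "helpfulness", "complexity", "coherence")

def parse_judgment(text: str) -> dict[str, tuple[int, str]]:
    """Extract (score, explanation) pairs from lines like
    'faithfulness: 4 - accurately grounded in the source'."""
    scores = {}
    for dim in DIMENSIONS:
        m = re.search(rf"{dim}\s*:\s*([1-5])\s*-\s*(.+)", text, re.IGNORECASE)
        if m:
            scores[dim] = (int(m.group(1)), m.group(2).strip())
    return scores
```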
Layer 3: Deduplication
Duplicate examples bias the model toward certain patterns and waste the training budget. Forge supports two deduplication methods:
- Hash-based (compute_dedupe.py): Converts each output to a bag-of-words representation, sorts the tokens, and hashes the result. Identical hashes mean identical content (up to word order). This is fast and deterministic.
- Embedding-based: Uses text-embedding-3-small to compute vector representations and removes examples whose cosine similarity exceeds a threshold. This catches paraphrased duplicates that hash-based methods miss, at the cost of an API call.
The default is hash-based because it is free and catches exact duplicates. In the demo run, zero duplicates were found because the generation model produces sufficiently diverse outputs from different chunks.
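The hash-based method fits in a dozen lines. A sketch of the idea (the actual compute_dedupe.py may normalize tokens differently):

```python
import hashlib

def bow_hash(text: str) -> str:
    """Bag-of-words hash: lowercase, sort tokens, hash the result.
    Word order is erased, so 'a b c' and 'c b a' collide by design."""
    bag = " ".join(sorted(text.lower().split()))
    return hashlib.sha256(bag.encode()).hexdigest()

def dedupe(examples: list[str]) -> list[str]:
    """Keep the first example for each distinct bag-of-words hash."""
    seen, kept = set(), []
    for ex in examples:
        h = bow_hash(ex)
        if h not in seen:
            seen.add(h)
            kept.append(ex)
    return kept
```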
Layer 4: Contamination detection
Benchmark contamination is when training data overlaps with evaluation benchmarks. If you train on MMLU questions and then evaluate on MMLU, the results are meaningless. This problem is surprisingly common in practice and rarely checked for.
The contamination detector (contamination.py) works by n-gram matching.
It downloads benchmark datasets (MMLU, ARC, HellaSwag) on first run and caches them
locally. It builds indexes of 8-grams and 13-grams from the benchmark text, then
checks every training example for matching n-grams.
Why n-grams instead of embedding similarity? Precision. N-gram matching produces zero false positives: if an 8-gram from your training data appears verbatim in MMLU, that is a real overlap. Embedding similarity would flag semantically similar but textually different content, leading to over-flagging. The trade-off is that n-gram matching does not catch paraphrased contamination, but for the purpose of building a training data pipeline, the precision of exact matching is more useful than the recall of fuzzy matching.
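The core of the detector is set intersection over n-gram indexes. A minimal sketch assuming whitespace tokenization (contamination.py also indexes 13-grams; only the 8-gram case is shown):

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All n-token windows of the text, as hashable tuples."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_index(benchmark_texts: list[str], n: int = 8) -> set:
    """Union of n-grams across all benchmark passages."""
    index = set()
    for t in benchmark_texts:
        index |= ngrams(t, n)
    return index

def is_contaminated(example: str, index: set, n: int = 8) -> bool:
    """True if any n-gram of the example appears verbatim in a benchmark."""
    return bool(ngrams(example, n) & index)
```

Because membership is exact, a hit is always a real verbatim overlap, which is the precision property the text argues for.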
Difficulty calibration
The difficulty module (difficulty.py) tags each example as easy, medium,
or hard based on a scoring function. The scoring considers:
- Output length (longer outputs tend to be harder)
- Vocabulary complexity (type-token ratio, average word length)
- Task type (chain-of-thought is harder than title generation)
- Reasoning indicators (presence of "step", "because", "therefore")
This is a heuristic, not a ground truth. It exists to enable curriculum-based training strategies and to give a rough breakdown of dataset composition. In the demo run, the distribution was 1% easy, 38% medium, 61% hard, which makes sense given that the source material was technical ML content.
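The signals above can be combined into a single score with cutoffs for the three buckets. The weights and thresholds below are invented for illustration; difficulty.py's actual values are not documented here:

```python
TASK_WEIGHT = {"title": 0.0, "qa": 0.3, "summary": 0.4, "cot": 0.8}
REASONING = ("step", "because", "therefore")

def difficulty(task: str, output: str) -> str:
    toks = output.lower().split()
    score = TASK_WEIGHT.get(task, 0.3)           # task type baseline
    score += min(len(toks) / 400, 0.5)           # longer tends to be harder
    if toks:
        score += 0.3 * (len(set(toks)) / len(toks))  # type-token ratio
    score += 0.2 * sum(w in output.lower() for w in REASONING)
    if score < 0.5:
        return "easy"
    return "medium" if score < 1.0 else "hard"
```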
Data Selection and Splitting
The selector (selector.py) implements four strategies for choosing which
examples to include in the final training set:
- Quality-weighted: Ranks by judge score and selects the top N. Simple and effective.
- Diverse: Greedily selects examples that maximize vocabulary diversity. Avoids redundancy.
- Balanced: Ensures equal representation across task types. Prevents the model from over-fitting to any single task.
- Curriculum: Selects a mix of difficulty levels, biased toward medium difficulty with smaller fractions of easy and hard. Based on the intuition that training is most efficient on examples that are challenging but not impossibly hard.
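As one concrete example, the balanced strategy can be sketched as a round-robin over task-type buckets, taking the highest-judged examples first within each bucket. Field names ("task", "judge_score") are assumptions about the JSONL schema, and selector.py may break ties differently:

```python
from collections import defaultdict

def select_balanced(examples: list[dict], n: int) -> list[dict]:
    """Pick n examples with (near-)equal representation per task type."""
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["task"]].append(ex)
    # Within each task bucket, prefer the highest judge scores.
    for bucket in by_task.values():
        bucket.sort(key=lambda e: e["judge_score"], reverse=True)
    picked = []
    while len(picked) < n and any(by_task.values()):
        for task in sorted(by_task):  # deterministic round-robin order
            if by_task[task] and len(picked) < n:
                picked.append(by_task[task].pop(0))
    return picked
```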
The train/test split (split_dataset.py) uses stratified sampling to
ensure the same distribution of task types in both sets. The default is 80% train /
20% test. The test set is never used for training, selection, or any other pipeline
step except the final benchmark.
Fine-Tuning
Why LoRA on a small model
The goal of the fine-tuning step is not to produce a production model. It is to prove that the generated training data causes measurable improvement on a real model. For this purpose, a small model (Qwen 2.5 0.5B, 494 million parameters) fine-tuned with LoRA is ideal:
- It trains in 5 minutes on a laptop with no cloud GPU
- LoRA trains only 0.59% of parameters (2.9M out of 494M), so the memory footprint is manageable (9 GB peak)
- The model is small enough that 200 training examples can produce visible improvement, whereas a 7B+ model would need thousands
- If the data improves a small model, it will also improve a larger one. The proof-of-concept transfers.
MLX on Apple Silicon
I chose MLX (Apple's machine learning framework) because it runs natively on Apple
Silicon with unified memory, eliminating the need for a GPU server. The
finetune_mlx.py script handles the full workflow:
- Load the training JSONL and convert each example to chat format: {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
- Write the formatted data to a directory with train.jsonl, valid.jsonl, and test.jsonl
- Invoke mlx_lm.lora with the configured hyperparameters
- Save the adapter weights and training metrics
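The conversion step amounts to wrapping each record in the chat schema MLX expects. The input field names ("input", "output") are assumptions about Forge's JSONL layout:

```python
import json

def to_chat(example: dict) -> dict:
    """Wrap a (input, output) pair in the messages format shown above."""
    return {"messages": [
        {"role": "user", "content": example["input"]},
        {"role": "assistant", "content": example["output"]},
    ]}

def write_split(examples: list[dict], path: str) -> None:
    """Write one chat-formatted record per line (JSONL)."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(to_chat(ex)) + "\n")
```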
The default configuration uses rank 8, 16 layers, batch size 4, learning rate 1e-4,
and 3 epochs. The number of iterations is calculated as epochs * (train_examples / batch_size).
Checkpoints are saved every 40 iterations.
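Plugging in the demo numbers (160 training examples after the 80/20 split, 3 epochs, batch size 4) confirms the iteration count seen in training:

```python
epochs, train_examples, batch_size = 3, 160, 4
iterations = epochs * (train_examples // batch_size)  # 3 * 40 = 120
```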
Training dynamics
In the demo run, training loss dropped from 1.9 to 0.14 over 120 iterations. Validation loss went from 2.3 to 0.73. The gap between training and validation loss (0.14 vs 0.73) suggests some overfitting, which is expected with a small dataset and 3 epochs. For a proof-of-concept, this is fine. In production, you would use more data, fewer epochs, or stronger regularization.
Benchmarking
Metrics
The benchmark script (benchmark.py) computes three standard metrics:
- ROUGE-1: Unigram overlap between generated text and reference. Measures whether the model uses the right words.
- ROUGE-2: Bigram overlap. Measures whether the model produces the right phrases.
- ROUGE-L: Longest common subsequence. Measures structural similarity independent of word order.
Exact match is also computed but was 0% for both models, which is expected. Open-ended generation tasks rarely produce exact matches even when the content is correct.
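ROUGE-L is the least obvious of the three, so here is a standard F1 formulation over the longest common subsequence. This is the textbook definition, not necessarily the exact implementation in benchmark.py:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via a rolling 1-D DP table."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)
```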
Paired bootstrap significance test
Showing that ROUGE-L went from 0.289 to 0.417 is not enough. You need to know if the difference is statistically significant. Forge uses a paired bootstrap test:
- Compute per-example ROUGE-L scores for both models
- Calculate the observed delta (mean difference)
- Randomly flip the sign of each per-example delta 1000 times, simulating the null hypothesis that there is no real difference
- Count how often the mean of the flipped deltas exceeds the observed delta
- The fraction is the p-value
A p-value below 0.05 means the improvement is statistically significant. In the demo run, p = 0.0 (none of the 1000 shuffled deltas exceeded the observed one), indicating the improvement is real with very high confidence.
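The recipe above is a few lines of code. A sketch following the steps as described (the text calls this a paired bootstrap; the sign-flipping procedure is what it specifies):

```python
import random

def sign_flip_p_value(deltas: list[float], trials: int = 1000, seed: int = 0) -> float:
    """p-value from randomly sign-flipping per-example score deltas.
    Under the null hypothesis, each delta is equally likely to be +/-."""
    rng = random.Random(seed)
    observed = sum(deltas) / len(deltas)
    exceed = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in deltas]
        if sum(flipped) / len(flipped) >= observed:
            exceed += 1
    return exceed / trials
```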
Pipeline Orchestration
DAG runner
The pipeline runner (pipeline.py) models the pipeline as a directed
acyclic graph. Each step is a PipelineStep with a name, command,
list of dependencies, and expected output files. The runner:
- Resolves execution order with topological sort (Kahn's algorithm)
- Checks for cached outputs before executing each step
- Runs each step as a subprocess with timing
- Records success/failure and timing in a pipeline log
- Continues with remaining steps even if one fails (unless downstream steps depend on the failed step)
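The ordering step can be sketched with Kahn's algorithm over a minimal step graph of {name: [dependencies]}. PipelineStep in pipeline.py carries more fields (command, outputs); only the scheduling logic is shown:

```python
from collections import deque

def topo_order(deps: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm: repeatedly emit steps with no unmet dependencies."""
    indegree = {s: len(d) for s, d in deps.items()}
    dependents: dict[str, list[str]] = {s: [] for s in deps}
    for step, ds in deps.items():
        for d in ds:
            dependents[d].append(step)
    ready = deque(sorted(s for s, n in indegree.items() if n == 0))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in dependents[step]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected in step graph")
    return order
```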
The orchestrator script (run_forge.py) defines the step graph and passes
it to the runner. If you re-run after a mid-pipeline failure, only the failed step and
its dependents re-execute. The cache check is based on the existence of output files,
not content hashing. This is a simplification: it means that if you change the
generation prompt but not the output path, the cache will serve stale results. For a
portfolio project, this trade-off is acceptable. For production, you would hash the
step configuration as part of the cache key.
Experiment tracking
The tracker (tracker.py) is a lightweight alternative to MLflow or W&B.
Each run creates a directory under runs/ with:
- config.json: Full run configuration (model, hyperparameters, paths)
- pipeline_log.json: Step-by-step execution log with timings
- benchmark.json: Before/after comparison with significance test
- All intermediate JSONL files (raw, quality, deduped, judged, difficulty, train, test)
The Streamlit dashboard (app.py) reads these directories and renders
comparison views. You can select any two runs and compare their metrics side by side.
Testing
The test suite has 184 tests across 24 files. The approach varies by component:
- Unit tests for pure functions (quality scoring, difficulty calibration, n-gram matching, ROUGE computation)
- Property-based tests using Hypothesis for invariants that should hold for any input (quality scores are always between 0 and 1, deduplication never increases the dataset size, chunking always produces non-empty chunks)
- Integration tests for the pipeline runner (build a small DAG, run it, verify execution order and caching behavior)
- Mock-based tests for LLM-dependent code (the judge, the bot). These use DummyLLMClient or unittest.mock.patch to avoid real API calls.
Coverage is 72% overall. The gaps are in source loaders (PDF, web) and the CLI module, which are harder to unit test and less critical than the core pipeline logic. The core modules (io, quality, difficulty, diversity, selector, pipeline, tracker, task_selector, chunking) are all above 80%.
Limitations and What I Would Do Differently
Honest limitations
- Fine-tuning model size. Qwen 2.5 0.5B is a proof of concept. A 0.5B model does not have the capacity to learn complex reasoning from 160 examples. The ROUGE improvement is real, but it mostly reflects pattern matching (the model learns to produce outputs that look like the training data). A 7B+ model with 1000+ examples would show more meaningful improvements.
- Contamination detection scope. The n-gram method only catches verbatim overlap. A training example that paraphrases a benchmark question would not be flagged. Embedding-based contamination detection would catch more cases but at the cost of false positives.
- LLM-as-judge bias. Using GPT-4.1-mini to judge data generated by GPT-4.1-mini introduces a self-evaluation bias. The judge may rate its own generation style higher than it deserves. Ideally, the judge would be a different model family, or you would calibrate against human ratings.
- Single-domain evaluation. The benchmark only measures performance on the same domain as the training data. It does not check whether fine-tuning caused catastrophic forgetting on general-purpose tasks.
What I would change
- Async generation. The current pipeline makes sequential API calls with some batching. Async concurrency with rate limiting would cut the generation and judging time by 3-5x.
- Content-addressed caching. The pipeline cache checks file existence only. Hashing the step config and input content would make caching correct across configuration changes.
- Cross-domain forgetting test. Run the fine-tuned model on a general-purpose benchmark (MMLU subset) to verify that domain-specific fine-tuning did not degrade general capabilities.
- Human evaluation calibration. Have a human rate 50 examples, then measure correlation between human ratings and LLM judge ratings. This would give a confidence score for the automated evaluation.
Cost Breakdown
| Component | Cost | Notes |
|---|---|---|
| Generation (200 examples, 4 tasks) | ~$3-5 | GPT-4.1-mini, 162 chunks |
| LLM-as-Judge (200 x 4 dimensions) | ~$1-2 | GPT-4.1-mini, 800 calls |
| Heuristic quality, dedup, difficulty | $0 | No API calls |
| MLX fine-tuning | $0 | Local Apple Silicon |
| Benchmark inference | $0 | Local MLX |
| Total | ~$5 | |
The pipeline is designed to be cheap. The most expensive step is generation ($3-5), followed by judging ($1-2). Everything else is local computation. This makes it feasible to iterate quickly: change a prompt template, re-run, compare results.
For a plain-English explanation, read the overview. For live results and sample data, see the demo page. Source code is on GitHub.