How Forge Is Built
Architecture decisions, component design, trade-offs, and lessons learned. Written to the level of detail where you could rebuild any part from scratch.
Motivation and Design Goals
The idea for Forge came from a pattern I kept seeing in training data projects: people generate synthetic data, glance at a few examples, and ship it. Nobody checks if the data is actually good. Nobody trains a model on it to prove it works. Nobody looks for benchmark contamination. The result is a pipeline that produces output but offers no evidence that the output is useful.
I wanted to build something different. Not just a data generator, but a system that evaluates its own output and closes the loop by fine-tuning a model and measuring the result. The design goals were:
- Multi-strategy generation (not just Q&A)
- Automated quality evaluation with rubric-based scoring
- Benchmark contamination detection
- Local model fine-tuning (zero cloud GPU cost)
- Statistical significance testing on benchmark results
- Full reproducibility with experiment tracking
System Architecture
Forge is a Python library (src/training_data_robo/) with thin CLI
wrappers in scripts/. The library handles all the logic; the scripts
handle argument parsing and call into the library. This separation matters for
testability. You can unit test the library code without touching CLI arguments or
file I/O.
The pipeline orchestrator (pipeline.py) wires everything together as a
directed acyclic graph (DAG). Each step declares its dependencies, and the runner
resolves execution order with topological sort. Steps cache their outputs, so if you
re-run after a failure, only the failed and downstream steps execute.
Document Processing
Source loading
The unified loader (sources/unified.py) accepts a directory path and
dispatches to format-specific readers for text files, PDFs, and web pages. Each reader
returns a list of Document objects with a text field and a
metadata dict containing the source path and format. This abstraction
lets you add new formats without touching the pipeline.
Structure-aware chunking
Naive chunking splits text at fixed character boundaries or paragraph breaks. This
produces chunks that start mid-sentence or split a numbered list in half. Forge's
chunker (chunking.py) does something better: it scans for markdown
headers, list markers, and horizontal rules, and uses those as preferred split points.
Each chunk carries metadata about its structure: section_title,
section_level, and chunk_type (prose, list, table, or mixed).
This metadata feeds into the task selector downstream.
The chunker takes two parameters: max_chars (the target chunk size,
default 900) and overlap (the number of characters to repeat between
consecutive chunks, default 100). The overlap ensures that content near chunk boundaries
is not lost. I chose 900 characters as the default because it produces chunks of roughly
200-250 tokens, which is a comfortable input length for generation without overwhelming
the context.
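The preferred-split-point idea can be sketched in a few lines. This is an illustrative reimplementation, not the actual chunking.py code; the regex and the fallback behavior are assumptions.

```python
import re

# Structural markers treated as preferred split points: markdown headers,
# list markers, numbered items, and horizontal rules (an assumption about
# what chunking.py scans for).
STRUCTURE = re.compile(r"^(#{1,6} |[-*] |\d+\. |---)", re.MULTILINE)

def chunk(text: str, max_chars: int = 900, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer the last structural marker inside the window over
            # a hard cut mid-sentence.
            window = text[start:end]
            marks = [m.start() for m in STRUCTURE.finditer(window)]
            if marks and marks[-1] > 0:
                end = start + marks[-1]
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by `overlap` characters so boundary content repeats.
        start = max(end - overlap, start + 1)
    return chunks
```

The fallback matters: a long run of prose with no markers still gets split at the hard limit, so no chunk ever exceeds max_chars.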
Adaptive task selection
Not every chunk is suitable for every task type. A 50-word chunk is too short for a meaningful summary but fine for a title generation task. A chunk containing a bulleted list maps naturally to key-point extraction but poorly to chain-of-thought reasoning.
The task selector (task_selector.py) maps chunk metadata to task types.
Prose chunks with at least 150 characters get all four task types (QA, summary,
instruction, chain-of-thought). List chunks get key-point extraction and classification.
Short chunks (under 150 characters) get only title generation. This produces a more
diverse and naturally balanced dataset than assigning all task types to all chunks.
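The mapping rules above amount to a small decision function. A minimal sketch, with task names and the handling of mixed/table chunks assumed (the real task_selector.py may differ):

```python
def select_tasks(chunk_type: str, char_count: int) -> list[str]:
    """Map chunk metadata to task types, per the rules described above."""
    if char_count < 150:
        return ["title"]                      # too short for anything else
    if chunk_type == "list":
        return ["keypoints", "classification"]
    # Prose (and, by assumption, mixed/table) chunks get all four types.
    return ["qa", "summary", "instruction", "cot"]
```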
Multi-Strategy Generation
Forge generates four types of training examples, each using a different prompt template
defined in cli.py:
| Task | What it produces | Why it matters |
|---|---|---|
| Q&A | A question about the passage and its answer | Tests factual comprehension, the most common training data format |
| Summary | A condensed version of the passage | Teaches the model to distill information, useful for many downstream tasks |
| Instruction | A realistic user request and the ideal response | Trains instruction-following behavior, closer to real usage patterns |
| Chain-of-thought | A reasoning question with step-by-step solution | Develops multi-step reasoning, the hardest skill to train |
The generation model (GPT-4.1-mini by default) receives the chunk text and a structured
prompt template. The template specifies the expected output format so that the response
can be parsed reliably. For example, the instruction template asks for output in the
format INSTRUCTION: ... RESPONSE: ....
LLM client abstraction
The ai_client.py module defines two implementations of the same interface:
OpenAILLMClient for real API calls and DummyLLMClient for
testing. The dummy client returns deterministic, syntactically valid responses that
pass downstream processing without hitting any API. This lets the entire pipeline run
in "dry-run" mode for development and CI.
The bot (bot.py) automatically selects the client based on whether
OPENAI_API_KEY is set, or you can pass --fake-model to
force the dummy client.
Quality Assurance Pipeline
Quality evaluation happens in multiple layers. Each layer catches different types of problems, and they are ordered from cheapest to most expensive.
Layer 1: Heuristic quality scoring
The first pass (postprocess_quality.py calling into quality.py)
runs zero-cost checks on every example:
- Empty output detection. Penalizes blank or whitespace-only outputs (score -0.5).
- Length validation. Each task type has a minimum output length. Summaries require at least 80 characters, Q&A at least 10. Outputs below the threshold get a short_output flag (score -0.2).
- Refusal detection. Matches against a list of common refusal phrases ("as an AI language model", "I cannot", "I'm unable to"). Refusals indicate the generation model refused the task, producing useless training data (score -0.3).
- Repetition detection. Counts token frequencies. If any single token accounts for more than 50% of a 20+ token output, it is flagged as repetitive_output (score -0.2). This catches degenerate outputs like "the the the the...".
- Grounding check. For Q&A examples with both an answer and a context, it measures token overlap between the answer and context. If fewer than 20% of answer tokens appear in the context, it is flagged as weak_grounding (score -0.2). This catches hallucinated answers that ignore the source material.
Each example gets a quality score between 0.0 and 1.0, and a list of flags describing what (if anything) went wrong. These heuristics are fast and free. They catch the obvious problems before the expensive LLM judge runs.
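The checks compose into a single scoring function. This sketch uses the thresholds and penalties from the text, but the function shape and flag names are assumptions about quality.py:

```python
REFUSALS = ("as an ai language model", "i cannot", "i'm unable to")
MIN_LEN = {"summary": 80, "qa": 10}  # per-task minimum output lengths

def heuristic_score(task: str, output: str, context: str = "") -> tuple[float, list[str]]:
    score, flags = 1.0, []
    if not output.strip():
        return max(0.0, score - 0.5), ["empty_output"]
    if len(output) < MIN_LEN.get(task, 0):
        score -= 0.2; flags.append("short_output")
    if any(p in output.lower() for p in REFUSALS):
        score -= 0.3; flags.append("refusal")
    toks = output.lower().split()
    # Repetition: one token dominating a 20+ token output.
    if len(toks) >= 20 and max(toks.count(t) for t in set(toks)) / len(toks) > 0.5:
        score -= 0.2; flags.append("repetitive_output")
    # Grounding: answer tokens should overlap the source context.
    if task == "qa" and context:
        ctx = set(context.lower().split())
        overlap = sum(t in ctx for t in toks) / max(len(toks), 1)
        if overlap < 0.2:
            score -= 0.2; flags.append("weak_grounding")
    return max(0.0, score), flags
```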
Layer 2: LLM-as-Judge
The second pass (judge.py) sends each example to GPT-4.1-mini for rubric-based
evaluation. The judge scores four dimensions on a 1 to 5 scale:
- Faithfulness (1-5): Does the output accurately reflect the source? No hallucination?
- Helpfulness (1-5): Would a person find this useful?
- Complexity (1-5): Is the example substantive and non-trivial?
- Coherence (1-5): Is the output well-structured, clear, logical?
For each dimension, the judge returns both a score and a one-sentence explanation. The explanations are stored in the output JSONL so you can inspect why any example got a low score.
Why GPT-4.1-mini instead of GPT-4? Cost. Judging 200 examples across 4 dimensions is 800 LLM calls. With GPT-4.1-mini this costs roughly $1-2. With GPT-4 it would be $15-20 for the same task. The quality of judgments from 4.1-mini is good enough for scoring. Where you need the strongest model is generation, not evaluation.
The judge processes examples in batches of 10 with sequential API calls. I initially considered async concurrency, but the rate limits on the API made sequential processing more predictable and easier to debug. At 10 examples per batch with 4 dimensions each, the 200-example run took about 17 minutes.
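Parsing the judge's structured response is the fiddly part. The line format below ("dimension: score - explanation") is an assumption for illustration; judge.py's actual prompt and response format may differ:

```python
import re

DIMENSIONS = ("faithfulness", "helpfulness", "complexity", "coherence")

def parse_judgment(text: str) -> dict[str, tuple[int, str]]:
    """Extract (score, explanation) pairs from lines like
    'faithfulness: 4 - accurately grounded in the source'."""
    scores = {}
    for dim in DIMENSIONS:
        m = re.search(rf"{dim}\s*:\s*([1-5])\s*-\s*(.+)", text, re.IGNORECASE)
        if m:
            scores[dim] = (int(m.group(1)), m.group(2).strip())
    return scores
```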
Layer 3: Deduplication
Duplicate examples bias the model toward certain patterns and waste the training budget. Forge supports two deduplication methods:
- Hash-based (compute_dedupe.py): Converts each output to a bag-of-words representation, sorts the tokens, and hashes the result. Identical hashes mean identical content (up to word order). This is fast and deterministic.
- Embedding-based: Uses text-embedding-3-small to compute vector representations and removes examples whose cosine similarity exceeds a threshold. This catches paraphrased duplicates that hash-based methods miss, at the cost of an API call.
The default is hash-based because it is free and catches exact duplicates. In the demo run, zero duplicates were found because the generation model produces sufficiently diverse outputs from different chunks.
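The hash-based method fits in a dozen lines. A sketch of the idea (the actual compute_dedupe.py may normalize tokens differently):

```python
import hashlib

def bow_hash(text: str) -> str:
    """Bag-of-words hash: lowercase, sort tokens, hash the result.
    Word order is erased, so 'a b c' and 'c b a' collide by design."""
    bag = " ".join(sorted(text.lower().split()))
    return hashlib.sha256(bag.encode()).hexdigest()

def dedupe(examples: list[str]) -> list[str]:
    """Keep the first example for each distinct bag-of-words hash."""
    seen, kept = set(), []
    for ex in examples:
        h = bow_hash(ex)
        if h not in seen:
            seen.add(h)
            kept.append(ex)
    return kept
```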
Layer 4: Contamination detection
Benchmark contamination is when training data overlaps with evaluation benchmarks. If you train on MMLU questions and then evaluate on MMLU, the results are meaningless. This problem is surprisingly common in practice and rarely checked for.
The contamination detector (contamination.py) works by n-gram matching.
It downloads benchmark datasets (MMLU, ARC, HellaSwag) on first run and caches them
locally. It builds indexes of 8-grams and 13-grams from the benchmark text, then
checks every training example for matching n-grams.
Why n-grams instead of embedding similarity? Precision. N-gram matching produces zero false positives: if an 8-gram from your training data appears verbatim in MMLU, that is a real overlap. Embedding similarity would flag semantically similar but textually different content, leading to over-flagging. The trade-off is that n-gram matching does not catch paraphrased contamination, but for the purpose of building a training data pipeline, the precision of exact matching is more useful than the recall of fuzzy matching.
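The core of the detector is set intersection over n-gram indexes. A minimal sketch assuming whitespace tokenization (contamination.py also indexes 13-grams; only the 8-gram case is shown):

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All n-token windows of the text, as hashable tuples."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_index(benchmark_texts: list[str], n: int = 8) -> set:
    """Union of n-grams across all benchmark passages."""
    index = set()
    for t in benchmark_texts:
        index |= ngrams(t, n)
    return index

def is_contaminated(example: str, index: set, n: int = 8) -> bool:
    """True if any n-gram of the example appears verbatim in a benchmark."""
    return bool(ngrams(example, n) & index)
```

Because membership is exact, a hit is always a real verbatim overlap, which is the precision property the text argues for.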
Difficulty calibration
The difficulty module (difficulty.py) tags each example as easy, medium,
or hard based on a scoring function. The scoring considers:
- Output length (longer outputs tend to be harder)
- Vocabulary complexity (type-token ratio, average word length)
- Task type (chain-of-thought is harder than title generation)
- Reasoning indicators (presence of "step", "because", "therefore")
This is a heuristic, not a ground truth. It exists to enable curriculum-based training strategies and to give a rough breakdown of dataset composition. In the demo run, the distribution was 1% easy, 38% medium, 61% hard, which makes sense given that the source material was technical ML content.
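The signals above can be combined into a single score with cutoffs for the three buckets. The weights and thresholds below are invented for illustration; difficulty.py's actual values are not documented here:

```python
TASK_WEIGHT = {"title": 0.0, "qa": 0.3, "summary": 0.4, "cot": 0.8}
REASONING = ("step", "because", "therefore")

def difficulty(task: str, output: str) -> str:
    toks = output.lower().split()
    score = TASK_WEIGHT.get(task, 0.3)           # task type baseline
    score += min(len(toks) / 400, 0.5)           # longer tends to be harder
    if toks:
        score += 0.3 * (len(set(toks)) / len(toks))  # type-token ratio
    score += 0.2 * sum(w in output.lower() for w in REASONING)
    if score < 0.5:
        return "easy"
    return "medium" if score < 1.0 else "hard"
```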
Data Selection and Splitting
The selector (selector.py) implements four strategies for choosing which
examples to include in the final training set:
- Quality-weighted: Ranks by judge score and selects the top N. Simple and effective.
- Diverse: Greedily selects examples that maximize vocabulary diversity. Avoids redundancy.
- Balanced: Ensures equal representation across task types. Prevents the model from over-fitting to any single task.
- Curriculum: Selects a mix of difficulty levels, biased toward medium difficulty with smaller fractions of easy and hard. Based on the intuition that training is most efficient on examples that are challenging but not impossibly hard.
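As one concrete example, the balanced strategy can be sketched as a round-robin over task-type buckets, taking the highest-judged examples first within each bucket. Field names ("task", "judge_score") are assumptions about the JSONL schema, and selector.py may break ties differently:

```python
from collections import defaultdict

def select_balanced(examples: list[dict], n: int) -> list[dict]:
    """Pick n examples with (near-)equal representation per task type."""
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["task"]].append(ex)
    # Within each task bucket, prefer the highest judge scores.
    for bucket in by_task.values():
        bucket.sort(key=lambda e: e["judge_score"], reverse=True)
    picked = []
    while len(picked) < n and any(by_task.values()):
        for task in sorted(by_task):  # deterministic round-robin order
            if by_task[task] and len(picked) < n:
                picked.append(by_task[task].pop(0))
    return picked
```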
The train/test split (split_dataset.py) uses stratified sampling to
ensure the same distribution of task types in both sets. The default is 80% train /
20% test. The test set is never used for training, selection, or any other pipeline
step except the final benchmark.
Fine-Tuning
Why LoRA on a small model
The goal of the fine-tuning step is not to produce a production model. It is to prove that the generated training data causes measurable improvement on a real model. For this purpose, a small model (Qwen 2.5 0.5B, 494 million parameters) fine-tuned with LoRA is ideal:
- It trains in 5 minutes on a laptop with no cloud GPU
- LoRA trains only 0.59% of parameters (2.9M out of 494M), so the memory footprint is manageable (9 GB peak)
- The model is small enough that 200 training examples can produce visible improvement, whereas a 7B+ model would need thousands
- If the data improves a small model, it will also improve a larger one. The proof-of-concept transfers.
MLX on Apple Silicon
I chose MLX (Apple's machine learning framework) because it runs natively on Apple
Silicon with unified memory, eliminating the need for a GPU server. The
finetune_mlx.py script handles the full workflow:
- Load the training JSONL and convert each example to chat format: {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
- Write the formatted data to a directory with train.jsonl, valid.jsonl, and test.jsonl
- Invoke mlx_lm.lora with the configured hyperparameters
- Save the adapter weights and training metrics
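The conversion step amounts to wrapping each record in the chat schema MLX expects. The input field names ("input", "output") are assumptions about Forge's JSONL layout:

```python
import json

def to_chat(example: dict) -> dict:
    """Wrap a (input, output) pair in the messages format shown above."""
    return {"messages": [
        {"role": "user", "content": example["input"]},
        {"role": "assistant", "content": example["output"]},
    ]}

def write_split(examples: list[dict], path: str) -> None:
    """Write one chat-formatted record per line (JSONL)."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(to_chat(ex)) + "\n")
```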
The default configuration uses rank 8, 16 layers, batch size 4, learning rate 1e-4,
and 3 epochs. The number of iterations is calculated as epochs * (train_examples / batch_size).
Checkpoints are saved every 40 iterations.
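Plugging in the demo numbers (160 training examples after the 80/20 split, 3 epochs, batch size 4) confirms the iteration count seen in training:

```python
epochs, train_examples, batch_size = 3, 160, 4
iterations = epochs * (train_examples // batch_size)  # 3 * 40 = 120
```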
Training dynamics
In the demo run, training loss dropped from 1.9 to 0.14 over 120 iterations. Validation loss went from 2.3 to 0.73. The gap between training and validation loss (0.14 vs 0.73) suggests some overfitting, which is expected with a small dataset and 3 epochs. For a proof-of-concept, this is fine. In production, you would use more data, fewer epochs, or stronger regularization.
Benchmarking
Metrics
The benchmark script (benchmark.py) computes three standard metrics:
- ROUGE-1: Unigram overlap between generated text and reference. Measures whether the model uses the right words.
- ROUGE-2: Bigram overlap. Measures whether the model produces the right phrases.
- ROUGE-L: Longest common subsequence. Measures structural similarity independent of word order.
Exact match is also computed but was 0% for both models, which is expected. Open-ended generation tasks rarely produce exact matches even when the content is correct.
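ROUGE-L is the least obvious of the three, so here is a standard F1 formulation over the longest common subsequence. This is the textbook definition, not necessarily the exact implementation in benchmark.py:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via a rolling 1-D DP table."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)
```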
Paired bootstrap significance test
Showing that ROUGE-L went from 0.289 to 0.417 is not enough. You need to know if the difference is statistically significant. Forge uses a paired bootstrap test:
- Compute per-example ROUGE-L scores for both models
- Calculate the observed delta (mean difference)
- Randomly flip the sign of each per-example delta 1000 times, simulating the null hypothesis that there is no real difference
- Count how often the mean of the flipped deltas exceeds the observed delta
- The fraction is the p-value
A p-value below 0.05 means the improvement is statistically significant. In the demo run, p = 0.0 (none of the 1000 shuffled deltas exceeded the observed one), indicating the improvement is real with very high confidence.
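The recipe above is a few lines of code. A sketch following the steps as described (the text calls this a paired bootstrap; the sign-flipping procedure is what it specifies):

```python
import random

def sign_flip_p_value(deltas: list[float], trials: int = 1000, seed: int = 0) -> float:
    """p-value from randomly sign-flipping per-example score deltas.
    Under the null hypothesis, each delta is equally likely to be +/-."""
    rng = random.Random(seed)
    observed = sum(deltas) / len(deltas)
    exceed = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in deltas]
        if sum(flipped) / len(flipped) >= observed:
            exceed += 1
    return exceed / trials
```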
Pipeline Orchestration
DAG runner
The pipeline runner (pipeline.py) models the pipeline as a directed
acyclic graph. Each step is a PipelineStep with a name, command,
list of dependencies, and expected output files. The runner:
- Resolves execution order with topological sort (Kahn's algorithm)
- Checks for cached outputs before executing each step
- Runs each step as a subprocess with timing
- Records success/failure and timing in a pipeline log
- Continues with remaining steps even if one fails (unless downstream steps depend on the failed step)
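The ordering step can be sketched with Kahn's algorithm over a minimal step graph of {name: [dependencies]}. PipelineStep in pipeline.py carries more fields (command, outputs); only the scheduling logic is shown:

```python
from collections import deque

def topo_order(deps: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm: repeatedly emit steps with no unmet dependencies."""
    indegree = {s: len(d) for s, d in deps.items()}
    dependents: dict[str, list[str]] = {s: [] for s in deps}
    for step, ds in deps.items():
        for d in ds:
            dependents[d].append(step)
    ready = deque(sorted(s for s, n in indegree.items() if n == 0))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in dependents[step]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected in step graph")
    return order
```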
The orchestrator script (run_forge.py) defines the step graph and passes
it to the runner. If you re-run after a mid-pipeline failure, only the failed step and
its dependents re-execute. The cache check is based on the existence of output files,
not content hashing. This is a simplification: it means that if you change the
generation prompt but not the output path, the cache will serve stale results. For a
portfolio project, this trade-off is acceptable. For production, you would hash the
step configuration as part of the cache key.
Experiment tracking
The tracker (tracker.py) is a lightweight alternative to MLflow or W&B.
Each run creates a directory under runs/ with:
- config.json: Full run configuration (model, hyperparameters, paths)
- pipeline_log.json: Step-by-step execution log with timings
- benchmark.json: Before/after comparison with significance test
- All intermediate JSONL files (raw, quality, deduped, judged, difficulty, train, test)
The Streamlit dashboard (app.py) reads these directories and renders
comparison views. You can select any two runs and compare their metrics side by side.
Testing
The test suite has 184 tests across 24 files. The approach varies by component:
- Unit tests for pure functions (quality scoring, difficulty calibration, n-gram matching, ROUGE computation)
- Property-based tests using Hypothesis for invariants that should hold for any input (quality scores are always between 0 and 1, deduplication never increases the dataset size, chunking always produces non-empty chunks)
- Integration tests for the pipeline runner (build a small DAG, run it, verify execution order and caching behavior)
- Mock-based tests for LLM-dependent code (the judge, the bot). These use DummyLLMClient or unittest.mock.patch to avoid real API calls.
Coverage is 72% overall. The gaps are in source loaders (PDF, web) and the CLI module, which are harder to unit test and less critical than the core pipeline logic. The core modules (io, quality, difficulty, diversity, selector, pipeline, tracker, task_selector, chunking) are all above 80%.
Limitations and What I Would Do Differently
Honest limitations
- Fine-tuning model size. Qwen 2.5 0.5B is a proof of concept. A 0.5B model does not have the capacity to learn complex reasoning from 160 examples. The ROUGE improvement is real, but it mostly reflects pattern matching (the model learns to produce outputs that look like the training data). A 7B+ model with 1000+ examples would show more meaningful improvements.
- Contamination detection scope. The n-gram method only catches verbatim overlap. A training example that paraphrases a benchmark question would not be flagged. Embedding-based contamination detection would catch more cases but at the cost of false positives.
- LLM-as-judge bias. Using GPT-4.1-mini to judge data generated by GPT-4.1-mini introduces a self-evaluation bias. The judge may rate its own generation style higher than it deserves. Ideally, the judge would be a different model family, or you would calibrate against human ratings.
- Single-domain evaluation. The benchmark only measures performance on the same domain as the training data. It does not check whether fine-tuning caused catastrophic forgetting on general-purpose tasks.
What I would change
- Async generation. The current pipeline makes sequential API calls with some batching. Async concurrency with rate limiting would cut the generation and judging time by 3-5x.
- Content-addressed caching. The pipeline cache checks file existence only. Hashing the step config and input content would make caching correct across configuration changes.
- Cross-domain forgetting test. Run the fine-tuned model on a general-purpose benchmark (MMLU subset) to verify that domain-specific fine-tuning did not degrade general capabilities.
- Human evaluation calibration. Have a human rate 50 examples, then measure correlation between human ratings and LLM judge ratings. This would give a confidence score for the automated evaluation.
Cost Breakdown
| Component | Cost | Notes |
|---|---|---|
| Generation (200 examples, 4 tasks) | ~$3-5 | GPT-4.1-mini, 162 chunks |
| LLM-as-Judge (200 x 4 dimensions) | ~$1-2 | GPT-4.1-mini, 800 calls |
| Heuristic quality, dedup, difficulty | $0 | No API calls |
| MLX fine-tuning | $0 | Local Apple Silicon |
| Benchmark inference | $0 | Local MLX |
| Total | ~$5 | |
The pipeline is designed to be cheap. The most expensive step is generation ($3-5), followed by judging ($1-2). Everything else is local computation. This makes it feasible to iterate quickly: change a prompt template, re-run, compare results.
For a plain-English explanation, read the overview. For live results and sample data, see the demo page. Source code is on GitHub.