How Forge Is Built

Architecture decisions, component design, trade-offs, and lessons learned. Written to the level of detail where you could rebuild any part from scratch.

Motivation and Design Goals

The idea for Forge came from a pattern I kept seeing in training data projects: people generate synthetic data, glance at a few examples, and ship it. Nobody checks if the data is actually good. Nobody trains a model on it to prove it works. Nobody looks for benchmark contamination. The result is a pipeline that produces output but offers no evidence that the output is useful.

I wanted to build something different. Not just a data generator, but a system that evaluates its own output and closes the loop by fine-tuning a model and measuring the result. That goal drove every design decision below.

System Architecture

Forge is a Python library (src/training_data_robo/) with thin CLI wrappers in scripts/. The library handles all the logic; the scripts handle argument parsing and call into the library. This separation matters for testability. You can unit test the library code without touching CLI arguments or file I/O.

The pipeline orchestrator (pipeline.py) wires everything together as a directed acyclic graph (DAG). Each step declares its dependencies, and the runner resolves execution order with topological sort. Steps cache their outputs, so if you re-run after a failure, only the failed and downstream steps execute.

src/training_data_robo/
  bot.py             # orchestrator: loads docs, chunks, generates, exports
  cli.py             # CLI entry point with task templates
  models.py          # domain models: TaskType, TrainingExample, TextChunk
  chunking.py        # structure-aware document chunking
  task_selector.py   # maps chunk metadata to task types
  quality.py         # heuristic quality filters
  judge.py           # LLM-as-judge rubric evaluation
  contamination.py   # n-gram overlap detection
  difficulty.py      # easy/medium/hard calibration
  diversity.py       # vocabulary and task diversity metrics
  selector.py        # data selection strategies
  pipeline.py        # DAG runner with caching and resume
  tracker.py         # experiment tracking (runs/ directory)
  io.py              # JSONL read/write utilities
  ai_client.py       # LLM client abstraction

Document Processing

Source loading

The unified loader (sources/unified.py) accepts a directory path and dispatches to format-specific readers for text files, PDFs, and web pages. Each reader returns a list of Document objects with a text field and a metadata dict containing the source path and format. This abstraction lets you add new formats without touching the pipeline.

Structure-aware chunking

Naive chunking splits text at fixed character boundaries or paragraph breaks. This produces chunks that start mid-sentence or split a numbered list in half. Forge's chunker (chunking.py) does something better: it scans for markdown headers, list markers, and horizontal rules, and uses those as preferred split points.

Each chunk carries metadata about its structure: section_title, section_level, and chunk_type (prose, list, table, or mixed). This metadata feeds into the task selector downstream.

The chunker takes two parameters: max_chars (the target chunk size, default 900) and overlap (the number of characters to repeat between consecutive chunks, default 100). The overlap ensures that content near chunk boundaries is not lost. I chose 900 characters as the default because it produces chunks of roughly 200-250 tokens, which is a comfortable input length for generation without overwhelming the context.
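The boundary-preferring split can be sketched in a few lines. This is an illustrative reimplementation under the parameters described above, not chunking.py itself (which also attaches the section metadata):

```python
import re

def chunk_text(text: str, max_chars: int = 900, overlap: int = 100) -> list[str]:
    """Structure-aware chunking sketch: emit chunks of at most max_chars,
    preferring markdown headers, list markers, or blank lines as split
    points, with `overlap` characters repeated between chunks."""
    # A newline followed by a header, list marker, or another newline
    # marks a preferred split point.
    boundary = re.compile(r"\n(?=#{1,6} |- |\d+\. |\n)")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            matches = list(boundary.finditer(text, start, end))
            # Only take a structural boundary if it is far enough in to
            # produce a reasonably sized chunk.
            if matches and matches[-1].start() - start > max_chars // 2:
                end = matches[-1].start() + 1  # split just after the newline
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # back up to create the overlap
    return chunks
```

The overlap step backs the cursor up before the next chunk starts, so sentences that straddle a cut appear in both neighbors.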

Adaptive task selection

Not every chunk is suitable for every task type. A 50-word chunk is too short for a meaningful summary but fine for a title generation task. A chunk containing a bulleted list maps naturally to key-point extraction but poorly to chain-of-thought reasoning.

The task selector (task_selector.py) maps chunk metadata to task types. Prose chunks with at least 150 characters get all four task types (QA, summary, instruction, chain-of-thought). List chunks get key-point extraction and classification. Short chunks (under 150 characters) get only title generation. This produces a more diverse and naturally balanced dataset than assigning all task types to all chunks.
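A minimal sketch of those rules follows; the Chunk fields and task-type names are illustrative, not Forge's actual models:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    chunk_type: str  # "prose", "list", "table", or "mixed"

def select_tasks(chunk: Chunk) -> list[str]:
    """Map chunk structure to suitable task types, per the rules above."""
    if len(chunk.text) < 150:
        return ["title"]  # too short for anything richer
    if chunk.chunk_type == "list":
        return ["key_points", "classification"]
    # Prose (and other sufficiently long) chunks get the full set.
    return ["qa", "summary", "instruction", "chain_of_thought"]
```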

Multi-Strategy Generation

Forge generates four types of training examples, each using a different prompt template defined in cli.py:

Task             | What it produces                                | Why it matters
-----------------|-------------------------------------------------|---------------
Q&A              | A question about the passage and its answer     | Tests factual comprehension, the most common training data format
Summary          | A condensed version of the passage              | Teaches the model to distill information, useful for many downstream tasks
Instruction      | A realistic user request and the ideal response | Trains instruction-following behavior, closer to real usage patterns
Chain-of-thought | A reasoning question with step-by-step solution | Develops multi-step reasoning, the hardest skill to train

The generation model (GPT-4.1-mini by default) receives the chunk text and a structured prompt template. The template specifies the expected output format so that the response can be parsed reliably. For example, the instruction template asks for output in the format INSTRUCTION: ... RESPONSE: ....

LLM client abstraction

The ai_client.py module defines two implementations of the same interface: OpenAILLMClient for real API calls and DummyLLMClient for testing. The dummy client returns deterministic, syntactically valid responses that pass downstream processing without hitting any API. This lets the entire pipeline run in "dry-run" mode for development and CI.

The bot (bot.py) automatically selects the client based on whether OPENAI_API_KEY is set, or you can pass --fake-model to force the dummy client.
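The two-implementation pattern can be sketched as follows; the class and method names here are assumptions for illustration, not ai_client.py's actual API:

```python
from abc import ABC, abstractmethod
import hashlib

class LLMClient(ABC):
    """Shared interface: one method, one string in, one string out."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class DummyLLMClient(LLMClient):
    """Deterministic stand-in: the same prompt always yields the same
    syntactically valid output, so the pipeline runs offline in CI."""
    def complete(self, prompt: str) -> str:
        seed = hashlib.sha256(prompt.encode()).hexdigest()[:8]
        return f"INSTRUCTION: Dummy task {seed}\nRESPONSE: Dummy answer {seed}"
```

Determinism matters more than realism here: repeated dry runs must produce identical artifacts so that test failures point at pipeline changes, not random output.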

Quality Assurance Pipeline

Quality evaluation happens in multiple layers. Each layer catches different types of problems, and they are ordered from cheapest to most expensive.

Layer 1: Heuristic quality scoring

The first pass (postprocess_quality.py calling into quality.py) runs zero-cost checks on every example:

Each example gets a quality score between 0.0 and 1.0, and a list of flags describing what (if anything) went wrong. These heuristics are fast and free. They catch the obvious problems before the expensive LLM judge runs.
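The layer's shape can be sketched as below. The specific checks (empty output, minimum length, repetition) are my own illustrative assumptions about what zero-cost checks look like, not necessarily the ones quality.py runs:

```python
def heuristic_quality(example: dict) -> tuple[float, list[str]]:
    """Score an example in [0.0, 1.0] and collect flags for anything wrong.
    Field name 'output' and the checks themselves are illustrative."""
    flags: list[str] = []
    text = example.get("output", "")
    if not text.strip():
        flags.append("empty_output")
    if len(text) < 20:
        flags.append("too_short")
    words = text.lower().split()
    if words and len(set(words)) / len(words) < 0.3:
        flags.append("repetitive")  # low type/token ratio
    score = max(0.0, 1.0 - 0.4 * len(flags))  # each flag costs 0.4
    return score, flags
```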

Layer 2: LLM-as-Judge

The second pass (judge.py) sends each example to GPT-4.1-mini for rubric-based evaluation. The judge scores four dimensions on a 1 to 5 scale:

For each dimension, the judge returns both a score and a one-sentence explanation. The explanations are stored in the output JSONL so you can inspect why any example got a low score.

Why GPT-4.1-mini instead of GPT-4? Cost. Judging 200 examples across 4 dimensions is 800 LLM calls. With GPT-4.1-mini this costs roughly $1-2. With GPT-4 it would be $15-20 for the same task. The quality of judgments from 4.1-mini is good enough for scoring. Where you need the strongest model is generation, not evaluation.

The judge processes examples in batches of 10 with sequential API calls. I initially considered async concurrency, but the rate limits on the API made sequential processing more predictable and easier to debug. At 10 examples per batch with 4 dimensions each, the 200-example run took about 17 minutes.

Layer 3: Deduplication

Duplicate examples bias the model toward certain patterns and waste the training budget. Forge supports two deduplication methods:

The default is hash-based because it is free and catches exact duplicates. In the demo run, zero duplicates were found because the generation model produces sufficiently diverse outputs from different chunks.
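Hash-based exact deduplication is only a few lines. This sketch normalizes whitespace and case before hashing and keeps the first occurrence (the field name is illustrative):

```python
import hashlib

def dedup_exact(examples: list[dict]) -> list[dict]:
    """Drop examples whose normalized text hashes to one already seen."""
    seen: set[str] = set()
    kept: list[dict] = []
    for ex in examples:
        # Collapse whitespace and lowercase so trivial variants collide.
        key_text = " ".join(str(ex.get("output", "")).lower().split())
        digest = hashlib.sha256(key_text.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(ex)
    return kept
```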

Layer 4: Contamination detection

Benchmark contamination occurs when training data overlaps with evaluation benchmarks. If you train on MMLU questions and then evaluate on MMLU, the results are meaningless. The problem is surprisingly common in practice and rarely checked for.

The contamination detector (contamination.py) works by n-gram matching. It downloads benchmark datasets (MMLU, ARC, HellaSwag) on first run and caches them locally. It builds indexes of 8-grams and 13-grams from the benchmark text, then checks every training example for matching n-grams.

Why n-grams instead of embedding similarity? Precision. N-gram matching produces zero false positives: if an 8-gram from your training data appears verbatim in MMLU, that is a real overlap. Embedding similarity would flag semantically similar but textually different content, leading to over-flagging. The trade-off is that n-gram matching does not catch paraphrased contamination, but for the purpose of building a training data pipeline, the precision of exact matching is more useful than the recall of fuzzy matching.
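The exact-match check reduces to set intersection over token n-grams. A minimal sketch (tokenization here is naive whitespace splitting, an assumption; contamination.py's real tokenization may differ):

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All token n-grams of a text, as hashable tuples."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(example_text: str,
                    benchmark_index: set[tuple[str, ...]],
                    n: int = 8) -> bool:
    """Flag an example if any of its n-grams appears verbatim in the
    precomputed benchmark index."""
    return not ngrams(example_text, n).isdisjoint(benchmark_index)
```

Building the benchmark index once and reusing it across examples is what makes the per-example check cheap.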

Difficulty calibration

The difficulty module (difficulty.py) tags each example as easy, medium, or hard based on a scoring function. The scoring considers:

This is a heuristic, not a ground truth. It exists to enable curriculum-based training strategies and to give a rough breakdown of dataset composition. In the demo run, the distribution was 1% easy, 38% medium, 61% hard, which makes sense given that the source material was technical ML content.

Data Selection and Splitting

The selector (selector.py) implements four strategies for choosing which examples to include in the final training set:

The train/test split (split_dataset.py) uses stratified sampling to ensure the same distribution of task types in both sets. The default is 80% train / 20% test. The test set is never used for training, selection, or any other pipeline step except the final benchmark.
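Stratified splitting can be sketched by grouping on task type and carving off the test fraction per group; the field name and seed handling are illustrative:

```python
import random
from collections import defaultdict

def stratified_split(examples: list[dict],
                     test_frac: float = 0.2,
                     seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Shuffle within each task type, then take test_frac of each group
    as test, so both splits share the same task-type distribution."""
    by_task: dict[str, list[dict]] = defaultdict(list)
    for ex in examples:
        by_task[ex["task_type"]].append(ex)
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for _, group in sorted(by_task.items()):
        rng.shuffle(group)
        k = round(len(group) * test_frac)
        test.extend(group[:k])
        train.extend(group[k:])
    return train, test
```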

Fine-Tuning

Why LoRA on a small model

The goal of the fine-tuning step is not to produce a production model. It is to prove that the generated training data causes measurable improvement on a real model. For this purpose, a small model (Qwen 2.5 0.5B, 494 million parameters) fine-tuned with LoRA is ideal:

MLX on Apple Silicon

I chose MLX (Apple's machine learning framework) because it runs natively on Apple Silicon with unified memory, eliminating the need for a GPU server. The finetune_mlx.py script handles the full workflow:

  1. Load the training JSONL and convert each example to chat format: {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
  2. Write the formatted data to a directory with train.jsonl, valid.jsonl, and test.jsonl
  3. Invoke mlx_lm.lora with the configured hyperparameters
  4. Save the adapter weights and training metrics

The default configuration uses rank 8, 16 layers, batch size 4, learning rate 1e-4, and 3 epochs. The number of iterations is calculated as epochs * (train_examples / batch_size). Checkpoints are saved every 40 iterations.
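The chat-format conversion (step 1) and the iteration arithmetic are small enough to sketch directly; the input field names are assumptions, not the actual JSONL schema:

```python
def to_chat_format(example: dict) -> dict:
    """Convert one training example to the MLX chat format shown above.
    The 'prompt'/'completion' field names are illustrative."""
    return {"messages": [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["completion"]},
    ]}

def lora_iterations(epochs: int, train_examples: int, batch_size: int) -> int:
    """iterations = epochs * (train_examples / batch_size), as configured above."""
    return epochs * (train_examples // batch_size)
```

With the defaults above and a 160-example training split, this gives 3 * (160 // 4) = 120 iterations.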

Training dynamics

In the demo run, training loss dropped from 1.9 to 0.14 over 120 iterations. Validation loss went from 2.3 to 0.73. The gap between training and validation loss (0.14 vs 0.73) suggests some overfitting, which is expected with a small dataset and 3 epochs. For a proof-of-concept, this is fine. In production, you would use more data, fewer epochs, or stronger regularization.

Benchmarking

Metrics

The benchmark script (benchmark.py) computes three standard metrics:

Exact match is also computed but was 0% for both models, which is expected. Open-ended generation tasks rarely produce exact matches even when the content is correct.

Paired bootstrap significance test

Showing that ROUGE-L went from 0.289 to 0.417 is not enough. You need to know if the difference is statistically significant. Forge uses a paired bootstrap test:

  1. Compute per-example ROUGE-L scores for both models
  2. Calculate the observed delta (mean difference)
  3. Randomly flip the sign of each per-example delta and recompute the mean, 1000 times (simulating the null hypothesis that there is no real difference)
  4. Count how often the shuffled delta exceeds the observed delta
  5. The fraction is the p-value

A p-value below 0.05 means the improvement is statistically significant. In the demo run, p = 0.0 (none of the 1000 shuffled deltas exceeded the observed one), indicating the improvement is real with very high confidence.
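The steps above amount to a sign-flip test over per-example deltas. A minimal sketch:

```python
import random

def paired_bootstrap_p(deltas: list[float],
                       n_resamples: int = 1000,
                       seed: int = 0) -> float:
    """One-sided sign-flip test: under the null, each per-example delta's
    sign is arbitrary, so flip signs at random and count how often the
    resampled mean matches or exceeds the observed mean."""
    rng = random.Random(seed)
    observed = sum(deltas) / len(deltas)
    exceed = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in deltas]
        if sum(flipped) / len(flipped) >= observed:
            exceed += 1
    return exceed / n_resamples
```

Pairing is the point: because each delta compares the two models on the same example, per-example difficulty cancels out and the test is far more sensitive than comparing unpaired score distributions.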

Pipeline Orchestration

DAG runner

The pipeline runner (pipeline.py) models the pipeline as a directed acyclic graph. Each step is a PipelineStep with a name, command, list of dependencies, and expected output files. The runner:

  1. Resolves execution order with topological sort (Kahn's algorithm)
  2. Checks for cached outputs before executing each step
  3. Runs each step as a subprocess with timing
  4. Records success/failure and timing in a pipeline log
  5. Continues with remaining steps even if one fails (unless downstream steps depend on the failed step)

The orchestrator script (run_forge.py) defines the step graph and passes it to the runner. If you re-run after a mid-pipeline failure, only the failed step and its dependents re-execute. The cache check is based on the existence of output files, not content hashing. This is a simplification: it means that if you change the generation prompt but not the output path, the cache will serve stale results. For a portfolio project, this trade-off is acceptable. For production, you would hash the step configuration as part of the cache key.
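The ordering step can be sketched with Kahn's algorithm over a step-to-dependencies mapping (a minimal sketch, not pipeline.py itself):

```python
from collections import deque

def topo_order(steps: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm: repeatedly emit a step with no unmet dependencies,
    then decrement the in-degree of everything that depends on it."""
    indegree = {s: len(deps) for s, deps in steps.items()}
    dependents: dict[str, list[str]] = {s: [] for s in steps}
    for step, deps in steps.items():
        for dep in deps:
            dependents[dep].append(step)
    ready = deque(sorted(s for s, d in indegree.items() if d == 0))
    order: list[str] = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in dependents[step]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(steps):
        raise ValueError("cycle detected in step graph")
    return order
```

The length check at the end doubles as cycle detection: if a cycle exists, its members never reach in-degree zero and the emitted order comes up short.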

Experiment tracking

The tracker (tracker.py) is a lightweight alternative to MLflow or W&B. Each run creates a directory under runs/ with:

The Streamlit dashboard (app.py) reads these directories and renders comparison views. You can select any two runs and compare their metrics side by side.

Testing

The test suite has 184 tests across 24 files. The approach varies by component:

Coverage is 72% overall. The gaps are in source loaders (PDF, web) and the CLI module, which are harder to unit test and less critical than the core pipeline logic. The core modules (io, quality, difficulty, diversity, selector, pipeline, tracker, task_selector, chunking) are all above 80%.

Limitations and What I Would Do Differently

Honest limitations

What I would change

Cost Breakdown

Component                            | Cost  | Notes
-------------------------------------|-------|------
Generation (200 examples, 4 tasks)   | ~$3-5 | GPT-4.1-mini, 162 chunks
LLM-as-Judge (200 x 4 dimensions)    | ~$1-2 | GPT-4.1-mini, 800 calls
Heuristic quality, dedup, difficulty | $0    | No API calls
MLX fine-tuning                      | $0    | Local Apple Silicon
Benchmark inference                  | $0    | Local MLX
Total                                | ~$5   |

The pipeline is designed to be cheap. The most expensive step is generation ($3-5), followed by judging ($1-2). Everything else is local computation. This makes it feasible to iterate quickly: change a prompt template, re-run, compare results.


For a plain-English explanation, read the overview. For live results and sample data, see the demo page. Source code is on GitHub.