What Forge Does (And Why It Matters)

A plain-English explanation of the project, the problem it solves, and the approach it takes. No machine learning background required.

The Analogy

Think about how a school prepares students for an exam. The teacher does not hand them a stack of random textbook pages and say "good luck." There is a process. The teacher reads the material, designs practice exercises at different difficulty levels, reviews those exercises for errors, removes any that are duplicates, checks that the practice questions do not overlap with the actual exam, and then has the students study. After studying, the teacher gives a test to measure how much they learned.

Forge does exactly this, but for AI models instead of students.

The "textbooks" are your documents. The "practice exercises" are training examples that Forge generates automatically. The "teacher review" is a quality evaluation pipeline. The "study session" is model fine-tuning. And the "before-and-after test" is a benchmark that proves the training actually worked.

The Problem

AI language models learn to perform specific tasks by studying examples. If you want a model to answer medical questions, you show it thousands of medical Q&A pairs. If you want it to summarize legal documents, you show it examples of legal summaries. These examples are called training data, and their quality directly controls how well the model performs.

The problem is that most people who build training data take shortcuts. They generate a batch of question-answer pairs from their documents, skim through a handful to check they look reasonable, and call it done. There is no systematic quality review, no check for duplicate or contaminated data, no measurement of whether the data actually improves anything. It is a "trust me, it's fine" approach.

This is like a teacher writing an exam prep booklet, never proofreading it, handing it to students, and then claiming it helped without ever giving a test.

What Forge Does

Forge is a pipeline that takes documents as input and produces verified, high-quality training data as output. It does not stop at generation. It evaluates, filters, and ultimately proves that the data works by training a model and measuring the improvement.

Here is each step, in order:

Step 1: Read and break down the documents

Forge reads your source documents (text files, PDFs, web pages) and splits them into smaller chunks. It does not split blindly by paragraph count. It looks at the structure of the text, respecting section headers, lists, and tables, so that each chunk is a coherent unit that makes sense on its own.
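To make the idea concrete, here is a minimal sketch of structure-aware chunking. It is not Forge's actual splitter (which also handles PDFs, lists, and tables): it just treats markdown-style headers as hard boundaries and packs paragraphs into size-limited chunks. The function name and `max_chars` threshold are illustrative assumptions.

```python
import re

def chunk_by_structure(text, max_chars=1200):
    """Split text at section headers first, then pack paragraphs into
    chunks of at most max_chars, so each chunk stays a coherent unit.
    Simplified sketch; a real splitter handles lists, tables, PDFs."""
    # Treat markdown-style headers as hard boundaries.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        paragraphs = [p.strip() for p in section.split("\n\n") if p.strip()]
        current = ""
        for para in paragraphs:
            # Start a new chunk when adding this paragraph would
            # exceed the size budget.
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)
    return chunks
```

Because headers are hard boundaries, a chunk never straddles two sections, which is the property that keeps each chunk self-contained.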

Step 2: Generate training exercises

For each chunk, Forge generates multiple types of training examples using a language model (GPT-4.1-mini by default). Rather than producing only Q&A pairs, it creates four distinct types of exercises.

This variety matters. A model trained on diverse exercise types becomes more versatile than one trained only on Q&A pairs.
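The generation step boils down to prompting a language model once per task type per chunk. The sketch below only builds the prompts; the task-type names (`qa`, `summary`, `instruction`, `extraction`) are hypothetical placeholders for illustration, not Forge's actual four types.

```python
# Hypothetical task types for illustration; the actual four exercise
# types are defined by Forge's configuration, not this sketch.
PROMPT_TEMPLATES = {
    "qa": ("Write one question that can be answered only from the "
           "passage below, then answer it.\n\nPassage:\n{chunk}"),
    "summary": ("Summarize the passage below in two sentences."
                "\n\nPassage:\n{chunk}"),
    "instruction": ("Turn one fact from the passage below into an "
                    "instruction-and-response pair.\n\nPassage:\n{chunk}"),
    "extraction": ("List the key entities mentioned in the passage "
                   "below.\n\nPassage:\n{chunk}"),
}

def build_generation_prompts(chunk):
    """Return one generation prompt per task type for a single chunk."""
    return {task: tmpl.format(chunk=chunk)
            for task, tmpl in PROMPT_TEMPLATES.items()}
```

Each prompt would then be sent to the generator model, yielding several stylistically different examples from the same source chunk.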

Step 3: Check for obvious quality issues

Forge runs each example through a set of heuristic checks: is the output empty? Is it too short for the task type? Does it contain common refusal phrases like "as an AI language model, I cannot..."? Is it repetitive? Each example gets a quality score and a list of flags describing any issues found.
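A minimal version of these heuristic checks might look like the following. The phrase list, length threshold, and scoring formula are assumptions for illustration, not Forge's actual values.

```python
# Illustrative refusal phrases; a real list would be longer.
REFUSAL_PHRASES = ("as an ai language model", "i cannot", "i'm unable to")

def heuristic_check(example, min_len=20):
    """Return (score, flags) for one training example.
    Simplified sketch of empty/short/refusal/repetition checks."""
    out = example["output"].strip()
    flags = []
    if not out:
        flags.append("empty_output")
    elif len(out) < min_len:
        flags.append("too_short")
    if any(p in out.lower() for p in REFUSAL_PHRASES):
        flags.append("refusal")
    words = out.lower().split()
    # Flag outputs where few distinct words repeat many times.
    if words and len(set(words)) / len(words) < 0.3:
        flags.append("repetitive")
    # Each flag knocks a quarter off an assumed 0-1 quality score.
    score = max(0.0, 1.0 - 0.25 * len(flags))
    return score, flags
```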

Step 4: Remove duplicates

Duplicate or near-duplicate examples bias the model toward certain patterns and waste training budget. Forge identifies and removes duplicates using hash-based comparison.
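Exact-duplicate removal by hashing can be sketched in a few lines. This version only catches trivially varied duplicates (case and whitespace); catching genuine near-duplicates would need a fuzzier technique such as MinHash, which is beyond this sketch.

```python
import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants hash alike.
    return " ".join(text.lower().split())

def dedupe(examples):
    """Keep the first occurrence of each normalized (input, output) pair."""
    seen, kept = set(), []
    for ex in examples:
        digest = hashlib.sha256(
            (normalize(ex["input"]) + "\x1f" + normalize(ex["output"])).encode()
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(ex)
    return kept
```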

Step 5: Have an AI reviewer grade every example

This is one of the most important steps. A separate language model (acting as a "judge") reads each example and scores it on four dimensions.

Each dimension gets a score from 1 to 5 with a written explanation. In the demo run, the average score was 4.77 out of 5, and 86% of examples scored a perfect 5/5.
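Once the judge has returned per-dimension scores, aggregating them into the headline numbers (average score, fraction of perfect examples) is straightforward. The dimension names in the test data below are placeholders; the source does not enumerate Forge's actual four dimensions.

```python
def summarize_judge_scores(reviews):
    """Aggregate per-example judge scores.

    `reviews` is a list of dicts mapping dimension name -> score (1-5).
    Returns (average score across examples, fraction scoring all 5s).
    """
    per_example = [sum(r.values()) / len(r) for r in reviews]
    avg = sum(per_example) / len(per_example)
    perfect = sum(all(v == 5 for v in r.values()) for r in reviews) / len(reviews)
    return round(avg, 2), perfect
```

Run over the full dataset, numbers like "4.77 average, 86% perfect" fall directly out of this kind of aggregation.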

Step 6: Check for exam leakage

This is a problem that most training data projects ignore entirely. If your training data accidentally overlaps with standard evaluation benchmarks (the tests used to measure AI performance), then your benchmark results become meaningless. It is like giving students the answer key before the exam and then claiming they aced it.

Forge checks for this by comparing the text of every training example against known benchmark datasets using n-gram matching. Any overlap gets flagged.
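The core of an n-gram contamination check fits in a few lines. The window size (8 words) and threshold here are assumptions, not Forge's actual settings.

```python
def ngrams(text, n=8):
    """All word n-grams of a text, as a set."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(example_text, benchmark_texts, n=8, threshold=1):
    """Flag an example if it shares at least `threshold` word n-grams
    with any benchmark document. n=8 is an assumed window size."""
    ex_grams = ngrams(example_text, n)
    for bench in benchmark_texts:
        if len(ex_grams & ngrams(bench, n)) >= threshold:
            return True
    return False
```

Long shared word sequences are very unlikely to occur by coincidence, which is why n-gram overlap is a standard contamination signal.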

Step 7: Tag difficulty levels

Each example gets classified as easy, medium, or hard based on characteristics like output length, vocabulary complexity, and reasoning depth. This allows you to balance the training set by difficulty or use curriculum-based training strategies where the model starts with easier examples and progresses to harder ones.
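A toy version of such a classifier, using only output length and word length as proxies, might look like this. The thresholds and scoring formula are illustrative assumptions; a real tagger would also weigh reasoning depth.

```python
def tag_difficulty(example):
    """Heuristic difficulty tag from output length and vocabulary.
    Thresholds are illustrative, not Forge's actual values."""
    words = example["output"].split()
    long_words = sum(len(w) > 8 for w in words)
    # Longer outputs and denser vocabulary push the score up.
    score = len(words) / 50 + long_words / max(len(words), 1) * 2
    if score < 1:
        return "easy"
    if score < 2:
        return "medium"
    return "hard"
```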

Step 8: Split into training and test sets

The data is split into a training set (80%) and a held-out test set (20%). The split is stratified, meaning each task type appears in the same proportion in both sets. The test set is never used for training. It exists only for evaluation.
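A stratified split is simple to implement: group by task type, shuffle each group with a fixed seed, and carve off the test fraction per group. The field name `task_type` and the seed are assumptions for illustration.

```python
import random

def stratified_split(examples, test_fraction=0.2, seed=42):
    """80/20 split with every task type represented proportionally
    in both halves. Deterministic given the same seed."""
    by_type = {}
    for ex in examples:
        by_type.setdefault(ex["task_type"], []).append(ex)
    rng = random.Random(seed)
    train, test = [], []
    for group in by_type.values():
        group = group[:]
        rng.shuffle(group)
        cut = max(1, round(len(group) * test_fraction))
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test
```

Fixing the seed makes the split reproducible, which matters when comparing runs later.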

Step 9: Train a model

Forge fine-tunes a small language model (Qwen 2.5 with 500 million parameters) locally on your machine using LoRA, a technique that trains only a tiny fraction of the model's parameters (0.59% in this case) while keeping the rest frozen. This runs entirely on Apple Silicon hardware with no cloud GPU needed. In the demo run, training took 5 minutes and used 9 GB of memory.
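The arithmetic behind "only a tiny fraction of parameters" is worth seeing. For each frozen weight matrix of shape (d_out, d_in), a rank-r LoRA adapter trains two small matrices totalling r × (d_out + d_in) parameters. The layer shapes and rank below are made-up assumptions, not Qwen 2.5's actual configuration, so the resulting fraction is illustrative only.

```python
def lora_trainable_fraction(layer_shapes, total_params, rank=8):
    """Fraction of parameters a rank-`rank` LoRA setup would train.

    Each adapted weight matrix (d_out, d_in) stays frozen; LoRA adds
    two low-rank matrices with rank * (d_out + d_in) trainable params.
    Shapes and rank here are assumptions for illustration.
    """
    trainable = sum(rank * (d_out + d_in) for d_out, d_in in layer_shapes)
    return trainable / total_params
```

Because the adapters scale with d_out + d_in rather than d_out × d_in, the trainable share stays well under one percent even across dozens of layers.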

Step 10: Prove it worked

The final step runs both the original model and the fine-tuned model on the held-out test set and compares their outputs against the reference answers. Forge computes ROUGE scores (standard metrics that measure overlap between generated text and reference text) and runs a statistical significance test to confirm the improvement is real rather than a product of random chance.
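The simplest member of the ROUGE family, ROUGE-1, compares unigram (single-word) overlap. A minimal sketch, ignoring the stemming and tokenization details of full ROUGE implementations:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 between a generated text and a
    reference text. Simplified: whitespace tokens, no stemming."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection takes the per-word minimum count.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L work the same way over bigrams and longest common subsequences respectively, and the significance test compares per-example scores between the two models.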

The Results

In the demo run:

| What was measured | Before training | After training | Change |
|---|---|---|---|
| ROUGE-1 (word overlap) | 41.8% | 58.6% | +16.8% |
| ROUGE-2 (phrase overlap) | 18.5% | 31.1% | +12.7% |
| ROUGE-L (longest match) | 28.9% | 41.7% | +12.9% |

The significance test returned p = 0.0, meaning the probability of seeing an improvement this large from random chance alone is effectively zero. The training data genuinely made the model better at the target domain.

Why This Matters

Most training data projects end at step 2. They generate data and assume it is good. Forge goes further in several ways that matter for real-world reliability:

Verification over assumption. Every example is scored by an independent reviewer. You know exactly what your data quality looks like before you spend time training on it.

Contamination awareness. Benchmark contamination is a widespread problem in the field. Checking for it is straightforward, but almost nobody does it. Forge does.

Closed loop. The most convincing evidence that training data works is a model that demonstrably improves after training on it. Forge closes this loop automatically, with statistical significance testing to back up the claim.

Reproducibility. Every run produces a timestamped directory with the full pipeline log, configuration, intermediate artifacts, and final results. You can compare runs, inspect individual examples, and trace any output back to its source.


For implementation details, read the technical deep-dive. To see the numbers and sample data, check the demo page.