Training Data Engine That Proves Its Own Worth

Forge generates training data from documents, evaluates it with rubric-based scoring, detects benchmark contamination, fine-tunes a model locally, and measures improvement with statistical significance. End to end, one command.

1. Generate
2. Quality
3. Dedup
4. Judge
5. Contamination
6. Difficulty
7. Split
8. Fine-tune
9. Benchmark

Results from a Real Run

Pipeline executed on 3 technical ML documents (~7,500 words total). Generated 200 training examples across 4 task types, evaluated every one, trained a 0.5B model locally, and benchmarked the result.

200 Examples generated
4.77 Avg judge score (out of 5)
+16.8% ROUGE-1 improvement
0 Duplicates found
p < 0.001 Statistical significance (bootstrap reported 0.0 across 1,000 resamples)

Base Model vs Fine-Tuned

Qwen 2.5 0.5B, before and after LoRA fine-tuning on the generated data. 40 held-out test examples. Significance tested with paired bootstrap (1000 resamples).
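A paired bootstrap over per-example score deltas is a standard way to run this test. The sketch below shows one common formulation (the repo's exact implementation may differ); the toy scores are illustrative, not the real benchmark outputs.

```python
import random

def paired_bootstrap_p(base_scores, tuned_scores, resamples=1000, seed=0):
    """One-sided paired bootstrap: p = fraction of resamples in which the
    fine-tuned model does NOT beat the base model on mean score."""
    rng = random.Random(seed)
    deltas = [t - b for b, t in zip(base_scores, tuned_scores)]
    n = len(deltas)
    losses = 0
    for _ in range(resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            losses += 1
    return losses / resamples

# Toy data: fine-tuned consistently higher, so p comes out at 0.0,
# i.e. below the 1/1000 resolution of 1000 resamples.
base  = [0.40, 0.42, 0.38, 0.45, 0.41] * 8   # 40 "test examples"
tuned = [0.58, 0.60, 0.55, 0.62, 0.59] * 8
print(paired_bootstrap_p(base, tuned))  # → 0.0
```

Because every resample is drawn from the same paired deltas, a reported p of 0.0 really means "below 1/resamples", which is why the headline number reads p < 0.001.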

Metric    Base    Fine-tuned    Improvement
ROUGE-1   0.418   0.586         +16.8%
ROUGE-2   0.185   0.311         +12.7%
ROUGE-L   0.289   0.417         +12.9%

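For reference, ROUGE-1 is unigram-overlap F1 between the reference answer and the model output. A minimal sketch (real implementations add stemming and other normalization):

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))  # ≈ 0.833
```

ROUGE-2 swaps unigrams for bigrams, and ROUGE-L scores the longest common subsequence instead of fixed n-grams.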


Training Dynamics

Loss Curve

Iter   Train loss   Event
1      -            Val loss: 2.345
5      1.922
20     1.245
40     0.783        Checkpoint saved
60     0.350
80     0.346        Checkpoint saved
100    0.176
120    0.144        Val loss: 0.730

Configuration

Model             Qwen 2.5 0.5B Instruct
Method            LoRA (rank 8, 16 layers)
Trainable params  2.9M / 494M (0.59%)
Training time     5 minutes (Apple Silicon)
Peak memory       9.0 GB
Epochs            3
Batch size        4
Learning rate     1e-4
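The trainable fraction follows directly from how LoRA works: a rank-r adapter on a d_in x d_out weight adds only r * (d_in + d_out) parameters. The sketch below sanity-checks the reported 0.59%; the 896x896 projection size is a hypothetical example, not read from the actual model.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r LoRA adapter on a d_in x d_out weight adds matrices
    A (d_in x r) and B (r x d_out): r * (d_in + d_out) trainable params."""
    return rank * (d_in + d_out)

# Reported totals: 2.9M trainable of 494M total parameters.
fraction = 2.9e6 / 494e6
print(f"{fraction:.2%}")          # → 0.59%

# e.g. a hypothetical 896x896 attention projection at rank 8:
print(lora_params(896, 896, 8))   # → 14336
```

This is why fine-tuning a 494M-parameter model fits in 9 GB: only the small adapter matrices receive gradients and optimizer state.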

Data Quality Breakdown

LLM-as-Judge Scores

GPT-4.1-mini scored every example on 4 dimensions: faithfulness, helpfulness, complexity, coherence.
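One natural way to collapse the four rubric dimensions into the single ratings below is a plain average; the dimension names come from this document, but the aggregation method shown is an assumption, not the project's confirmed formula.

```python
def overall_rating(scores: dict) -> float:
    """Average the four rubric dimensions into one 1-5 rating.
    Equal weighting is an assumption for illustration."""
    dims = ("faithfulness", "helpfulness", "complexity", "coherence")
    return sum(scores[d] for d in dims) / len(dims)

# A 4 on any single dimension pulls a perfect example down to 4.75,
# matching the per-example scores shown in the samples below.
print(overall_rating({"faithfulness": 5, "helpfulness": 5,
                      "complexity": 4, "coherence": 5}))  # → 4.75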

Rating           Count   Fraction
5 / 5            173     86%
4 / 5            27      14%
3 / 5 or below   0       0%

Difficulty Distribution

Heuristic calibration based on output length, vocabulary complexity, and reasoning depth.

Level    Count   Fraction
Easy     2       1%
Medium   76      38%
Hard     122     61%
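A difficulty heuristic over those three signals might look like the sketch below. The signals match the ones named above, but the features, weights, and thresholds are illustrative only, not the project's actual rules.

```python
def estimate_difficulty(output: str) -> str:
    """Toy difficulty heuristic combining output length, vocabulary
    complexity, and reasoning markers. Thresholds are illustrative."""
    tokens = output.lower().split()
    length_score = min(len(tokens) / 100, 1.0)                           # longer → harder
    vocab_score = sum(len(t) > 7 for t in tokens) / max(len(tokens), 1)  # long/rare words
    reasoning = any(m in output.lower()
                    for m in ("step 1", "therefore", "chain-of-thought"))
    score = 0.4 * length_score + 0.4 * vocab_score + (0.2 if reasoning else 0.0)
    if score < 0.2:
        return "easy"
    return "medium" if score < 0.5 else "hard"

print(estimate_difficulty("Paris."))  # → easy
```

A skew toward "hard" like the one above is what you would expect when most outputs are long, multi-sentence answers with explicit reasoning steps.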

Sample Generated Examples

Real examples from the pipeline output, unedited.

Input
Read the following passage and generate ONE useful question and its answer. Passage: Common evaluation metrics include: Accuracy (fraction of correct predictions), Precision (fraction of positive predictions that are correct), Recall (fraction of actual positives correctly identified), and F1 Score (harmonic mean of precision and recall).
Output
Question: What are some common evaluation metrics used for classification tasks, and what do they measure? Answer: Common evaluation metrics for classification tasks include Accuracy, which measures the fraction of correct predictions; Precision, which measures the fraction of positive predictions that are correct; Recall, which measures the fraction of actual positives correctly identified; and F1 Score, which is the harmonic mean of precision and recall.
Judge score: 4.75 / 5 | Difficulty: medium
Input
Summarise the following text: The authors introduce FLAIR, a pre-trained vision-language model for universal retinal fundus image understanding. They compiled 38 open-access fundus imaging datasets, encompassing 101 target conditions and 288,307 images.
Output
The text discusses the challenges of applying vision-language models to medical imaging due to domain shifts and complex expert knowledge. To address this, the authors introduce FLAIR, a pre-trained vision-language model designed for comprehensive retinal fundus image analysis. They compiled 38 open-access fundus imaging datasets, encompassing 101 target conditions and 288,307 images. FLAIR incorporates expert domain knowledge through descriptive textual prompts during pre-training and zero-shot evaluation.
Judge score: 4.75 / 5 | Difficulty: medium
Input
Based on the following passage, generate a realistic user instruction and then write the ideal response. Passage: Foundation models are large-scale machine learning models designed to be adaptable across a wide range of downstream tasks.
Output
INSTRUCTION: Can you explain what foundation models are and why they are important in fields like computer vision and natural language processing? RESPONSE: Foundation models are large-scale machine learning models that are designed to be adaptable across a wide range of downstream tasks. Unlike models trained for a specific purpose, foundation models serve as a versatile base that can be fine-tuned or adapted to various applications.
Judge score: 4.75 / 5 | Difficulty: hard
Input
Based on the following passage, create a question that requires multi-step reasoning to answer. Then provide a detailed chain-of-thought answer. Passage: The Chinchilla scaling laws suggest that a model with N parameters should be trained on approximately 20N tokens for optimal compute efficiency.
Output
QUESTION: If a Transformer model has 5 billion parameters, how many training tokens should it be trained on for optimal compute efficiency, and why does this approach differ from previous practices? REASONING: Step 1: Identify the relationship. The passage states a model with N parameters should be trained on approximately 20N tokens. Step 2: Calculate. For N = 5 billion: 20 x 5 billion = 100 billion tokens. Step 3: Compare to previous practice. Earlier approaches trained increasingly large models on fixed datasets, rather than scaling data proportionally. ANSWER: 100 billion tokens, because the Chinchilla laws show model size and training data should scale together.
Judge score: 5.0 / 5 | Difficulty: hard

Pipeline Timing

Step                      Time           Notes
Generate (GPT-4.1-mini)   9 min 31 sec   200 examples, 4 task types, 162 chunks
Quality scoring           0.6 sec        Heuristic flags (no API calls)
Deduplication             0.4 sec        Hash-based, zero duplicates
LLM-as-Judge              17 min 5 sec   200 examples x 4 scoring dimensions
Difficulty calibration    0.6 sec        Heuristic (no API calls)
Train/test split          0.3 sec        80/20 stratified by task type
LoRA fine-tuning          5 min 0 sec    120 iterations, Qwen 2.5 0.5B, Apple Silicon
Benchmark                 2 min 25 sec   40 test examples, base + fine-tuned inference
Total                     ~35 min        Cost: ~$5
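Hash-based deduplication is why that step costs fractions of a second: one hash lookup per example instead of pairwise comparison. A minimal sketch of the idea (the project's normalization rules may differ):

```python
import hashlib

def dedupe(examples):
    """Exact dedup via a hash of the normalized (input, output) pair.
    Normalization here (strip + lowercase) is an assumption."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(
            (ex["input"].strip().lower() + "\x00" +
             ex["output"].strip().lower()).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

data = [
    {"input": "Q1", "output": "A1"},
    {"input": "Q1", "output": "A1"},   # exact duplicate, dropped
    {"input": "Q2", "output": "A2"},
]
print(len(dedupe(data)))  # → 2
```

Note this catches only exact (post-normalization) duplicates; near-duplicates would need fuzzy methods such as MinHash.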

Try It

git clone https://github.com/pugalenthi0928/training-data-factory.git
cd training-data-factory
make install

# Dry run (no API key needed)
make forge

# Full pipeline
export OPENAI_API_KEY=sk-...
make forge-live

Read the overview for a plain-English explanation, or the technical deep-dive for implementation details.