Forge generates training data from documents, evaluates it with rubric-based scoring,
detects benchmark contamination, fine-tunes a model locally, and measures the improvement
with a statistical significance test. End to end, one command.
1. Generate → 2. Quality → 3. Dedup → 4. Judge → 5. Contamination → 6. Difficulty → 7. Split → 8. Fine-tune → 9. Benchmark
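The nine stages above form a strictly sequential pipeline. A minimal sketch of that flow, with every stage as a labelled pass-through (Forge's real stages transform the dataset; these names mirror the diagram, not the actual API):

```python
# Sketch of the nine-stage flow; each stage here is a pass-through stand-in,
# not Forge's real implementation.
def stage(name):
    def run(data):
        print(name)   # real stage would transform `data` here
        return data
    return run

PIPELINE = [stage(s) for s in (
    "generate", "quality", "dedup", "judge", "contamination",
    "difficulty", "split", "fine-tune", "benchmark")]

data = []
for step in PIPELINE:
    data = step(data)
```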
Results from a Real Run
The pipeline was run on 3 technical ML documents (~7,500 words total). It generated
200 training examples across 4 task types, evaluated every one, trained a 0.5B model
locally, and benchmarked the result.
- 200 examples generated
- 4.77 avg judge score (out of 5)
- +16.8% ROUGE-1 improvement
- 0 duplicates found
- p = 0.0 statistical significance
Base Model vs Fine-Tuned
Qwen 2.5 0.5B, before and after LoRA fine-tuning on the generated data.
Evaluated on 40 held-out test examples; significance tested with a paired bootstrap (1,000 resamples).
| Metric  | Base  | Fine-tuned | Improvement |
|---------|-------|------------|-------------|
| ROUGE-1 | 0.418 | 0.586      | +16.8%      |
| ROUGE-2 | 0.185 | 0.311      | +12.7%      |
| ROUGE-L | 0.289 | 0.417      | +12.9%      |
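The paired bootstrap mentioned above can be sketched in a few lines: resample the 40 per-example score pairs with replacement 1,000 times and count how often the fine-tuned model fails to beat the base model. The score arrays below are made-up stand-ins, not the run's actual per-example scores:

```python
# Paired bootstrap significance test on per-example score deltas
# (1,000 resamples, as in the run above). Scores are illustrative.
import random

def paired_bootstrap_p(base, tuned, n_resamples=1000, seed=0):
    """Fraction of resamples where the fine-tuned model does NOT beat base."""
    rng = random.Random(seed)
    n = len(base)
    worse = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(tuned[i] - base[i] for i in idx) / n
        if delta <= 0:
            worse += 1
    return worse / n_resamples

base  = [0.40 + 0.01 * (i % 5) for i in range(40)]  # hypothetical ROUGE-1 scores
tuned = [b + 0.17 for b in base]                    # uniform improvement
print(paired_bootstrap_p(base, tuned))  # 0.0
```

When every resample favors the fine-tuned model, the estimated p-value is exactly 0.0, which is how a reported "p = 0.0" arises: the true p is simply below the 1/1,000 resolution of the bootstrap.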
Training Dynamics
Loss Curve
| Iter | Train Loss | Event           |
|------|------------|-----------------|
| 1    | -          | Val loss: 2.345 |
| 5    | 1.922      |                 |
| 20   | 1.245      |                 |
| 40   | 0.783      | Checkpoint saved |
| 60   | 0.350      |                 |
| 80   | 0.346      | Checkpoint saved |
| 100  | 0.176      |                 |
| 120  | 0.144      | Val loss: 0.730 |
Configuration
| Setting          | Value                     |
|------------------|---------------------------|
| Model            | Qwen 2.5 0.5B Instruct    |
| Method           | LoRA (rank 8, 16 layers)  |
| Trainable params | 2.9M / 494M (0.59%)       |
| Training time    | 5 minutes (Apple Silicon) |
| Peak memory      | 9.0 GB                    |
| Epochs           | 3                         |
| Batch size       | 4                         |
| Learning rate    | 1e-4                      |
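The trainable-parameter fraction in the table is easy to sanity-check: a rank-r LoRA adapter on a d_out × d_in weight matrix adds r · (d_in + d_out) parameters, and the totals come straight from the config above. The example dimensions passed to `lora_params` are hypothetical, not Qwen's actual projection sizes:

```python
# Sanity-check the trainable fraction from the config table.
def lora_params(r, d_in, d_out):
    # A rank-r adapter factors a d_out x d_in update as (d_out x r) @ (r x d_in).
    return r * (d_in + d_out)

trainable, total = 2.9e6, 494e6          # from the config table
print(f"{100 * trainable / total:.2f}%")  # 0.59%

# E.g. rank 8 on a hypothetical 896x896 projection adds 14,336 params:
print(lora_params(8, 896, 896))  # 14336
```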
Data Quality Breakdown
LLM-as-Judge Scores
GPT-4.1-mini scored every example on 4 dimensions:
faithfulness, helpfulness, complexity, coherence.
| Rating       | Count | Fraction |
|--------------|-------|----------|
| 5 / 5        | 173   | 86%      |
| 4 / 5        | 27    | 14%      |
| 3 / 5 or below | 0   | 0%       |
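The fractional per-example scores reported later (e.g. 4.75 / 5) are consistent with averaging the four 1–5 dimension ratings; whether Forge averages exactly this way is an assumption, but the arithmetic is illustrative:

```python
# Hypothetical illustration: a judge score of 4.75 as the mean of the four
# dimension ratings named above (averaging is assumed, not confirmed).
from statistics import mean

ratings = {"faithfulness": 5, "helpfulness": 5, "complexity": 4, "coherence": 5}
score = mean(ratings.values())
print(score)  # 4.75
```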
Difficulty Distribution
Heuristic calibration based on output length, vocabulary complexity,
and reasoning depth.
| Level  | Count | Fraction |
|--------|-------|----------|
| Easy   | 2     | 1%       |
| Medium | 76    | 38%      |
| Hard   | 122   | 61%      |
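A heuristic over those three signals can be sketched as a simple vote; the thresholds and signal definitions below are invented for illustration and Forge's real calibration may differ:

```python
# Minimal sketch of heuristic difficulty calibration from output length,
# vocabulary complexity, and reasoning depth. Thresholds are made up.
def difficulty(output: str) -> str:
    words = output.split()
    length_signal = len(words) > 60                    # long answers
    vocab_signal = sum(len(w) > 9 for w in words) > 5  # many long/rare words
    reasoning_signal = "Step" in output or "REASONING" in output
    signals = length_signal + vocab_signal + reasoning_signal
    return ["easy", "medium", "hard"][min(signals, 2)]

print(difficulty("Step 1: ... Step 2: ..."))  # medium (one signal fires)
```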
Sample Generated Examples
Real examples from the pipeline output, unedited.
Input
Read the following passage and generate ONE useful question and its answer.
Passage: Common evaluation metrics include: Accuracy (fraction of correct predictions), Precision (fraction of positive predictions that are correct), Recall (fraction of actual positives correctly identified), and F1 Score (harmonic mean of precision and recall).
Output
Question: What are some common evaluation metrics used for classification tasks, and what do they measure?
Answer: Common evaluation metrics for classification tasks include Accuracy, which measures the fraction of correct predictions; Precision, which measures the fraction of positive predictions that are correct; Recall, which measures the fraction of actual positives correctly identified; and F1 Score, which is the harmonic mean of precision and recall.
Judge score: 4.75 / 5 | Difficulty: medium
Input
Summarise the following text:
The authors introduce FLAIR, a pre-trained vision-language model for universal retinal fundus image understanding. They compiled 38 open-access fundus imaging datasets, encompassing 101 target conditions and 288,307 images.
Output
The text discusses the challenges of applying vision-language models to medical imaging due to domain shifts and complex expert knowledge. To address this, the authors introduce FLAIR, a pre-trained vision-language model designed for comprehensive retinal fundus image analysis. They compiled 38 open-access fundus imaging datasets, encompassing 101 target conditions and 288,307 images. FLAIR incorporates expert domain knowledge through descriptive textual prompts during pre-training and zero-shot evaluation.
Judge score: 4.75 / 5 | Difficulty: medium
Input
Based on the following passage, generate a realistic user instruction and then write the ideal response.
Passage: Foundation models are large-scale machine learning models designed to be adaptable across a wide range of downstream tasks.
Output
INSTRUCTION: Can you explain what foundation models are and why they are important in fields like computer vision and natural language processing?
RESPONSE: Foundation models are large-scale machine learning models that are designed to be adaptable across a wide range of downstream tasks. Unlike models trained for a specific purpose, foundation models serve as a versatile base that can be fine-tuned or adapted to various applications.
Judge score: 4.75 / 5 | Difficulty: hard
Input
Based on the following passage, create a question that requires multi-step reasoning to answer. Then provide a detailed chain-of-thought answer.
Passage: The Chinchilla scaling laws suggest that a model with N parameters should be trained on approximately 20N tokens for optimal compute efficiency.
Output
QUESTION: If a Transformer model has 5 billion parameters, how many training tokens should it be trained on for optimal compute efficiency, and why does this approach differ from previous practices?
REASONING:
Step 1: Identify the relationship. The passage states a model with N parameters should be trained on approximately 20N tokens.
Step 2: Calculate. For N = 5 billion: 20 x 5 billion = 100 billion tokens.
Step 3: Compare to previous practice. Earlier approaches trained increasingly large models on fixed datasets, rather than scaling data proportionally.
ANSWER: 100 billion tokens, because the Chinchilla laws show model size and training data should scale together.
Judge score: 5.0 / 5 | Difficulty: hard
Pipeline Timing
| Step                   | Time         | Notes                                        |
|------------------------|--------------|----------------------------------------------|
| Generate (GPT-4.1-mini) | 9 min 31 sec | 200 examples, 4 task types, 162 chunks      |
| Quality scoring        | 0.6 sec      | Heuristic flags (no API calls)               |
| Deduplication          | 0.4 sec      | Hash-based, zero duplicates                  |
| LLM-as-Judge           | 17 min 5 sec | 200 examples × 4 scoring dimensions          |
| Difficulty calibration | 0.6 sec      | Heuristic (no API calls)                     |
| Train/test split       | 0.3 sec      | 80/20 stratified by task type                |
| LoRA fine-tuning       | 5 min 0 sec  | 120 iterations, Qwen 2.5 0.5B, Apple Silicon |
| Benchmark              | 2 min 25 sec | 40 test examples, base + fine-tuned inference |
| Total                  | ~35 min      | Cost: ~$5                                    |
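Hash-based deduplication of the kind listed in the timing table can run in well under a second because it is a single pass: keep the first example per content hash. The normalization choices below (whitespace collapse, lowercasing) are assumptions about what "hash-based" covers:

```python
# Sketch of hash-based dedup: keep the first example per normalized-text
# SHA-256 hash. Normalization details are assumed, not Forge's exact rules.
import hashlib

def dedup(examples):
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.sha256(" ".join(ex.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

print(len(dedup(["What is LoRA?", "what is  LoRA?", "Define recall."])))  # 2
```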
Try It
```shell
git clone https://github.com/pugalenthi0928/training-data-factory.git
cd training-data-factory
make install

# Dry run (no API key needed)
make forge

# Full pipeline
export OPENAI_API_KEY=sk-...
make forge-live
```