Forge generates training data from documents, evaluates it with rubric-based scoring,
detects benchmark contamination, fine-tunes a model locally, and measures the improvement
with a statistical significance test. End to end, one command.
1. Generate → 2. Quality → 3. Dedup → 4. Judge → 5. Contamination → 6. Difficulty → 7. Split → 8. Fine-tune → 9. Benchmark
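The nine stages above form a strictly sequential pipeline. A minimal sketch of that flow, with every stage as a labelled pass-through (Forge's real stages transform the dataset; these names mirror the diagram, not the actual API):

```python
# Sketch of the nine-stage flow; each stage here is a pass-through stand-in,
# not Forge's real implementation.
def stage(name):
    def run(data):
        print(name)   # real stage would transform `data` here
        return data
    return run

PIPELINE = [stage(s) for s in (
    "generate", "quality", "dedup", "judge", "contamination",
    "difficulty", "split", "fine-tune", "benchmark")]

data = []
for step in PIPELINE:
    data = step(data)
```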
Results from a Real Run
The pipeline was run on 3 technical ML documents (~7,500 words total). It generated
200 training examples across 4 task types, evaluated every one, trained a 0.5B model
locally, and benchmarked the result.
- 200 examples generated
- 4.77 avg judge score (out of 5)
- +16.8% ROUGE-1 improvement
- 0 duplicates found
- p = 0.0 statistical significance
Base Model vs Fine-Tuned
Qwen 2.5 0.5B, before and after LoRA fine-tuning on the generated data.
Evaluated on 40 held-out test examples; significance tested with a paired bootstrap (1,000 resamples).
| Metric  | Base  | Fine-tuned | Improvement |
|---------|-------|------------|-------------|
| ROUGE-1 | 0.418 | 0.586      | +16.8%      |
| ROUGE-2 | 0.185 | 0.311      | +12.7%      |
| ROUGE-L | 0.289 | 0.417      | +12.9%      |
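The paired bootstrap mentioned above can be sketched in a few lines: resample the 40 per-example score pairs with replacement 1,000 times and count how often the fine-tuned model fails to beat the base model. The score arrays below are made-up stand-ins, not the run's actual per-example scores:

```python
# Paired bootstrap significance test on per-example score deltas
# (1,000 resamples, as in the run above). Scores are illustrative.
import random

def paired_bootstrap_p(base, tuned, n_resamples=1000, seed=0):
    """Fraction of resamples where the fine-tuned model does NOT beat base."""
    rng = random.Random(seed)
    n = len(base)
    worse = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(tuned[i] - base[i] for i in idx) / n
        if delta <= 0:
            worse += 1
    return worse / n_resamples

base  = [0.40 + 0.01 * (i % 5) for i in range(40)]  # hypothetical ROUGE-1 scores
tuned = [b + 0.17 for b in base]                    # uniform improvement
print(paired_bootstrap_p(base, tuned))  # 0.0
```

When every resample favors the fine-tuned model, the estimated p-value is exactly 0.0, which is how a reported "p = 0.0" arises: the true p is simply below the 1/1,000 resolution of the bootstrap.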
Training Dynamics
Loss Curve
| Iter | Train Loss | Event           |
|------|------------|-----------------|
| 1    | -          | Val loss: 2.345 |
| 5    | 1.922      |                 |
| 20   | 1.245      |                 |
| 40   | 0.783      | Checkpoint saved |
| 60   | 0.350      |                 |
| 80   | 0.346      | Checkpoint saved |
| 100  | 0.176      |                 |
| 120  | 0.144      | Val loss: 0.730 |
Configuration
| Setting          | Value                     |
|------------------|---------------------------|
| Model            | Qwen 2.5 0.5B Instruct    |
| Method           | LoRA (rank 8, 16 layers)  |
| Trainable params | 2.9M / 494M (0.59%)       |
| Training time    | 5 minutes (Apple Silicon) |
| Peak memory      | 9.0 GB                    |
| Epochs           | 3                         |
| Batch size       | 4                         |
| Learning rate    | 1e-4                      |
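The trainable-parameter fraction in the table is easy to sanity-check: a rank-r LoRA adapter on a d_out × d_in weight matrix adds r · (d_in + d_out) parameters, and the totals come straight from the config above. The example dimensions passed to `lora_params` are hypothetical, not Qwen's actual projection sizes:

```python
# Sanity-check the trainable fraction from the config table.
def lora_params(r, d_in, d_out):
    # A rank-r adapter factors a d_out x d_in update as (d_out x r) @ (r x d_in).
    return r * (d_in + d_out)

trainable, total = 2.9e6, 494e6          # from the config table
print(f"{100 * trainable / total:.2f}%")  # 0.59%

# E.g. rank 8 on a hypothetical 896x896 projection adds 14,336 params:
print(lora_params(8, 896, 896))  # 14336
```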
Data Quality Breakdown
LLM-as-Judge Scores
GPT-4.1-mini scored every example on 4 dimensions:
faithfulness, helpfulness, complexity, coherence.
| Rating       | Count | Fraction |
|--------------|-------|----------|
| 5 / 5        | 173   | 86%      |
| 4 / 5        | 27    | 14%      |
| 3 / 5 or below | 0   | 0%       |
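The fractional per-example scores reported later (e.g. 4.75 / 5) are consistent with averaging the four 1–5 dimension ratings; whether Forge averages exactly this way is an assumption, but the arithmetic is illustrative:

```python
# Hypothetical illustration: a judge score of 4.75 as the mean of the four
# dimension ratings named above (averaging is assumed, not confirmed).
from statistics import mean

ratings = {"faithfulness": 5, "helpfulness": 5, "complexity": 4, "coherence": 5}
score = mean(ratings.values())
print(score)  # 4.75
```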
Difficulty Distribution
Heuristic calibration based on output length, vocabulary complexity,
and reasoning depth.
| Level  | Count | Fraction |
|--------|-------|----------|
| Easy   | 2     | 1%       |
| Medium | 76    | 38%      |
| Hard   | 122   | 61%      |
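A heuristic over those three signals can be sketched as a simple vote; the thresholds and signal definitions below are invented for illustration and Forge's real calibration may differ:

```python
# Minimal sketch of heuristic difficulty calibration from output length,
# vocabulary complexity, and reasoning depth. Thresholds are made up.
def difficulty(output: str) -> str:
    words = output.split()
    length_signal = len(words) > 60                    # long answers
    vocab_signal = sum(len(w) > 9 for w in words) > 5  # many long/rare words
    reasoning_signal = "Step" in output or "REASONING" in output
    signals = length_signal + vocab_signal + reasoning_signal
    return ["easy", "medium", "hard"][min(signals, 2)]

print(difficulty("Step 1: ... Step 2: ..."))  # medium (one signal fires)
```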
Sample Generated Examples
Real examples from the pipeline output, unedited.
Input
Read the following passage and generate ONE useful question and its answer.
Passage: Common evaluation metrics include: Accuracy (fraction of correct predictions), Precision (fraction of positive predictions that are correct), Recall (fraction of actual positives correctly identified), and F1 Score (harmonic mean of precision and recall).
Output
Question: What are some common evaluation metrics used for classification tasks, and what do they measure?
Answer: Common evaluation metrics for classification tasks include Accuracy, which measures the fraction of correct predictions; Precision, which measures the fraction of positive predictions that are correct; Recall, which measures the fraction of actual positives correctly identified; and F1 Score, which is the harmonic mean of precision and recall.
Judge score: 4.75 / 5 | Difficulty: medium
Input
Summarise the following text:
The authors introduce FLAIR, a pre-trained vision-language model for universal retinal fundus image understanding. They compiled 38 open-access fundus imaging datasets, encompassing 101 target conditions and 288,307 images.
Output
The text discusses the challenges of applying vision-language models to medical imaging due to domain shifts and complex expert knowledge. To address this, the authors introduce FLAIR, a pre-trained vision-language model designed for comprehensive retinal fundus image analysis. They compiled 38 open-access fundus imaging datasets, encompassing 101 target conditions and 288,307 images. FLAIR incorporates expert domain knowledge through descriptive textual prompts during pre-training and zero-shot evaluation.
Judge score: 4.75 / 5 | Difficulty: medium
Input
Based on the following passage, generate a realistic user instruction and then write the ideal response.
Passage: Foundation models are large-scale machine learning models designed to be adaptable across a wide range of downstream tasks.
Output
INSTRUCTION: Can you explain what foundation models are and why they are important in fields like computer vision and natural language processing?
RESPONSE: Foundation models are large-scale machine learning models that are designed to be adaptable across a wide range of downstream tasks. Unlike models trained for a specific purpose, foundation models serve as a versatile base that can be fine-tuned or adapted to various applications.
Judge score: 4.75 / 5 | Difficulty: hard
Input
Based on the following passage, create a question that requires multi-step reasoning to answer. Then provide a detailed chain-of-thought answer.
Passage: The Chinchilla scaling laws suggest that a model with N parameters should be trained on approximately 20N tokens for optimal compute efficiency.
Output
QUESTION: If a Transformer model has 5 billion parameters, how many training tokens should it be trained on for optimal compute efficiency, and why does this approach differ from previous practices?
REASONING:
Step 1: Identify the relationship. The passage states a model with N parameters should be trained on approximately 20N tokens.
Step 2: Calculate. For N = 5 billion: 20 x 5 billion = 100 billion tokens.
Step 3: Compare to previous practice. Earlier approaches trained increasingly large models on fixed datasets, rather than scaling data proportionally.
ANSWER: 100 billion tokens, because the Chinchilla laws show model size and training data should scale together.
Judge score: 5.0 / 5 | Difficulty: hard
Pipeline Timing
| Step                   | Time         | Notes                                        |
|------------------------|--------------|----------------------------------------------|
| Generate (GPT-4.1-mini) | 9 min 31 sec | 200 examples, 4 task types, 162 chunks      |
| Quality scoring        | 0.6 sec      | Heuristic flags (no API calls)               |
| Deduplication          | 0.4 sec      | Hash-based, zero duplicates                  |
| LLM-as-Judge           | 17 min 5 sec | 200 examples × 4 scoring dimensions          |
| Difficulty calibration | 0.6 sec      | Heuristic (no API calls)                     |
| Train/test split       | 0.3 sec      | 80/20 stratified by task type                |
| LoRA fine-tuning       | 5 min 0 sec  | 120 iterations, Qwen 2.5 0.5B, Apple Silicon |
| Benchmark              | 2 min 25 sec | 40 test examples, base + fine-tuned inference |
| Total                  | ~35 min      | Cost: ~$5                                    |
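Hash-based deduplication of the kind listed in the timing table can run in well under a second because it is a single pass: keep the first example per content hash. The normalization choices below (whitespace collapse, lowercasing) are assumptions about what "hash-based" covers:

```python
# Sketch of hash-based dedup: keep the first example per normalized-text
# SHA-256 hash. Normalization details are assumed, not Forge's exact rules.
import hashlib

def dedup(examples):
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.sha256(" ".join(ex.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

print(len(dedup(["What is LoRA?", "what is  LoRA?", "Define recall."])))  # 2
```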
Try It
```shell
git clone https://github.com/pugalenthi0928/training-data-factory.git
cd training-data-factory
make install

# Dry run (no API key needed)
make forge

# Full pipeline
export OPENAI_API_KEY=sk-...
make forge-live
```