LLM-as-judge scoring results for 16 experimental runs. Each JSON file contains per-question scores from a three-judge panel (openai-gpt-5.4, claude-opus-4-6, llama4-maverick).
See ../input/README.md for the full run mapping, experiment design, and source-folder traceability.

| Run | Factors | Score % | MH % | File |
|---|---|---|---|---|
| 1 | Baseline (no factors) | 60.9% | 68.5% | run-1-baseline.json |
| 2 | Domain Prompt | 71.5% | 63.5% | run-2-domain.json |
| 3 | Agentic | 72.2% | 93.5% | run-3-agentic.json |
| 4 | Citation + Agentic | 93.8% | 91.5% | run-4-cite-agentic.json |
| 5 | Domain + Citation | 71.5% | 54.5% | run-5-domain-cite.json |
| 6 | Citation | 71.1% | 48.5% | run-6-cite.json |
| 7 | Citation + Agentic + Self-Critique | 93.2% | 93.5% | run-7-cite-agentic-selfcritique.json |
| 8 | All 4 factors | 73.0% | 91.2% | run-8-all4.json |
| 9 | Domain + Agentic | 70.4% | 89.8% | run-9-domain-agentic.json |
| 10 | Domain + Citation + Agentic | 76.0% | 90.5% | run-10-domain-cite-agentic.json |
| 11 | Self-Critique | 56.9% | 66.5% | run-11-selfcritique.json |
| 12 | Domain + Self-Critique | 62.0% | 69.0% | run-12-domain-selfcritique.json |
| 13 | Citation + Self-Critique | 65.7% | 60.7% | run-13-cite-selfcritique.json |
| 14 | Domain + Citation + Self-Critique | 67.4% | 69.3% | run-14-domain-cite-selfcritique.json |
| 15 | Agentic + Self-Critique | 70.8% | 88.2% | run-15-agentic-selfcritique.json |
| 16 | Domain + Agentic + Self-Critique | 71.2% | 88.7% | run-16-domain-agentic-selfcritique.json |
Scoring:
- Each question is scored on 5 dimensions, each rated 0 (Miss), 1 (Partial), or 2 (Full).
- Each question also has 4 binary must-have elements, scored pass/fail.
- Score % = total dimension points / 500 max (50 questions × 10 points). MH % = must-have passes / 200 max (50 questions × 4 elements).
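The aggregation above is straightforward to recompute from a run file. The sketch below shows the arithmetic; note that the JSON layout it assumes (a `questions` list with `dimensions` and `must_haves` arrays) is hypothetical and may not match the actual schema of the run files.

```python
# Sketch: recompute Score % and MH % for one run.
# Assumed (hypothetical) schema:
#   {"questions": [{"dimensions": [0-2 x5], "must_haves": [0/1 x4]}, ...]}
# A real run file could be loaded with json.load(open("run-1-baseline.json")).

def summarize(run):
    questions = run["questions"]
    # 5 dimensions per question, each 0-2 -> 10 points max per question.
    score_pts = sum(sum(q["dimensions"]) for q in questions)
    score_max = 10 * len(questions)
    # 4 binary must-have elements per question.
    mh_passes = sum(sum(q["must_haves"]) for q in questions)
    mh_max = 4 * len(questions)
    return 100 * score_pts / score_max, 100 * mh_passes / mh_max

if __name__ == "__main__":
    # Hypothetical two-question run, just to illustrate the arithmetic.
    run = {"questions": [
        {"dimensions": [2, 2, 1, 0, 2], "must_haves": [1, 1, 0, 1]},
        {"dimensions": [1, 2, 2, 2, 1], "must_haves": [1, 1, 1, 1]},
    ]}
    score_pct, mh_pct = summarize(run)
    print(f"Score %: {score_pct:.1f}  MH %: {mh_pct:.1f}")  # 75.0 and 87.5
```

With the 50-question runs reported here, `score_max` works out to 500 and `mh_max` to 200, matching the denominators used in the table.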