This directory contains the input data for the AEO (AI Engine Optimization) benchmark. Generated responses are in ../output/, scoring results in ../scores/.
aeo/
├── input/
│   ├── README.md               ← This file
│   ├── question-bank.md        ← 50 Snowflake developer questions (13 categories)
│   ├── canonical-answers.md    ← Ground-truth answers with must-have scoring elements
│   ├── experiment-prompts.md   ← Full 2^4 factorial design, prompts, and judge template
│   └── summary.md              ← Results summary with factor analysis
├── output/                     ← Generated responses per run (16 files)
└── scores/                     ← LLM-as-judge scoring results per run (16 JSON files)
The experiment uses a 2^4 factorial design with 4 binary factors. Every run uses claude-opus-4-6 as the backbone model:
| Factor | OFF | ON |
|---|---|---|
| Domain Prompt | No system message | Snowflake knowledge primer as system message |
| Citation | Raw question only | “Reference official Snowflake documentation” appended |
| Agentic | Single CORTEX.COMPLETE call | Cortex Code subagent with web search, skills, doc search |
| Self-Critique | Single-turn generation | Two-turn generate-then-revise pattern |
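
As a minimal sketch of what "full 2^4 factorial" means here, the snippet below enumerates every ON/OFF combination of the four factors. The factor identifiers (`domain`, `citation`, `agentic`, `self_critique`) and the `execution_engine` helper are illustrative names, not taken from the experiment code, and the enumeration order does not match the run numbering used in this README.

```python
from itertools import product

# Illustrative identifiers for the four binary factors in the table above.
FACTORS = ("domain", "citation", "agentic", "self_critique")


def factorial_design(factors=FACTORS):
    """Return every ON/OFF combination as a dict of factor name -> bool."""
    return [
        dict(zip(factors, levels))
        for levels in product((False, True), repeat=len(factors))
    ]


def execution_engine(combo):
    """Agentic runs go through Cortex Code, all others through CORTEX.COMPLETE."""
    return "Cortex Code" if combo["agentic"] else "CORTEX.COMPLETE"


design = factorial_design()
print(len(design))  # number of unique factor combinations
for combo in design[:3]:
    print(combo, execution_engine(combo))
```

Half of the combinations have the Agentic factor ON, which is why 8 of the 16 runs execute through Cortex Code.
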
The run numbers below correspond to the files in ../output/ and ../scores/. Runs 1-16 cover the full 2^4 factorial (16 unique factor combinations). Early physical experiment folders used non-contiguous numbering; the Source Folder Mapping section below provides traceability.
| Run | Domain | Citation | Agentic | Self-Critique | Execution | Response File | Score % |
|---|---|---|---|---|---|---|---|
| 1 | | | | | CORTEX.COMPLETE | run-1-baseline-claude.md | 60.9% |
| 2 | x | | | | CORTEX.COMPLETE | run-2-domain-claude.md | 71.5% |
| 3 | | | x | | Cortex Code | run-3-agentic-cortex-code.md | 72.2% |
| 4 | | x | x | | Cortex Code | run-4-cite-agentic-cortex-code.md | 93.8% |
| 5 | x | x | | | CORTEX.COMPLETE | run-5-domain-cite-claude.md | 71.5% |
| 6 | | x | | | CORTEX.COMPLETE | run-6-cite-claude.md | 71.1% |
| 7 | | x | x | x | Cortex Code | run-7-cite-agentic-selfcritique-cortex-code.md | 93.2% |
| 8 | x | x | x | x | Cortex Code | run-8-all4-cortex-code.md | 73.0% |
| 9 | x | | x | | Cortex Code | run-9-domain-agentic-cortex-code.md | 70.4% |
| 10 | x | x | x | | Cortex Code | run-10-domain-cite-agentic-cortex-code.md | 76.0% |
| 11 | | | | x | CORTEX.COMPLETE | run-11-selfcritique-claude.md | 56.9% |
| 12 | x | | | x | CORTEX.COMPLETE | run-12-domain-selfcritique-claude.md | 62.0% |
| 13 | | x | | x | CORTEX.COMPLETE | run-13-cite-selfcritique-claude.md | 65.7% |
| 14 | x | x | | x | CORTEX.COMPLETE | run-14-domain-cite-selfcritique-claude.md | 67.4% |
| 15 | | | x | x | Cortex Code | run-15-agentic-selfcritique-cortex-code.md | 70.8% |
| 16 | x | | x | x | Cortex Code | run-16-domain-agentic-selfcritique-cortex-code.md | 71.2% |
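
One common way to read a factorial run matrix like this is to compute a main effect per factor: the mean score with the factor ON minus the mean score with it OFF. The sketch below does that with the scores listed above; the factor settings are inferred from the response file names. It is a generic calculation and not necessarily the method behind the factor analysis in summary.md.

```python
# (domain, citation, agentic, self_critique, score %) for runs 1-16,
# transcribed from the run matrix; factor flags follow the file names.
RUNS = [
    (0, 0, 0, 0, 60.9), (1, 0, 0, 0, 71.5), (0, 0, 1, 0, 72.2), (0, 1, 1, 0, 93.8),
    (1, 1, 0, 0, 71.5), (0, 1, 0, 0, 71.1), (0, 1, 1, 1, 93.2), (1, 1, 1, 1, 73.0),
    (1, 0, 1, 0, 70.4), (1, 1, 1, 0, 76.0), (0, 0, 0, 1, 56.9), (1, 0, 0, 1, 62.0),
    (0, 1, 0, 1, 65.7), (1, 1, 0, 1, 67.4), (0, 0, 1, 1, 70.8), (1, 0, 1, 1, 71.2),
]
FACTORS = ("domain", "citation", "agentic", "self_critique")


def main_effects(runs):
    """Mean score with a factor ON minus mean score with it OFF, per factor."""
    effects = {}
    for i, name in enumerate(FACTORS):
        on = [r[-1] for r in runs if r[i] == 1]
        off = [r[-1] for r in runs if r[i] == 0]
        effects[name] = sum(on) / len(on) - sum(off) / len(off)
    return effects


effects = main_effects(RUNS)
for name, delta in sorted(effects.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {delta:+.2f} pts")
```

On these numbers the Agentic and Citation effects come out positive and the Domain Prompt and Self-Critique effects come out negative. Main effects average over the other factors, so they hide interactions such as the gap between run 4 (93.8%) and run 8 (73.0%).
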
Response files (in ../output/) follow the pattern:
run-{N}-{factors}-{engine}.md
- {N} = run number (1-16)
- {factors} = active factors (e.g., baseline, domain-cite, cite-agentic-selfcritique, all4)
- {engine} = claude (CORTEX.COMPLETE) or cortex-code (agentic subagent)

Score files (in ../scores/) follow the pattern:
run-{N}-{factors}.json
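
Both naming patterns can be parsed mechanically. The following is a rough sketch, assuming the factor tokens are exactly `domain`, `cite`, `agentic`, and `selfcritique` (as they appear in the response file names), with `baseline` meaning no active factors and `all4` meaning all four. The function name `parse_run_filename` is hypothetical, and the score file name in the example is constructed from the pattern rather than copied from ../scores/.

```python
import re

FACTOR_TOKENS = ("domain", "cite", "agentic", "selfcritique")

# Response files: run-{N}-{factors}-{engine}.md
RESPONSE_RE = re.compile(
    r"^run-(?P<n>\d+)-(?P<factors>.+)-(?P<engine>claude|cortex-code)\.md$"
)
# Score files: run-{N}-{factors}.json
SCORE_RE = re.compile(r"^run-(?P<n>\d+)-(?P<factors>.+)\.json$")


def parse_run_filename(name):
    """Split a response or score file name into run number, factors, engine."""
    m = RESPONSE_RE.match(name) or SCORE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized file name: {name}")
    label = m.group("factors")
    if label == "baseline":
        active = frozenset()
    elif label == "all4":
        active = frozenset(FACTOR_TOKENS)
    else:
        active = frozenset(label.split("-"))
    unknown = active - set(FACTOR_TOKENS)
    if unknown:
        raise ValueError(f"unknown factor tokens in {name}: {sorted(unknown)}")
    return {
        "run": int(m.group("n")),
        "factors": active,
        "engine": m.groupdict().get("engine"),  # None for score files
    }


print(parse_run_filename("run-7-cite-agentic-selfcritique-cortex-code.md"))
print(parse_run_filename("run-4-cite-agentic.json"))  # name built from the pattern
```

Anchoring the engine alternation at the end of the name keeps the hyphen inside `cortex-code` from being mistaken for a factor separator.
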
Each JSON contains per-question scores from a 3-judge panel (openai-gpt-5.4, claude-opus-4-6, llama4-maverick), with 5 scoring dimensions (0-2 each) and 4 must-have elements per question.
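
The JSON schema is not documented in this README, so the sketch below assumes a hypothetical per-question layout, `{"judges": {judge_name: [d1, d2, d3, d4, d5]}}`, purely to illustrate the constraints stated above (3 judges, 5 dimensions, 0-2 each). The field names, the `panel_mean` helper, and the sample values are all made up for illustration, and must-have elements are left out because their role in the final score is not described here.

```python
JUDGES = ("openai-gpt-5.4", "claude-opus-4-6", "llama4-maverick")
N_DIMENSIONS = 5
MAX_PER_DIMENSION = 2


def validate_question(record):
    """Check one question record against the stated panel and score ranges."""
    judges = record["judges"]  # hypothetical field name
    missing = set(JUDGES) - set(judges)
    if missing:
        raise ValueError(f"missing judges: {sorted(missing)}")
    for judge, dims in judges.items():
        if len(dims) != N_DIMENSIONS:
            raise ValueError(f"{judge}: expected {N_DIMENSIONS} dimensions")
        if any(not (0 <= d <= MAX_PER_DIMENSION) for d in dims):
            raise ValueError(f"{judge}: dimension score outside 0-{MAX_PER_DIMENSION}")


def panel_mean(record):
    """Average the per-judge dimension totals (each out of 10) across the panel."""
    validate_question(record)
    totals = [sum(record["judges"][j]) for j in JUDGES]
    return sum(totals) / len(totals)


# Made-up record, not real benchmark data.
example = {
    "judges": {
        "openai-gpt-5.4": [2, 2, 1, 2, 1],
        "claude-opus-4-6": [2, 1, 1, 2, 2],
        "llama4-maverick": [1, 2, 2, 2, 2],
    }
}
print(round(panel_mean(example), 2))
```
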
Source Folder Mapping: each run number below maps to its original physical experiment folder, for traceability back to the original experiment data:
| Run | Physical Folder |
|---|---|
| 1 | run-3-baseline-8192tok |
| 2 | run-4-augmented-curated-8192tok |
| 3 | run-6-native-cc-opus |
| 4 | run-7-native-cc-opus-cite |
| 5 | run-8-augmented-cite-8192tok |
| 6 | run-9-baseline-cite-8192tok |
| 7 | run-10-native-cc-opus-refine |
| 8 | run-11-native-cc-opus-all4 |
| 9 | run-12-native-cc-opus-prompt-agentic |
| 10 | run-13-native-cc-opus-prompt-cite-agentic |
| 11 | run-14-selfcritique-only |
| 12 | run-15-domain-selfcritique |
| 13 | run-16-cite-selfcritique |
| 14 | run-17-domain-cite-selfcritique |
| 15 | run-18-agentic-selfcritique |
| 16 | run-19-domain-agentic-selfcritique |