aeo-methodology

AEO Benchmark Input Data

This directory contains the input data for the AEO (AI Engine Optimization) benchmark. Generated responses are stored in ../output/ and scoring results in ../scores/.

Directory Structure

aeo/
├── input/
│   ├── README.md                 ← This file
│   ├── question-bank.md          ← 50 Snowflake developer questions (13 categories)
│   ├── canonical-answers.md      ← Ground-truth answers with must-have scoring elements
│   ├── experiment-prompts.md     ← Full 2^4 factorial design, prompts, and judge template
│   └── summary.md                ← Results summary with factor analysis
├── output/                       ← Generated responses per run (16 files)
└── scores/                       ← LLM-as-judge scoring results per run (16 JSON files)

Experiment Design

The experiment is a full 2^4 factorial design: four binary factors, each either OFF or ON, with claude-opus-4-6 as the backbone model in every run.

| Factor | OFF | ON |
|---|---|---|
| Domain Prompt | No system message | Snowflake knowledge primer as system message |
| Citation | Raw question only | “Reference official Snowflake documentation” appended |
| Agentic | Single CORTEX.COMPLETE call | Cortex Code subagent with web search, skills, doc search |
| Self-Critique | Single-turn generation | Two-turn generate-then-revise pattern |
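As a minimal sketch of what "full 2^4 factorial" means here, the snippet below enumerates every ON/OFF combination of the four factors. The factor keys and the helper names (`factorial_design`, `execution_path`) are illustrative and not part of the benchmark code; the enumeration order is arbitrary and does not correspond to the run numbers used in this README.

```python
from itertools import product

# Factor keys chosen for this sketch, in the order of the table above.
FACTORS = ("domain", "citation", "agentic", "selfcritique")


def factorial_design(factors=FACTORS):
    """Return every ON/OFF combination of the given binary factors."""
    return [
        dict(zip(factors, levels))
        for levels in product((False, True), repeat=len(factors))
    ]


def execution_path(config):
    """Agentic runs use the Cortex Code subagent; all others a single CORTEX.COMPLETE call."""
    return "Cortex Code" if config["agentic"] else "CORTEX.COMPLETE"


design = factorial_design()

if __name__ == "__main__":
    for config in design:
        on = [name for name, enabled in config.items() if enabled] or ["baseline"]
        print("-".join(on), "via", execution_path(config))
```

Four binary factors yield 16 configurations, half of which run through Cortex Code because the Agentic factor is ON in exactly eight of them.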

Run Mapping

The run numbers below correspond to the files in ../output/ and ../scores/. Runs 1-16 cover the full 2^4 factorial (16 unique factor combinations). An x marks a factor that is ON for that run. Early physical experiment folders used non-contiguous numbering; the Source Folder Mapping section below provides traceability.

| Run | Domain | Citation | Agentic | Self-Critique | Execution | Response File | Score % |
|---|---|---|---|---|---|---|---|
| 1 |  |  |  |  | CORTEX.COMPLETE | run-1-baseline-claude.md | 60.9% |
| 2 | x |  |  |  | CORTEX.COMPLETE | run-2-domain-claude.md | 71.5% |
| 3 |  |  | x |  | Cortex Code | run-3-agentic-cortex-code.md | 72.2% |
| 4 |  | x | x |  | Cortex Code | run-4-cite-agentic-cortex-code.md | 93.8% |
| 5 | x | x |  |  | CORTEX.COMPLETE | run-5-domain-cite-claude.md | 71.5% |
| 6 |  | x |  |  | CORTEX.COMPLETE | run-6-cite-claude.md | 71.1% |
| 7 |  | x | x | x | Cortex Code | run-7-cite-agentic-selfcritique-cortex-code.md | 93.2% |
| 8 | x | x | x | x | Cortex Code | run-8-all4-cortex-code.md | 73.0% |
| 9 | x |  | x |  | Cortex Code | run-9-domain-agentic-cortex-code.md | 70.4% |
| 10 | x | x | x |  | Cortex Code | run-10-domain-cite-agentic-cortex-code.md | 76.0% |
| 11 |  |  |  | x | CORTEX.COMPLETE | run-11-selfcritique-claude.md | 56.9% |
| 12 | x |  |  | x | CORTEX.COMPLETE | run-12-domain-selfcritique-claude.md | 62.0% |
| 13 |  | x |  | x | CORTEX.COMPLETE | run-13-cite-selfcritique-claude.md | 65.7% |
| 14 | x | x |  | x | CORTEX.COMPLETE | run-14-domain-cite-selfcritique-claude.md | 67.4% |
| 15 |  |  | x | x | Cortex Code | run-15-agentic-selfcritique-cortex-code.md | 70.8% |
| 16 | x |  | x | x | Cortex Code | run-16-domain-agentic-selfcritique-cortex-code.md | 71.2% |
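The scores in the run mapping can be summarized per factor with a simple main-effect estimate: the mean score over the eight runs where a factor is ON minus the mean over the eight runs where it is OFF. The sketch below is illustrative only; the `RUNS` dictionary is transcribed by hand from the table, `main_effect` is a name chosen for this example, and the method is not necessarily the factor analysis used in summary.md. With one run per configuration, the result is descriptive and carries no significance estimate.

```python
from statistics import mean

FACTORS = ("domain", "citation", "agentic", "selfcritique")

# run: (domain, citation, agentic, selfcritique, score %), transcribed from the table.
RUNS = {
    1: (0, 0, 0, 0, 60.9),
    2: (1, 0, 0, 0, 71.5),
    3: (0, 0, 1, 0, 72.2),
    4: (0, 1, 1, 0, 93.8),
    5: (1, 1, 0, 0, 71.5),
    6: (0, 1, 0, 0, 71.1),
    7: (0, 1, 1, 1, 93.2),
    8: (1, 1, 1, 1, 73.0),
    9: (1, 0, 1, 0, 70.4),
    10: (1, 1, 1, 0, 76.0),
    11: (0, 0, 0, 1, 56.9),
    12: (1, 0, 0, 1, 62.0),
    13: (0, 1, 0, 1, 65.7),
    14: (1, 1, 0, 1, 67.4),
    15: (0, 0, 1, 1, 70.8),
    16: (1, 0, 1, 1, 71.2),
}


def main_effect(factor):
    """Mean score with the factor ON minus mean score with it OFF, in percentage points."""
    i = FACTORS.index(factor)
    on = [row[-1] for row in RUNS.values() if row[i]]
    off = [row[-1] for row in RUNS.values() if not row[i]]
    return mean(on) - mean(off)


if __name__ == "__main__":
    for factor in FACTORS:
        print(f"{factor:>12}: {main_effect(factor):+.2f} pp")
```

A main effect averages over all settings of the other three factors, so it can hide interactions such as the gap between runs 4 and 10, which differ only in the Domain factor.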

File Naming Convention

Response files (in ../output/) follow the pattern:

run-{N}-{factors}-{engine}.md

Score files (in ../scores/) follow the pattern:

run-{N}-{factors}.json
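The two patterns can be related programmatically. The sketch below parses a response file name and derives the matching score file path. It assumes the only engine suffixes are `claude` and `cortex-code` (the two that appear in the run mapping) and that the `{factors}` segment is identical in both names; the function names are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Engine suffixes seen in the Run Mapping table; assumed here to be the complete set.
ENGINES = ("claude", "cortex-code")

RESPONSE_RE = re.compile(
    r"^run-(?P<n>\d+)-(?P<factors>.+)-(?P<engine>" + "|".join(ENGINES) + r")\.md$"
)


def parse_response_name(filename):
    """Split 'run-{N}-{factors}-{engine}.md' into (N, factors, engine)."""
    match = RESPONSE_RE.match(filename)
    if match is None:
        raise ValueError(f"not a response file name: {filename!r}")
    return int(match["n"]), match["factors"], match["engine"]


def score_path_for(filename):
    """Build the score file path 'run-{N}-{factors}.json' under ../scores/."""
    n, factors, _engine = parse_response_name(filename)
    return PurePosixPath("../scores") / f"run-{n}-{factors}.json"


if __name__ == "__main__":
    print(parse_response_name("run-4-cite-agentic-cortex-code.md"))
    print(score_path_for("run-4-cite-agentic-cortex-code.md"))
```

Matching the engine against an explicit list avoids ambiguity, since both the factor segment and the `cortex-code` suffix contain hyphens.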

Each JSON file contains per-question scores from a 3-judge panel (openai-gpt-5.4, claude-opus-4-6, llama4-maverick). Every question is scored on 5 dimensions (0-2 each) and checked against 4 must-have elements.
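To make the scoring scale concrete: five dimensions at 0-2 each give a maximum of 10 points per judge per question. The sketch below shows one plausible way to turn a panel's dimension scores into a percentage. This README does not specify the JSON schema or how the Score % column is derived, so the input layout (judge name mapped to a list of five scores), the equal weighting of judges, and the omission of must-have elements are all assumptions, and the example scores are invented for illustration.

```python
from statistics import mean

JUDGES = ("openai-gpt-5.4", "claude-opus-4-6", "llama4-maverick")
N_DIMENSIONS = 5
MAX_PER_DIMENSION = 2


def question_percent(judge_scores):
    """Average the judges' normalized totals for one question, as a percentage.

    judge_scores maps each judge name to a list of five 0-2 dimension scores
    (a hypothetical layout, not the actual score-file schema).
    """
    per_judge = []
    for judge in JUDGES:
        dims = judge_scores[judge]
        if len(dims) != N_DIMENSIONS:
            raise ValueError(f"{judge}: expected {N_DIMENSIONS} dimension scores")
        if any(not 0 <= d <= MAX_PER_DIMENSION for d in dims):
            raise ValueError(f"{judge}: scores must lie in 0-{MAX_PER_DIMENSION}")
        per_judge.append(sum(dims) / (N_DIMENSIONS * MAX_PER_DIMENSION))
    return 100 * mean(per_judge)


# Invented scores for a single question, for illustration only.
example = {
    "openai-gpt-5.4": [2, 2, 1, 2, 1],
    "claude-opus-4-6": [2, 1, 1, 2, 2],
    "llama4-maverick": [1, 2, 2, 2, 2],
}

if __name__ == "__main__":
    print(f"{question_percent(example):.1f}%")
```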

Source Folder Mapping

For traceability, each run number maps back to its original experiment folder as follows:

| Run | Physical Folder |
|---|---|
| 1 | run-3-baseline-8192tok |
| 2 | run-4-augmented-curated-8192tok |
| 3 | run-6-native-cc-opus |
| 4 | run-7-native-cc-opus-cite |
| 5 | run-8-augmented-cite-8192tok |
| 6 | run-9-baseline-cite-8192tok |
| 7 | run-10-native-cc-opus-refine |
| 8 | run-11-native-cc-opus-all4 |
| 9 | run-12-native-cc-opus-prompt-agentic |
| 10 | run-13-native-cc-opus-prompt-cite-agentic |
| 11 | run-14-selfcritique-only |
| 12 | run-15-domain-selfcritique |
| 13 | run-16-cite-selfcritique |
| 14 | run-17-domain-cite-selfcritique |
| 15 | run-18-agentic-selfcritique |
| 16 | run-19-domain-agentic-selfcritique |