aeo-methodology

Results Overview

Executive summary of the AEO 2^4 factorial experiment.

Rank	Run	Configuration	Engine	Score %	MH %
1	4	cite, agentic	cortex-code	93.8%	91.5%
2	7	cite, agentic, selfcritique	cortex-code	93.2%	93.5%
3	10	domain, cite, agentic	cortex-code	76.0%	90.5%
4	8	domain, cite, agentic, selfcritique	cortex-code	73.0%	91.2%
5	3	agentic	cortex-code	72.2%	93.5%
6	2	domain	claude	71.5%	63.5%
7	5	domain, cite	claude	71.5%	54.5%
8	16	domain, agentic, selfcritique	cortex-code	71.2%	88.7%
9	6	cite	claude	71.1%	48.5%
10	15	agentic, selfcritique	cortex-code	70.8%	88.2%
11	9	domain, agentic	cortex-code	70.4%	89.8%
12	14	domain, cite, selfcritique	claude	67.4%	69.3%
13	13	cite, selfcritique	claude	65.7%	60.7%
14	12	domain, selfcritique	claude	62.0%	69.0%
15	1	(baseline)	claude	60.9%	68.5%
16	11	selfcritique	claude	56.9%	66.5%

Best run (Run 4, 93.8% overall):

Category	Score %
Snowpark Container Services	100.0%
Streams, Tasks, Snowpipe	100.0%
Snowpark	99.3%
Native Apps Framework	98.4%
Architecture and Fundamentals	97.8%
Dynamic Tables	96.7%
Apache Iceberg Tables	96.7%
Governance and Security	96.6%
Cortex Search (RAG)	94.1%
Cortex AI Functions	91.3%
Snowflake ML	90.0%
Streamlit in Snowflake	88.3%
Cortex Agents	68.9%

Worst run (Run 11, 56.9% overall):

Category	Score %
Streams, Tasks, Snowpipe	70.0%
Snowpark Container Services	68.3%
Cortex Search (RAG)	62.5%
Architecture and Fundamentals	62.2%
Dynamic Tables	62.0%
Governance and Security	61.7%
Snowpark	58.7%
Streamlit in Snowflake	57.9%
Apache Iceberg Tables	55.3%
Native Apps Framework	55.0%
Cortex Agents	47.2%
Snowflake ML	46.0%
Cortex AI Functions	42.0%

5 Hardest Questions:

Q#	Category	Test Type	Avg Score	Avg %
11	Cortex Agents	Implement	4.7/10	46.6%
12	Cortex Agents	Implement	5.0/10	49.8%
35	Snowflake ML	Implement	5.9/10	58.5%
5	Cortex AI Functions	Debug	6.0/10	60.0%
21	Snowpark	Compare	6.0/10	60.2%

5 Easiest Questions:

Q#	Category	Test Type	Avg Score	Avg %
45	Streams, Tasks, Snowpipe	Compare	8.1/10	81.3%
8	Cortex Search (RAG)	Implement	8.2/10	81.7%
17	Dynamic Tables	Explain	8.4/10	83.5%
18	Snowpark	Explain	8.4/10	83.6%
6	Cortex Search (RAG)	Explain	8.6/10	85.6%

Agentic + Citation (no domain prompt) is the optimal configuration (Run 4: 93.8% score, 91.5% MH).
Agentic tools are the dominant factor, improving both score (+12.5pp) and must-have compliance (+19.2pp).
Citation is the second most impactful factor (+8.7pp score), especially when combined with agentic tools.
Domain prompt helps non-agentic runs but harms agentic ones (-2.0pp average, but -17 to -20pp when agentic+citation are present).
Self-critique is counterproductive (-4.6pp score average), degrading quality in 7 of 8 paired comparisons.

See results-by-factor.md for detailed factor analysis.