aeo-methodology

Results by Test Type

Score % for each test type (Explain, Implement, Debug, Compare) across all 16 runs.

Score % by Test Type

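| Test Type | Qs | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10 | R11 | R12 | R13 | R14 | R15 | R16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Explain | 12 | 59.1% | 67.5% | 72.8% | 97.5% | 72.5% | 78.1% | 96.4% | 74.5% | 72.8% | 78.1% | 61.5% | 66.1% | 68.9% | 78.2% | 72.8% | 73.9% |
| Implement | 23 | 57.8% | 78.5% | 71.2% | 91.3% | 76.1% | 69.9% | 92.1% | 68.7% | 68.7% | 75.3% | 49.7% | 58.8% | 63.8% | 64.1% | 68.7% | 69.1% |
| Debug | 5 | 58.0% | 76.6% | 66.7% | 89.3% | 74.0% | 53.4% | 82.1% | 79.3% | 65.3% | 73.3% | 68.0% | 58.4% | 65.0% | 57.0% | 67.7% | 67.7% |
| Compare | 10 | 71.7% | 57.7% | 76.4% | 97.3% | 58.3% | 74.7% | 97.7% | 78.0% | 74.0% | 76.3% | 62.5% | 66.0% | 66.7% | 67.0% | 75.0% | 74.3% |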
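
The per-type peaks in the Score % table can be checked with a short script. The sketch below finds the highest-scoring run for each test type; it shows that Run 7 edges out Run 4 on Implement (92.1%) and Compare (97.7%), while Run 4 leads on Explain and Debug. The `weighted_mean` helper is hypothetical: it assumes that "Qs" is the number of questions per type and that an overall score would be a question-weighted mean, neither of which the source states.

```python
# Score % per run (R1..R16), copied from the Score % by Test Type table.
SCORES = {
    "Explain":   [59.1, 67.5, 72.8, 97.5, 72.5, 78.1, 96.4, 74.5,
                  72.8, 78.1, 61.5, 66.1, 68.9, 78.2, 72.8, 73.9],
    "Implement": [57.8, 78.5, 71.2, 91.3, 76.1, 69.9, 92.1, 68.7,
                  68.7, 75.3, 49.7, 58.8, 63.8, 64.1, 68.7, 69.1],
    "Debug":     [58.0, 76.6, 66.7, 89.3, 74.0, 53.4, 82.1, 79.3,
                  65.3, 73.3, 68.0, 58.4, 65.0, 57.0, 67.7, 67.7],
    "Compare":   [71.7, 57.7, 76.4, 97.3, 58.3, 74.7, 97.7, 78.0,
                  74.0, 76.3, 62.5, 66.0, 66.7, 67.0, 75.0, 74.3],
}

# ASSUMPTION: "Qs" is the number of questions of each test type.
QUESTIONS = {"Explain": 12, "Implement": 23, "Debug": 5, "Compare": 10}


def best_run(values):
    """Return (run number, score) for the highest score; runs are 1-indexed."""
    index = max(range(len(values)), key=lambda i: values[i])
    return index + 1, values[index]


def weighted_mean(run):
    """Question-weighted mean score for one run (hypothetical aggregate)."""
    total = sum(QUESTIONS.values())
    return sum(QUESTIONS[t] * SCORES[t][run - 1] for t in SCORES) / total


for test_type, values in SCORES.items():
    run, score = best_run(values)
    print(f"{test_type}: best is R{run} at {score}%")

print(f"R4 weighted mean: {weighted_mean(4):.1f}%")
print(f"R7 weighted mean: {weighted_mean(7):.1f}%")
```

Under the weighting assumption, Run 4 averages about 93.8% and Run 7 about 93.3%, which is consistent with Run 4 being treated as the best run overall even though it is not the top run for every test type.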

Test Type Improvement: Baseline (Run 1) vs Best (Run 4)

| Test Type | Qs | Baseline (R1) | Best (R4) | Delta |
| --- | --- | --- | --- | --- |
| Explain | 12 | 59.1% | 97.5% | +38.4pp |
| Implement | 23 | 57.8% | 91.3% | +33.5pp |
| Debug | 5 | 58.0% | 89.3% | +31.3pp |
| Compare | 10 | 71.7% | 97.3% | +25.6pp |
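
The Delta column is an absolute difference in percentage points (pp), not a relative percentage change. A minimal sketch of the arithmetic, using the Run 1 and Run 4 values from the table; the function names `delta_pp` and `relative_change_pct` are illustrative and do not come from the source:

```python
# Score % values copied from the improvement table (Run 1 vs Run 4).
BASELINE_R1 = {"Explain": 59.1, "Implement": 57.8, "Debug": 58.0, "Compare": 71.7}
BEST_R4 = {"Explain": 97.5, "Implement": 91.3, "Debug": 89.3, "Compare": 97.3}


def delta_pp(before, after):
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def relative_change_pct(before, after):
    """Relative change as a percentage of the baseline, for contrast."""
    return round((after - before) / before * 100, 1)


for test_type in BASELINE_R1:
    before, after = BASELINE_R1[test_type], BEST_R4[test_type]
    print(
        f"{test_type}: {delta_pp(before, after):+.1f}pp "
        f"({relative_change_pct(before, after):+.1f}% relative to baseline)"
    )
```

For Explain, for example, the move from 59.1% to 97.5% is +38.4pp, which corresponds to roughly a 65% relative improvement over the baseline. The table reports only the percentage-point figure.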

Must-Have Pass % by Test Type

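| Test Type | Qs | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10 | R11 | R12 | R13 | R14 | R15 | R16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Explain | 12 | 68.8% | 60.4% | 93.8% | 95.8% | 54.2% | 58.3% | 97.9% | 93.1% | 91.0% | 93.8% | 71.5% | 79.9% | 71.5% | 79.9% | 91.7% | 91.6% |
| Implement | 23 | 66.3% | 75.0% | 95.7% | 92.4% | 57.6% | 44.6% | 93.5% | 88.0% | 92.0% | 89.5% | 60.1% | 64.2% | 52.9% | 62.7% | 89.5% | 89.1% |
| Debug | 5 | 50.0% | 60.0% | 75.0% | 70.0% | 55.0% | 25.0% | 75.0% | 90.0% | 71.7% | 76.6% | 68.3% | 46.7% | 43.4% | 53.3% | 61.7% | 71.7% |
| Compare | 10 | 82.5% | 42.5% | 97.5% | 95.0% | 47.5% | 57.5% | 97.5% | 96.6% | 92.5% | 95.8% | 74.1% | 78.3% | 74.1% | 80.0% | 94.1% | 92.5% |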