## Evidence
This section summarizes the headline benchmark framing, operating-point comparisons, and result summaries drawn from the public black-box report.
The primary public benchmark uses 125 black-box cases on gpt-5.4 at temperature 0.0. The benchmark covers infrastructure fact recall, environment separation, configuration lookup, long debug context, and planning constraint tracking.
| Setting | Value |
|---|---|
| Model | gpt-5.4 |
| Temperature | 0.0 |
| Cases | 125 |
| Operating points | low-cost, balanced, high-quality |
The benchmark is presented as a quality-at-budget frontier rather than a single fixed trade-off.
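One way to read a quality-at-budget frontier is as a constrained selection problem: given a prompt-token budget, pick the affordable approach with the best quality. The sketch below is illustrative only; it reuses the average-prompt-token and hit-rate numbers from the results table, and the selection rule itself is an assumption, not something the report specifies.

```python
# Illustrative operating-point selection on a quality-at-budget frontier.
# Numbers are (avg prompt tokens, hit rate) from the public results table;
# the budget-based selection rule is an assumption for illustration.

OPERATING_POINTS = {
    # name: (avg_prompt_tokens, hit_rate)
    "sliding_window": (81.83, 0.247),
    "low_cost": (92.12, 0.785),
    "balanced": (105.74, 0.704),
    "high_quality": (167.96, 0.783),
    "full_history": (171.38, 0.952),
}

def best_under_budget(points, token_budget):
    """Return the highest-quality point whose token cost fits the budget."""
    affordable = {k: v for k, v in points.items() if v[0] <= token_budget}
    if not affordable:
        return None  # no approach fits this budget
    return max(affordable, key=lambda k: affordable[k][1])
```

Under this rule a 100-token budget selects low-cost, while a generous budget falls back to full history, matching the report's framing of a frontier rather than a single fixed trade-off.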
| Approach | Avg Prompt Tokens | Contains Rate | Hit Rate | Exact KV Recall | Exact Match |
|---|---|---|---|---|---|
| Full History | 171.38 | 100.0% | 95.2% | 100.0% | 14.0% |
| Sliding Window | 81.83 | 1.0% | 24.7% | 6.0% | 0.0% |
| Low-Cost | 92.12 | 82.0% | 78.5% | 65.6% | 13.0% |
| Balanced | 105.74 | 72.0% | 70.4% | 57.6% | 36.0% |
| High-Quality | 167.96 | 83.0% | 78.3% | 66.4% | 26.0% |
The public report uses black-box visuals to show operating-point trade-offs and domain-level behavior without disclosing implementation details.
The report highlights three practical takeaways:
- Sliding window is cheap but too lossy for fact-sensitive use.
- Low-cost is the strongest efficiency-oriented GTU setting.
- High-quality is the strongest fact-preservation-oriented GTU setting in the public report.