## Evidence
This section summarizes the headline benchmark framing, operating-point comparisons, and result summaries drawn from the public black-box report.
The primary public benchmark uses 125 black-box cases on gpt-5.4 at temperature 0.0. The benchmark covers infrastructure fact recall, environment separation, configuration lookup, long debug context, and planning constraint tracking.
| Setting | Value |
|---|---|
| Model | gpt-5.4 |
| Temperature | 0.0 |
| Cases | 125 |
| Operating points | low-cost, balanced, high-quality |
The benchmark is presented as a quality-at-budget frontier rather than a single fixed trade-off.
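One way to read a quality-at-budget frontier is as a constrained selection problem: given a prompt-token budget, pick the affordable approach with the best quality. The sketch below is illustrative only; it reuses the average-prompt-token and hit-rate numbers from the results table, and the selection rule itself is an assumption, not something the report specifies.

```python
# Illustrative operating-point selection on a quality-at-budget frontier.
# Numbers are (avg prompt tokens, hit rate) from the public results table;
# the budget-based selection rule is an assumption for illustration.

OPERATING_POINTS = {
    # name: (avg_prompt_tokens, hit_rate)
    "sliding_window": (81.83, 0.247),
    "low_cost": (92.12, 0.785),
    "balanced": (105.74, 0.704),
    "high_quality": (167.96, 0.783),
    "full_history": (171.38, 0.952),
}

def best_under_budget(points, token_budget):
    """Return the highest-quality point whose token cost fits the budget."""
    affordable = {k: v for k, v in points.items() if v[0] <= token_budget}
    if not affordable:
        return None  # no approach fits this budget
    return max(affordable, key=lambda k: affordable[k][1])
```

Under this rule a 100-token budget selects low-cost, while a generous budget falls back to full history, matching the report's framing of a frontier rather than a single fixed trade-off.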
| Approach | Avg Prompt Tokens | Contains Rate | Hit Rate | Exact KV Recall | Exact Match |
|---|---|---|---|---|---|
| Full History | 171.38 | 100.0% | 95.2% | 100.0% | 14.0% |
| Sliding Window | 81.83 | 1.0% | 24.7% | 6.0% | 0.0% |
| Low-Cost | 92.12 | 82.0% | 78.5% | 65.6% | 13.0% |
| Balanced | 105.74 | 72.0% | 70.4% | 57.6% | 36.0% |
| High-Quality | 167.96 | 83.0% | 78.3% | 66.4% | 26.0% |
The public report uses black-box visuals to show operating-point trade-offs and domain-level behavior without disclosing implementation details.
The report highlights three practical takeaways:
- Sliding window is cheap but too lossy for fact-sensitive use.
- Low-cost is the strongest efficiency-oriented GTU setting.
- High-quality is the strongest fact-preservation-oriented GTU setting in the public report.