// Evaluation
This page summarizes the public benchmark surface used to describe GTU: what is tested, how the operating points are compared, and which headline results are safe to cite without discussing internal mechanics.
| Setting | Value |
|---|---|
| Primary benchmark | 125 black-box cases |
| Model | gpt-5.4 |
| Temperature | 0.0 |
| Comparison set | full history, sliding window, GTU operating points |
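As a rough illustration of this setup, the configuration above might be expressed as a plain Python dict. The field names and structure here are assumptions for illustration only, not GTU's actual harness schema.

```python
# Hypothetical benchmark configuration; key names are illustrative,
# not the project's real config format.
BENCHMARK_CONFIG = {
    "cases": 125,            # black-box cases
    "model": "gpt-5.4",
    "temperature": 0.0,      # deterministic decoding for comparability
    "approaches": [
        "full_history",      # quality reference point
        "sliding_window",    # low-cost baseline
        "gtu_low_cost",      # GTU operating points (names assumed)
        "gtu_balanced",
        "gtu_high_quality",
    ],
}
```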
// Benchmark Framing
The public standard is outcome-based: whether GTU preserves the right facts under bounded prompt budgets, topic switches, similar-fact confusion, and very long histories.
Five approaches are compared (a minimal sketch of the two non-GTU baselines follows this list):

- Full History: sends the full interaction history as the quality reference point.
- Sliding Window: sends only the most recent turns as the low-cost baseline.
- Low-Cost: GTU operating point optimized for stronger prompt compression.
- Balanced: GTU operating point tuned for stricter answer conformity.
- High-Quality: GTU operating point tuned for stronger fact preservation.
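The two baselines are simple enough to sketch directly; the GTU operating points themselves depend on internal mechanics and are not reproduced here. This sketch assumes a list-of-turns history and uses whitespace token counts as a stand-in for real tokenization.

```python
def full_history_prompt(turns: list[str]) -> str:
    """Quality reference point: concatenate every turn."""
    return "\n".join(turns)


def sliding_window_prompt(turns: list[str], budget: int = 80) -> str:
    """Low-cost baseline: keep only the most recent turns that fit
    within a rough token budget (whitespace tokens as a stand-in)."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):
        cost = len(turn.split())
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return "\n".join(reversed(kept))
```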
// Headline Table
| Approach | Avg Prompt Tokens | Contains Rate | Hit Rate | Exact KV Recall | Exact Match |
|---|---|---|---|---|---|
| Full History | 171.38 | 100.0% | 95.2% | 100.0% | 14.0% |
| Sliding Window | 81.83 | 1.0% | 24.7% | 6.0% | 0.0% |
| Low-Cost | 92.12 | 82.0% | 78.5% | 65.6% | 13.0% |
| Balanced | 105.74 | 72.0% | 70.4% | 57.6% | 36.0% |
| High-Quality | 167.96 | 83.0% | 78.3% | 66.4% | 26.0% |
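The metric names are not formally defined on this page. Under one plausible reading, Contains Rate checks whether the expected fact appears verbatim in the answer and Exact Match requires the whole answer to match the reference; the helpers below are a hedged sketch of that reading, not the report's scoring code.

```python
def contains_rate(answers: list[str], gold_facts: list[str]) -> float:
    """Share of cases where the expected fact string appears verbatim
    in the model answer (assumed reading of "Contains Rate")."""
    hits = sum(1 for ans, fact in zip(answers, gold_facts) if fact in ans)
    return hits / len(answers)


def exact_match(answers: list[str], gold_answers: list[str]) -> float:
    """Share of cases where the whole answer matches the reference
    after whitespace normalization (assumed reading of "Exact Match")."""
    hits = sum(1 for ans, ref in zip(answers, gold_answers)
               if ans.strip() == ref.strip())
    return hits / len(answers)
```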
// Standout Notes
- Infrastructure recall remained especially strong across GTU operating points in the public report.
- Environment separation was evaluated as a contamination-sensitive task; the highlighted run showed no cross-thread contamination.
- Configuration lookup remained one of GTU's strongest quality-oriented surfaces.
- App debug remains the clearest improvement area and is treated as a known limitation rather than a headline claim.
// Ultra-Long Validation
The ultra-long set asks a narrow but important question: when baseline context becomes operationally extreme, can GTU still compress the final prompt aggressively without visible collapse in factual fidelity?
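One way to frame that check is as a pair of conditions: the compressed prompt must be a small fraction of the baseline context, and a quality metric must stay above an agreed floor. The sketch below is illustrative only; the 0.8 floor, the token counts, and the pass criterion are assumptions, not the published protocol.

```python
def compression_ratio(baseline_tokens: int, prompt_tokens: int) -> float:
    """How aggressively the final prompt was compressed relative to
    the full baseline context (1.0 means no compression)."""
    return prompt_tokens / baseline_tokens


def fidelity_holds(contains_rate: float, floor: float = 0.8) -> bool:
    """Assumed pass criterion: factual fidelity has not visibly
    collapsed if the contains rate stays above the floor."""
    return contains_rate >= floor


# Example: an ultra-long case compressed from 20,000 to 160 tokens
# while keeping a 0.83 contains rate would pass under this framing.
assert compression_ratio(20_000, 160) < 0.01
assert fidelity_holds(0.83)
```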