// Evaluation
This page summarizes the public benchmark surface used to describe GTU: what is tested, how the operating points are compared, and which headline results are safe to cite without discussing internal mechanics.
| Setting | Value |
|---|---|
| Primary benchmark | 125 black-box cases |
| Model | gpt-5.4 |
| Temperature | 0.0 |
| Comparison set | full history, sliding window, GTU operating points |
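As a rough illustration of this setup, the configuration above might be expressed as a plain Python dict. The field names and structure here are assumptions for illustration only, not GTU's actual harness schema.

```python
# Hypothetical benchmark configuration; key names are illustrative,
# not the project's real config format.
BENCHMARK_CONFIG = {
    "cases": 125,            # black-box cases
    "model": "gpt-5.4",
    "temperature": 0.0,      # deterministic decoding for comparability
    "approaches": [
        "full_history",      # quality reference point
        "sliding_window",    # low-cost baseline
        "gtu_low_cost",      # GTU operating points (names assumed)
        "gtu_balanced",
        "gtu_high_quality",
    ],
}
```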
// Benchmark Framing
The public standard is outcome-based: whether GTU preserves the right facts under bounded prompt budgets, topic switches, similar-fact confusion, and very long histories.
Five approaches are compared (a minimal sketch of the two non-GTU baselines follows this list):

- Full History: sends the full interaction history as the quality reference point.
- Sliding Window: sends only the most recent turns as the low-cost baseline.
- Low-Cost: GTU operating point optimized for stronger prompt compression.
- Balanced: GTU operating point tuned for stricter answer conformity.
- High-Quality: GTU operating point tuned for stronger fact preservation.
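The two baselines are simple enough to sketch directly; the GTU operating points themselves depend on internal mechanics and are not reproduced here. This sketch assumes a list-of-turns history and uses whitespace token counts as a stand-in for real tokenization.

```python
def full_history_prompt(turns: list[str]) -> str:
    """Quality reference point: concatenate every turn."""
    return "\n".join(turns)


def sliding_window_prompt(turns: list[str], budget: int = 80) -> str:
    """Low-cost baseline: keep only the most recent turns that fit
    within a rough token budget (whitespace tokens as a stand-in)."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):
        cost = len(turn.split())
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return "\n".join(reversed(kept))
```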
// Headline Table
| Approach | Avg Prompt Tokens | Contains Rate | Hit Rate | Exact KV Recall | Exact Match |
|---|---|---|---|---|---|
| Full History | 171.38 | 100.0% | 95.2% | 100.0% | 14.0% |
| Sliding Window | 81.83 | 1.0% | 24.7% | 6.0% | 0.0% |
| Low-Cost | 92.12 | 82.0% | 78.5% | 65.6% | 13.0% |
| Balanced | 105.74 | 72.0% | 70.4% | 57.6% | 36.0% |
| High-Quality | 167.96 | 83.0% | 78.3% | 66.4% | 26.0% |
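The metric names are not formally defined on this page. Under one plausible reading, Contains Rate checks whether the expected fact appears verbatim in the answer and Exact Match requires the whole answer to match the reference; the helpers below are a hedged sketch of that reading, not the report's scoring code.

```python
def contains_rate(answers: list[str], gold_facts: list[str]) -> float:
    """Share of cases where the expected fact string appears verbatim
    in the model answer (assumed reading of "Contains Rate")."""
    hits = sum(1 for ans, fact in zip(answers, gold_facts) if fact in ans)
    return hits / len(answers)


def exact_match(answers: list[str], gold_answers: list[str]) -> float:
    """Share of cases where the whole answer matches the reference
    after whitespace normalization (assumed reading of "Exact Match")."""
    hits = sum(1 for ans, ref in zip(answers, gold_answers)
               if ans.strip() == ref.strip())
    return hits / len(answers)
```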
// Standout Notes
- Infrastructure recall remained especially strong across GTU operating points in the public report.
- Environment separation was evaluated as a contamination-sensitive task; the highlighted run showed no cross-thread contamination.
- Configuration lookup remained one of GTU's strongest quality-oriented surfaces.
- App debug remains the clearest improvement area and is treated as a known limitation rather than a headline claim.
// Ultra-Long Validation
The ultra-long set asks a narrow but important question: when baseline context becomes operationally extreme, can GTU still compress the final prompt aggressively without visible collapse in factual fidelity?
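One way to frame that check is as a pair of conditions: the compressed prompt must be a small fraction of the baseline context, and a quality metric must stay above an agreed floor. The sketch below is illustrative only; the 0.8 floor, the token counts, and the pass criterion are assumptions, not the published protocol.

```python
def compression_ratio(baseline_tokens: int, prompt_tokens: int) -> float:
    """How aggressively the final prompt was compressed relative to
    the full baseline context (1.0 means no compression)."""
    return prompt_tokens / baseline_tokens


def fidelity_holds(contains_rate: float, floor: float = 0.8) -> bool:
    """Assumed pass criterion: factual fidelity has not visibly
    collapsed if the contains rate stays above the floor."""
    return contains_rate >= floor


# Example: an ultra-long case compressed from 20,000 to 160 tokens
# while keeping a 0.83 contains rate would pass under this framing.
assert compression_ratio(20_000, 160) < 0.01
assert fidelity_holds(0.83)
```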