// Datasets
These are the five public test surfaces used to evaluate GTU across factual retention, confusion control, and long-context stress. The suite is intentionally split into multiple surfaces so readers can distinguish balanced benchmark performance, similar-fact confusion, enterprise-style tasks, and long-context validation.
| Dataset | Cases | What it tests |
|---|---|---|
| stratified125 | 125 | Main balanced black-box benchmark |
| near_neighbor | 32 | Similar-fact confusion and exact recall |
| business_task | 24 | Business-facing task realism |
| long_context_layered | 20 | Stability from 10k to 800k+ baseline tokens |
| ultra_long | 4 | Real-model extreme-length validation |
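For readers who prefer working with the suite programmatically, the table can be mirrored as a simple manifest. The sketch below is purely illustrative: the `Surface` dataclass and `PUBLIC_SUITE` name are hypothetical, and only the dataset names, case counts, and descriptions come from the table above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Surface:
    """One public test surface: its identifier, case count, and focus."""
    name: str
    cases: int
    focus: str

# Hypothetical manifest mirroring the table above; the structure is
# illustrative, the names and counts are from the public description.
PUBLIC_SUITE = [
    Surface("stratified125", 125, "main balanced black-box benchmark"),
    Surface("near_neighbor", 32, "similar-fact confusion and exact recall"),
    Surface("business_task", 24, "business-facing task realism"),
    Surface("long_context_layered", 20, "stability from 10k to 800k+ baseline tokens"),
    Surface("ultra_long", 4, "real-model extreme-length validation"),
]

# Five surfaces, 205 public cases in total.
assert len(PUBLIC_SUITE) == 5
assert sum(s.cases for s in PUBLIC_SUITE) == 205
```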
Publicly described cases follow a common shape, sketched in code after the list below: target facts or constraints are injected early, unrelated history is added, and a fact-sensitive probe is asked at the end. Each case layers:
- early fact injection
- topic switches or noise flooding
- similar-fact distractors where appropriate
- a fact-sensitive end question
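A minimal sketch of that case shape, assuming plain message dicts and a hypothetical `build_case` helper (none of these names come from GTU's implementation):

```python
import random

def build_case(facts, noise_turns, distractors, probe, seed=0):
    """Assemble one synthetic case in the shape described above:
    facts injected early, noise and distractors layered after them,
    and a fact-sensitive probe asked at the very end."""
    rng = random.Random(seed)
    history = []

    # 1. Early fact injection: target facts appear before any noise.
    for fact in facts:
        history.append({"role": "user", "content": fact})

    # 2. Topic switches or noise flooding, with similar-fact
    #    distractors shuffled in where appropriate.
    filler = list(noise_turns) + list(distractors)
    rng.shuffle(filler)
    for turn in filler:
        history.append({"role": "user", "content": turn})

    # 3. Fact-sensitive end question closes the case.
    history.append({"role": "user", "content": probe})
    return history

# Usage: one near_neighbor-style case with a similar-fact distractor.
case = build_case(
    facts=["The rollout code for the beta is QX-17."],
    noise_turns=["Switching topics: summarize last week's standup."],
    distractors=["The rollout code for the alpha was QX-71."],
    probe="What is the rollout code for the beta?",
)
```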
The suite turns GTU claims into something auditable. It gives readers a way to understand both everyday and extreme evaluation surfaces without requiring access to internal implementation detail.