// Datasets
These are the five public test surfaces used to evaluate GTU across factual retention, confusion control, and long-context stress. The suite is intentionally split into multiple surfaces so readers can distinguish balanced benchmark performance, similar-fact confusion, enterprise-style tasks, and long-context validation.
| Dataset | Cases | What it tests |
|---|---|---|
| stratified125 | 125 | Main balanced black-box benchmark |
| near_neighbor | 32 | Similar-fact confusion and exact recall |
| business_task | 24 | Business-facing task realism |
| long_context_layered | 20 | Stability from 10k to 800k+ baseline tokens |
| ultra_long | 4 | Real-model extreme-length validation |
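For readers who prefer working with the suite programmatically, the table can be mirrored as a simple manifest. The sketch below is purely illustrative: the `Surface` dataclass and `PUBLIC_SUITE` name are hypothetical, and only the dataset names, case counts, and descriptions come from the table above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Surface:
    """One public test surface: its identifier, case count, and focus."""
    name: str
    cases: int
    focus: str

# Hypothetical manifest mirroring the table above; the structure is
# illustrative, the names and counts are from the public description.
PUBLIC_SUITE = [
    Surface("stratified125", 125, "main balanced black-box benchmark"),
    Surface("near_neighbor", 32, "similar-fact confusion and exact recall"),
    Surface("business_task", 24, "business-facing task realism"),
    Surface("long_context_layered", 20, "stability from 10k to 800k+ baseline tokens"),
    Surface("ultra_long", 4, "real-model extreme-length validation"),
]

# Five surfaces, 205 public cases in total.
assert len(PUBLIC_SUITE) == 5
assert sum(s.cases for s in PUBLIC_SUITE) == 205
```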
Publicly described cases follow a common shape, sketched in code after the list below: target facts or constraints are injected early, unrelated history is added, and a fact-sensitive probe is asked at the end. Each case layers:
- early fact injection
- topic switches or noise flooding
- similar-fact distractors where appropriate
- a fact-sensitive end question
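A minimal sketch of that case shape, assuming plain message dicts and a hypothetical `build_case` helper (none of these names come from GTU's implementation):

```python
import random

def build_case(facts, noise_turns, distractors, probe, seed=0):
    """Assemble one synthetic case in the shape described above:
    facts injected early, noise and distractors layered after them,
    and a fact-sensitive probe asked at the very end."""
    rng = random.Random(seed)
    history = []

    # 1. Early fact injection: target facts appear before any noise.
    for fact in facts:
        history.append({"role": "user", "content": fact})

    # 2. Topic switches or noise flooding, with similar-fact
    #    distractors shuffled in where appropriate.
    filler = list(noise_turns) + list(distractors)
    rng.shuffle(filler)
    for turn in filler:
        history.append({"role": "user", "content": turn})

    # 3. Fact-sensitive end question closes the case.
    history.append({"role": "user", "content": probe})
    return history

# Usage: one near_neighbor-style case with a similar-fact distractor.
case = build_case(
    facts=["The rollout code for the beta is QX-17."],
    noise_turns=["Switching topics: summarize last week's standup."],
    distractors=["The rollout code for the alpha was QX-71."],
    probe="What is the rollout code for the beta?",
)
```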
The suite turns GTU claims into something auditable. It gives readers a way to understand both everyday and extreme evaluation surfaces without requiring access to internal implementation detail.