Interactive Frontier
Baseline-aware heuristic mode starts from a known dense Transformer and searches local same-quality hardware-fit moves: parallelism, layout, kernels, cache policy, and scheduling. Shape, GQA, KV quantization, and FP8 changes are optional quality-spending moves.
Scales baseline (mem, KV, TBT) to the chosen deployment regime so the bottleneck card surfaces axes the modifier can actually relieve.
Baseline Delta Frontier
Pick any baseline architecture, hardware target, and workload regime, then choose one or more deltas from the transformation library on the left. The result panel on the right evaluates each delta against the baseline and classifies it against a local heuristic reference set built in the browser.
Delta Library
One chip → that delta alone. Multiple chips → deltas stack into a combined evaluation.
Hybrid components: pick at most one per category (Type / Placement / Ratio).
Constraint Perturbation
Architecture Dimension Perturbation
What happens if you perturb each dimension of the optimal architecture? Each row shows the marginal cost/benefit of a single change.
Available for 2T token configurations. Constraint perturbation requires optimizer re-runs and is computed for select configurations.
Tile-Aligned Architecture Lattice
Hardware-aware transformer dimensions for H100, B200, and TPU v5e — every efficient (d_model, d_head, n_heads, FFN_dim) at each precision and TP degree.
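As a rough illustration of what "tile-aligned" can mean here, the sketch below treats a dimension as aligned when it splits evenly across the TP ranks and each shard is a whole multiple of the kernel tile. The tile size of 128 and the helper names `is_tile_aligned` and `aligned_dims` are assumptions for illustration only; real tile sizes depend on the hardware, kernel, and precision.

```python
def is_tile_aligned(dim: int, tile: int = 128, tp: int = 1) -> bool:
    """True if dim shards evenly over tp ranks and each shard is a multiple of tile.

    tile=128 is an assumed value, not a property of any specific chip.
    """
    if dim % tp != 0:
        return False
    return (dim // tp) % tile == 0


def aligned_dims(lo: int, hi: int, tile: int = 128, tp: int = 1) -> list[int]:
    """Enumerate every aligned dimension in the closed range [lo, hi]."""
    step = tile * tp
    first = -(-lo // step) * step  # round lo up to the next multiple of step
    return list(range(first, hi + 1, step))


if __name__ == "__main__":
    print(is_tile_aligned(4096, tile=128, tp=8))
    print(aligned_dims(1000, 2048, tile=128, tp=2))
```

Under these assumptions, raising the TP degree or the tile size widens the step between valid dimensions, which is why the lattice thins out at higher TP.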
Lattice Browser
Config Calculator
GQA Configs
MoE Sizing
State Dims
Cross-Precision
Validation
Find efficient architectures for a target parameter count. The calculator enumerates all lattice-aligned configurations, computes the n_layers needed to hit the target parameter count, and ranks the results by composite efficiency score.

| Rank | d_model | n_heads | d_head | FFN dim | n_kv_heads | n_layers | Params (B) | Tile Util | Wave @2K | Wave @8K | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
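The step of computing n_layers for a target parameter count can be sketched as follows. This is a minimal approximation, assuming a dense transformer layer with GQA attention and a gated three-matrix FFN, and ignoring embeddings, norms, and biases; the page's actual parameter accounting is not shown, so the formula and the function names are assumptions.

```python
def params_per_layer(d_model: int, n_heads: int, d_head: int,
                     n_kv_heads: int, ffn_dim: int) -> int:
    """Approximate weights in one dense transformer layer (assumed layout)."""
    q_and_o = 2 * d_model * n_heads * d_head      # query and output projections
    k_and_v = 2 * d_model * n_kv_heads * d_head   # key and value projections (GQA)
    ffn = 3 * d_model * ffn_dim                   # gate, up, down (assumed gated FFN)
    return q_and_o + k_and_v + ffn


def layers_for_target(target_params: float, **dims: int) -> int:
    """Number of layers whose total is closest to target_params (at least 1)."""
    return max(1, round(target_params / params_per_layer(**dims)))


if __name__ == "__main__":
    cfg = dict(d_model=4096, n_heads=32, d_head=128, n_kv_heads=8, ffn_dim=14336)
    print(params_per_layer(**cfg), layers_for_target(7e9, **cfg))
```

Because n_layers is rounded to an integer, the achieved parameter count generally differs slightly from the target, which is presumably why the table reports Params (B) alongside n_layers.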
Valid (n_heads, n_kv_heads) pairs for each d_model and TP degree, showing which GQA ratios are tile-aligned.
| n_heads | n_kv_heads | GQA Ratio | KV Proj / GPU | Aligned |
|---|---|---|---|---|
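One way to enumerate such pairs is sketched below. The validity rules are assumptions rather than the page's exact criteria: n_heads must be divisible by n_kv_heads, both must shard evenly across TP ranks (no KV head replication), and the per-GPU KV projection width is checked against an assumed tile of 128.

```python
def gqa_pairs(n_heads: int, d_head: int, tp: int, tile: int = 128):
    """Return rows of (n_heads, n_kv_heads, gqa_ratio, kv_proj_per_gpu, aligned)."""
    rows = []
    for n_kv in range(1, n_heads + 1):
        if n_heads % n_kv != 0:
            continue  # query heads must group evenly over KV heads
        if n_heads % tp != 0 or n_kv % tp != 0:
            continue  # assumed: every rank holds a whole number of heads
        kv_per_gpu = (n_kv // tp) * d_head  # width of the K (or V) projection per rank
        rows.append((n_heads, n_kv, n_heads // n_kv, kv_per_gpu, kv_per_gpu % tile == 0))
    return rows


if __name__ == "__main__":
    for row in gqa_pairs(n_heads=32, d_head=64, tp=8):
        print(row)
```

With d_head = 64 and TP = 8, for example, the 4:1 ratio leaves only 64 columns of KV projection per GPU, which fails a 128-wide tile check even though the head counts divide cleanly.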
Tile-aligned expert FFN dimensions for MoE architectures. Each expert's matmul is a separate kernel invocation — per-expert alignment is critical.
| n_experts | Expert FFN dim | Total FFN equiv | Aligned |
|---|---|---|---|
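A hedged sketch of per-expert sizing: split a total FFN width budget evenly across experts and round each expert's width down to the nearest aligned value. Treating "Total FFN equiv" as n_experts times the expert FFN dim is an assumption, as are the tile size and the function name.

```python
def expert_ffn_dim(total_ffn: int, n_experts: int, tile: int = 128, tp: int = 1) -> int:
    """Largest tile-aligned expert width not exceeding an even split of total_ffn."""
    step = tile * tp                 # each TP shard of the expert must be a tile multiple
    raw = total_ffn // n_experts     # even split before alignment
    return (raw // step) * step      # round down to the alignment step


if __name__ == "__main__":
    for n in (4, 8, 16):
        dim = expert_ffn_dim(50000, n)
        print(n, dim, n * dim)       # n_experts, expert FFN dim, total FFN equivalent
```

Rounding down keeps every expert kernel on whole tiles at the cost of a slightly smaller total width than the budget.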
Valid d_state values for state mechanisms (Mamba-2 / structured SSM). Must be tile-aligned for the state update matmul and fit in SRAM.
| d_state | d_head | SRAM / head (bytes) | Aligned |
|---|---|---|---|
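The two conditions above (tile alignment and SRAM fit) can be sketched as a simple filter. The state footprint of d_state × d_head × bytes per element, the tile of 16, and the 64 KiB budget are all placeholder assumptions; the real values depend on the kernel and the target chip.

```python
def state_bytes_per_head(d_state: int, d_head: int, bytes_per_elem: int = 2) -> int:
    """Assumed footprint of one head's recurrent state: a d_state x d_head matrix."""
    return d_state * d_head * bytes_per_elem


def d_state_ok(d_state: int, d_head: int, tile: int = 16,
               sram_budget: int = 64 * 1024, bytes_per_elem: int = 2) -> bool:
    """True if d_state is a tile multiple and the per-head state fits the budget."""
    aligned = d_state % tile == 0
    fits = state_bytes_per_head(d_state, d_head, bytes_per_elem) <= sram_budget
    return aligned and fits


if __name__ == "__main__":
    for d_state in (64, 100, 128, 1024):
        print(d_state, state_bytes_per_head(d_state, 64), d_state_ok(d_state, 64))
```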
For mixed-precision architectures (e.g. BF16 attention + FP8 FFN), dimensions must satisfy tile alignment for both precisions simultaneously. The intersection lattice is sparser than either single-precision lattice.
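The sparsity of the intersection follows from simple arithmetic: a dimension that must be a multiple of two tile sizes must be a multiple of their least common multiple. The sketch below uses made-up tile sizes of 96 and 128 purely to make the effect visible; they are not the actual BF16 or FP8 tiles of any listed hardware.

```python
import math


def joint_step(tile_a: int, tile_b: int) -> int:
    """Smallest spacing that satisfies both tile sizes at once."""
    return math.lcm(tile_a, tile_b)


def joint_lattice(lo: int, hi: int, tile_a: int, tile_b: int) -> list[int]:
    """Dimensions in [lo, hi] aligned for both precisions simultaneously."""
    step = joint_step(tile_a, tile_b)
    first = -(-lo // step) * step  # round lo up to a multiple of step
    return list(range(first, hi + 1, step))


if __name__ == "__main__":
    print(joint_step(96, 128))
    print(joint_lattice(1024, 2048, 96, 128))
```

When one tile size divides the other, the joint lattice equals the coarser of the two; it becomes strictly sparser than both only when neither divides the other.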
Tile alignment check for known production architectures. Verifies whether each dimension satisfies CTA-level alignment.
| Architecture | d_model | d_head | n_heads | FFN dim | n_layers | d_model%K | d_model%N | d_head%K | d_head%N | FFN%N | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
Throughput Model
Hardware-aware architecture throughput estimation — dense transformer with GQA, across H100, B200, TPU v5e, and TPU v5p.
Comparison
Layer Breakdown
Decode Analysis
Validation
Cross-Hardware Comparison
| Hardware | Train tok/s | Prefill ms | Decode tok/s | Memory GB | Bottleneck |
|---|---|---|---|---|---|
Training Throughput by Architecture (H100)
Inference TBT (Time Between Tokens) by Architecture (H100)
Inference TTFT (Time To First Token) by Architecture (H100)
Per-Layer Time Breakdown (Training)
Training Operation Costs
Per-Layer Time Breakdown (Prefill)
Per-Layer Time Breakdown (Decode)
Decode Latency vs KV Cache Length
GQA Impact on Decode
Training Validation (H100, ≤25% error target)
| Architecture | Predicted | Measured | Ratio | Error | Status |
|---|---|---|---|---|---|
Decode Validation (H100, ≤25% error target)
| Architecture | Predicted tok/s | Measured tok/s | Ratio | Error | Status |
|---|---|---|---|---|---|
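The Ratio, Error, and Status columns in the validation tables can plausibly be derived as in the sketch below. The definitions used here (ratio as predicted over measured, error as the absolute deviation of that ratio from 1, and a pass when the error is within the 25% target) are assumptions, since the page does not state them.

```python
def validate(predicted: float, measured: float, tolerance: float = 0.25):
    """Return (ratio, error, status) for one predicted/measured throughput pair."""
    ratio = predicted / measured          # > 1 means the model is optimistic
    error = abs(ratio - 1.0)              # relative error against the measurement
    status = "PASS" if error <= tolerance else "FAIL"
    return ratio, error, status


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the page.
    for pred, meas in [(110.0, 100.0), (60.0, 100.0)]:
        print(pred, meas, validate(pred, meas))
```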