Architecture Compiler v1

Hardware Target

Parameter Target

Training Tokens

Serving Constraint

Feed-Forward Layers

Dense: a single FFN per layer. MoE: routed expert FFNs (more total parameters, same active parameters per token).

Attention vs State

Transformer: all layers use self-attention. Hybrid: some layers are replaced with a state mechanism (Mamba-2, Gated DeltaNet, KDA, GLA, Sliding Window, ...), which reduces the KV cache.

Serving Context

8k dense rows are native optimizer outputs. Longer contexts are scaled projections unless marked otherwise; infeasible HBM/TBT/TTFT points are filtered before display.

Tradeoff Preference

Score = Σ w_axis · max(0, (c − winner)/winner) on the family-champion frontier. "Quality-first" is the legacy behavior.
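As a rough illustration of the score formula above, the sketch below sums a weighted relative regret per axis. It assumes that `c` is the candidate's value on an axis, that `winner` is the best value on that axis among the family champions, and that lower is better on every axis. The axis names and numbers are invented for the example.

```python
def tradeoff_score(candidate, winners, weights):
    """Weighted sum of relative regret per axis.

    Regret is 0 on an axis where the candidate matches or beats the winner.
    """
    score = 0.0
    for axis, weight in weights.items():
        regret = max(0.0, (candidate[axis] - winners[axis]) / winners[axis])
        score += weight * regret
    return score


# Hypothetical axes and values; lower is better on both.
candidate = {"loss": 2.2, "tbt_ms": 30.0}
winners = {"loss": 2.0, "tbt_ms": 20.0}
weights = {"loss": 1.0, "tbt_ms": 0.5}
score = tradeoff_score(candidate, winners, weights)
print(round(score, 4))
```

Under this reading, a "Quality-first" preference would presumably correspond to putting most of the weight on the quality axis.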

Configuration

Intro

Overview

Pareto Frontier

Pareto Modifier

Delta Influence

Cross-Hardware

Tile Lattice

Throughput

X Axis

Y Axis

Color

Size

Interactive Frontier

Baseline-aware heuristic mode starts from a known dense Transformer and searches local hardware-fit moves that preserve quality: parallelism, layout, kernels, cache policy, and scheduling. Shape, GQA, KV quantization, and FP8 changes are optional moves that spend quality.

Baseline

Quality Mode

+0.00% optional loss proxy

Workload Preset

Scales the baseline (memory, KV cache, TBT) to the chosen deployment regime so that the bottleneck card surfaces the axes the modifier can actually relieve.

TP Options

Quality-Spending Knobs

Baseline Delta Frontier

Pick any baseline architecture, hardware target, and workload regime, then choose one or more deltas from the transformation library on the left. The result panel on the right evaluates each delta against the baseline and classifies it against a local heuristic reference set built in the browser.

Baseline

Hardware

Workload Preset

Delta Library

One chip → that delta alone. Multiple chips → deltas stack into a combined evaluation. Hybrid components: pick at most one per category (Type / Placement / Ratio).

Constraint Perturbation

Architecture Dimension Perturbation

What happens if you perturb each dimension of the optimal architecture? Each row shows the marginal cost/benefit of a single change.

Available for 2T-token configurations. Constraint perturbation requires optimizer re-runs and is computed only for select configurations.

Tile-Aligned Architecture Lattice

Hardware-aware transformer dimensions for H100, B200, and TPU v5e — every efficient (d_model, d_head, n_heads, FFN_dim) at each precision and TP degree.

Lattice Browser

Config Calculator

GQA Configs

MoE Sizing

State Dims

Cross-Precision

Validation

Hardware

Precision

TP Degree

d_head

Show only aligned

d_model range

Find efficient architectures for a target parameter count. Enumerates all lattice-aligned configurations, computes n_layers to hit target params, and ranks by composite efficiency score.
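The "computes n_layers to hit target params" step can be sketched as follows. This is a minimal estimate, not the tool's own code: it assumes a dense layer with GQA attention and a gated (three-matrix) FFN, ignores norms and biases, and counts untied input and output embeddings. The function names are hypothetical, and the composite efficiency ranking is not modeled.

```python
def params_per_layer(d_model, n_heads, d_head, n_kv_heads, ffn_dim, gated=True):
    """Weight count for one dense transformer layer (norms and biases ignored)."""
    q_and_o = 2 * d_model * n_heads * d_head        # Q and output projections
    k_and_v = 2 * d_model * n_kv_heads * d_head     # K and V projections (GQA)
    ffn = (3 if gated else 2) * d_model * ffn_dim   # gated FFN has three matrices
    return q_and_o + k_and_v + ffn


def layers_for_target(target_params, vocab_size, d_model, per_layer,
                      min_layers=1, max_layers=200, tied_embeddings=False):
    """Pick n_layers so the total lands closest to the target, within bounds."""
    embed = vocab_size * d_model * (1 if tied_embeddings else 2)
    n_layers = round((target_params - embed) / per_layer)
    n_layers = max(min_layers, min(max_layers, n_layers))
    return n_layers, embed + n_layers * per_layer


# Example inputs chosen for illustration only.
per_layer = params_per_layer(4096, 32, 128, 8, 14336)
n_layers, total = layers_for_target(8.0e9, 128256, 4096, per_layer)
print(n_layers, total / 1e9)
```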

Hardware

Precision

TP Degree

Target Params (B)

d_head

FFN Type

GQA Ratio

Vocab Size

Min Layers

Max Layers

Rank	d_model	n_heads	d_head	FFN dim	n_kv_heads	n_layers	Params (B)	Tile Util	Wave @2K	Wave @8K	Score

Valid (n_heads, n_kv_heads) pairs for each d_model and TP degree, showing which GQA ratios are tile-aligned.
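One way to enumerate such pairs is sketched below. It makes several simplifying assumptions: `n_heads = d_model / d_head`, KV heads are sharded (not replicated) across the TP group, and a pair counts as aligned when the per-GPU KV projection width is a multiple of a tile width `tile_n`. The default `tile_n=128` is a placeholder, not a measured value for any listed hardware.

```python
def gqa_pairs(d_model, d_head, tp, tile_n=128):
    """Return rows of (n_heads, n_kv_heads, gqa_ratio, kv_proj_per_gpu, aligned)."""
    n_heads = d_model // d_head
    rows = []
    for n_kv_heads in range(1, n_heads + 1):
        if n_heads % n_kv_heads:
            continue  # query heads must split evenly across KV heads
        kv_proj_per_gpu = n_kv_heads * d_head // tp
        # Assumed rule: whole KV heads per GPU and a tile-multiple projection width.
        aligned = n_kv_heads % tp == 0 and kv_proj_per_gpu % tile_n == 0
        rows.append((n_heads, n_kv_heads, n_heads // n_kv_heads,
                     kv_proj_per_gpu, aligned))
    return rows


rows = gqa_pairs(d_model=4096, d_head=128, tp=2)
for row in rows:
    print(row)
```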

Hardware

Precision

TP Degree

d_model

d_head

n_heads	n_kv_heads	GQA Ratio	KV Proj / GPU	Aligned

Tile-aligned expert FFN dimensions for MoE architectures. Each expert's matmul is a separate kernel invocation — per-expert alignment is critical.

Hardware

Precision

d_model

n_experts	Expert FFN dim	Total FFN equiv	Aligned

Valid d_state values for state mechanisms (Mamba-2 / structured SSM). Each value must be tile-aligned for the state-update matmul and fit in SRAM.
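The two conditions can be checked with a small filter like the one below. It is a sketch under stated assumptions: the per-head state is taken to be a `d_head × d_state` matrix, so SRAM per head is `d_state * d_head * bytes_per_elem`. The tile size and the SRAM budget are placeholder values, not figures for any specific chip.

```python
def state_dim_rows(d_head, bytes_per_elem, max_d_state,
                   tile=16, sram_budget_bytes=16 * 1024, step=8):
    """Return rows of (d_state, d_head, sram_per_head_bytes, aligned, fits)."""
    rows = []
    for d_state in range(step, max_d_state + 1, step):
        sram_per_head = d_state * d_head * bytes_per_elem
        aligned = d_state % tile == 0              # assumed tile-alignment rule
        fits = sram_per_head <= sram_budget_bytes  # assumed per-head SRAM budget
        rows.append((d_state, d_head, sram_per_head, aligned, fits))
    return rows


rows = state_dim_rows(d_head=64, bytes_per_elem=2, max_d_state=256)
valid = [r[0] for r in rows if r[3] and r[4]]
print(valid)
```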

Hardware

Precision

Max d_state

d_state	d_head	SRAM / head (bytes)	Aligned

For mixed-precision architectures (e.g. BF16 attention + FP8 FFN), dimensions must satisfy tile alignment for both precisions simultaneously. The intersection lattice is sparser than either single-precision lattice.
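If each precision's lattice is simply "multiples of a tile size", the intersection is the set of multiples of the least common multiple of the two tile sizes. The sketch below illustrates that idea. The tile sizes 64 and 128 are hypothetical stand-ins for the two precisions, and the real lattice may impose further constraints.

```python
from math import lcm  # available in Python 3.9+


def aligned_dims(tile, lo, hi):
    """All dimensions in [lo, hi] that are multiples of one tile size."""
    return [d for d in range(lo, hi + 1) if d % tile == 0]


def intersection_lattice(tile_a, tile_b, lo, hi):
    """Dimensions in [lo, hi] aligned for both tile sizes at once."""
    step = lcm(tile_a, tile_b)
    first = -(-lo // step) * step  # round lo up to the next multiple of step
    return list(range(first, hi + 1, step))


tile_a, tile_b = 64, 128  # hypothetical tile sizes for precision A and B
both = intersection_lattice(tile_a, tile_b, 1024, 1536)
print(len(aligned_dims(tile_a, 1024, 1536)), len(both), both)
```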

Hardware

Precision A

Precision B

TP Degree

Tile alignment check for known production architectures. Verifies whether each dimension satisfies CTA-level alignment.
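The remainder columns in this panel's table (d_model%K, d_model%N, d_head%K, d_head%N, FFN%N) suggest a check along the following lines. The tile sizes `tile_k` and `tile_n` are placeholders, the status labels are invented, and the two configurations are illustrative inputs with no claim about which production model they match.

```python
def alignment_report(d_model, d_head, ffn_dim, tile_k=64, tile_n=128):
    """Return per-dimension remainders and an overall status string."""
    remainders = {
        "d_model%K": d_model % tile_k,
        "d_model%N": d_model % tile_n,
        "d_head%K": d_head % tile_k,
        "d_head%N": d_head % tile_n,
        "FFN%N": ffn_dim % tile_n,
    }
    # Any non-zero remainder means at least one tile is partially filled.
    status = "misaligned" if any(remainders.values()) else "aligned"
    return remainders, status


ok_remainders, ok_status = alignment_report(4096, 128, 14336)
bad_remainders, bad_status = alignment_report(2560, 80, 10240)
print(ok_status, bad_status, bad_remainders)
```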

Hardware

Precision

Architecture	d_model	d_head	n_heads	FFN dim	n_layers	d_model%K	d_model%N	d_head%K	d_head%N	FFN%N	Status

Throughput Model

Hardware-aware architecture throughput estimation — dense transformer with GQA, across H100, B200, TPU v5e, and TPU v5p.

Comparison

Layer Breakdown

Decode Analysis

Validation

Architecture

Batch

KV Length

Cross-Hardware Comparison

Hardware	Train tok/s	Prefill ms	Decode tok/s	Memory GB	Bottleneck

Training Throughput by Architecture (H100)

Inference TBT (Time Between Tokens) by Architecture (H100)

Inference TTFT (Time To First Token) by Architecture (H100)

Hardware

Per-Layer Time Breakdown (Training)

Training Operation Costs

Per-Layer Time Breakdown (Prefill)

Per-Layer Time Breakdown (Decode)

Decode Latency vs KV Cache Length

GQA Impact on Decode
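A back-of-envelope view of why decode cost grows with KV length and shrinks with GQA: each decode step has to read the whole KV cache. The sketch below computes the cache size and a bandwidth-only lower bound on the read time. It is not the page's throughput model. The bandwidth figure is a caller-supplied placeholder, and weights, activations, and compute are ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, kv_len, batch, bytes_per_elem=2):
    """Bytes held by keys and values across all layers."""
    return 2 * n_layers * n_kv_heads * d_head * kv_len * batch * bytes_per_elem


def kv_read_time_ms(cache_bytes, hbm_bandwidth_gb_s):
    """Lower bound on the time to stream the cache once from HBM."""
    return cache_bytes / (hbm_bandwidth_gb_s * 1e9) * 1e3


# Illustrative shapes: same model with 8 KV heads (GQA) and 32 KV heads (MHA).
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, kv_len=8192, batch=1)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, kv_len=8192, batch=1)
print(gqa / 2**30, mha / 2**30)
print(kv_read_time_ms(gqa, hbm_bandwidth_gb_s=1000.0))  # placeholder bandwidth
```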

Training Validation (H100, ≤25% error target)

Architecture	Predicted	Measured	Ratio	Error	Status

Decode Validation (H100, ≤25% error target)

Architecture	Predicted tok/s	Measured tok/s	Ratio	Error	Status
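The Ratio, Error, and Status columns of both validation tables could be derived as sketched below. This assumes Ratio is predicted divided by measured and Error is the absolute deviation of that ratio from 1, compared against the 25% target. The page does not state these definitions, and the sample numbers are invented.

```python
def validate(predicted, measured, tolerance=0.25):
    """Return (ratio, error, status) for one predicted/measured pair."""
    ratio = predicted / measured
    error = abs(ratio - 1.0)
    status = "PASS" if error <= tolerance else "FAIL"
    return ratio, error, status


# Invented sample values, in tok/s.
print(validate(1200.0, 1000.0))
print(validate(700.0, 1000.0))
```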