Rough estimates. Decode is modeled as memory-bandwidth-bound and calibrated to measured 3090 numbers (dense 27B≈40, 35B-A3B≈87, 70B≈18 t/s on one card; 120B-A12B MoE≈67 t/s resident on a 4×3090 box). MoE decode tracks ACTIVE params, so a big MoE that fits in VRAM runs far faster than its total size suggests. Multi-GPU decode ≈ per-card bandwidth (pipeline-parallel; tensor-parallel without NVLink is comms-bound). Spill to system RAM drops you toward system-RAM bandwidth. Prefill scales with FP16 TFLOPS. KV-cache size = weights' sibling cost: it grows linearly with context and shrinks with KV quant (f16→q8_0 ≈ half, q4_0 ≈ quarter); GQA + MoE keep it modest. KV is f16-estimated per model and added to the footprint that must fit. Capability reads ONLY parts' specs — an unknown spec is treated as unknown, never 0.
| Item | Spec / notes | Qty | Unit $ | Source URL | Ext. |
|---|
Provenance: every price carries a freshness badge — ● sourced fresh (<7d), ◌ estimate needs verifying, ● stale (>21d). Estimates are flagged, never hidden; nothing is ever fabricated. Autofill: paste a product URL and click ⤓ — it reads structured data only, through a public CORS proxy, and says so when it can't.