Mathematical exposition of decoder-only transformer architectures using Einstein summation notation. Covers multi-head attention, RoPE/ALiBi positional encoding, FlashAttention, KV caching, ...
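To make the Einstein-summation view of multi-head attention concrete, here is a minimal NumPy sketch of causal multi-head self-attention written entirely with `np.einsum`. The shapes and weight names (`Wq`, `Wk`, `Wv`, `Wo`) are illustrative assumptions, not taken from any particular library; RoPE/ALiBi, FlashAttention, and KV caching are omitted here.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    """Causal multi-head self-attention in einsum notation.

    x:          (T, d_model)        token representations
    Wq, Wk, Wv: (d_model, h, d_head) per-head projections
    Wo:         (h, d_head, d_model) output projection
    (Weight names are illustrative, not from a specific library.)
    """
    T = x.shape[0]
    d_head = Wq.shape[-1]
    # Per-head projections: (T, h, d_head)
    q = np.einsum('td,dhk->thk', x, Wq)
    k = np.einsum('td,dhk->thk', x, Wk)
    v = np.einsum('td,dhk->thk', x, Wv)
    # Scaled dot-product scores per head: (h, T_query, T_key)
    scores = np.einsum('ihd,jhd->hij', q, k) / np.sqrt(d_head)
    # Causal mask: query position i may only attend to keys j <= i
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax over the key axis (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of values, then merge heads back to d_model
    out = np.einsum('hij,jhd->ihd', weights, v)
    return np.einsum('ihd,hdm->im', out, Wo)
```

Each `einsum` string makes the summed-over indices explicit (`d` in the projections, the key index `j` in the value aggregation), which is the notational payoff of writing attention this way.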