
The 29 personas behind AI

We’ve organized every stage and persona in the AI supply chain, informed by real recruiting at frontier companies. Click any row to see matching profiles from our talent graph.

Shaped by Industry Experts
Kumar Chellapilla (VPE)
Jennifer Anderson (VPE, Stanford PhD)
Thuan Pham (CTO)
Akash Garg (CTO)
Linghao Zhang (Research Engineer)
Wayne Chang (Early FB Engineer)
Indrajit Khare (EM & Head of Product)

Inference Optimization

Makes models fast and cheap

Known as: Inference Engineer, GPU Kernel Engineer, Model Serving Engineer, Edge AI Engineer, Performance Engineer, On-Device ML Engineer, Model Optimization Engineer

Performance engineering for deployed models: model compression (quantization, pruning, distillation), GPU kernel optimization, serving runtime performance, and cost/latency tradeoffs. Turns model math into production speed — the difference between a demo and a viable product.

Specializations

Runtime Optimization: Batching strategies, KV-cache optimization, speculative decoding, and prefill/decode scheduling, implemented in serving runtimes like vLLM and SGLang, plus the runtime tuning that turns raw GPU capacity into low-latency, high-throughput inference. Long-context serving (1M+ tokens) is becoming a strategic bottleneck: KV-cache memory management at scale is the primary constraint on context-window expansion, which directly gates new classes of agent capability. (A serving sketch follows this list.)
Kernel & Hardware Optimization: GPU kernels, quantization (PTQ, quantization-aware training, INT8/INT4/mixed precision), pruning, structured sparsity, compilation, and hardware-specific optimization. Turns low-level performance gains into product leverage: milliseconds into margins. (A quantization sketch follows this list.)
On-Device / Edge ML: Model compilation for phones, wearables, and vehicles. Neural architecture search for constrained compute, hardware-software co-design, and the tradeoffs between quality, latency, power, and memory that define on-device deployment. (An export sketch follows this list.)
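
To make the runtime levers concrete, here is a minimal sketch of serving-side tuning, assuming vLLM's offline Python API. The model name and the specific values are illustrative, not recommendations.

```python
# Minimal sketch of runtime-level serving knobs via vLLM's offline API.
# Model name and tuning values are illustrative, not recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    gpu_memory_utilization=0.90,   # fraction of VRAM for weights + KV cache
    max_model_len=32768,           # context window; bounds per-request KV-cache growth
    enable_prefix_caching=True,    # reuse KV cache across requests sharing a prefix
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Summarize KV-cache paging in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Batching and prefill/decode scheduling happen inside the engine; the exposed knobs above are where KV-cache budget and throughput/latency tradeoffs get set.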
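On the compression side, a minimal sketch of post-training dynamic quantization in PyTorch. The toy model is a stand-in; real PTQ pipelines add calibration data and quality evaluation before and after quantizing.

```python
# Minimal sketch of post-training dynamic quantization with PyTorch.
# The tiny model is a stand-in for a real network.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Dynamic PTQ: weights stored as INT8, activations quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller weights, int8 matmuls
```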
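And for on-device work, a minimal sketch of exporting a model to ONNX as a first step toward edge runtimes. The architecture and input shape are illustrative; each target (Core ML, TFLite, ONNX Runtime Mobile) adds its own conversion and tuning steps.

```python
# Minimal sketch of exporting a model for edge runtimes via ONNX.
# Architecture and input shape are illustrative only.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
).eval()

dummy = torch.randn(1, 3, 224, 224)  # fixed shape keeps edge compilers happy
torch.onnx.export(model, dummy, "model.onnx", opset_version=17,
                  input_names=["image"], output_names=["logits"])
```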

Often unified with Serving Infrastructure and Model Operations under "Model Runtime" or "AI Production." Inference cost is becoming the binding unit-economics constraint for model providers — as usage scales, the difference between optimized and unoptimized serving is the difference between viable margins and subsidized access. This is shifting hiring weight toward performance engineering faster than most orgs anticipated.

[1] Substrate
[2] Compute: Primary

GPU kernel optimization, quantization, and hardware-specific performance engineering for deployed models.

[3] Intelligence: Secondary

Model compression decisions (e.g., quantization, pruning) directly affect quality and capability tradeoffs.

[4] Systems: Primary

Owns serving runtime performance: latency, throughput, and cost at production scale.

[5] Distribution
Philip Wagener (Fireworks): Kernel & quantization

Turns model math into fast kernels and low-bit deployments: post-training quantization, mixed precision, and hardware-specific tuning.

Jon Richards (NVIDIA): Serving runtime

Owns batching, KV-cache efficiency, prefill/decode scheduling, and the latency/throughput frontier.

Nate Walker (Together): Edge / on-device

Ships constrained inference where power, memory, and thermals are mandatory constraints, not optional tradeoffs.

Early-Stage: Occasional
Growth: Common
Enterprise: Primary

Critical as inference costs scale. Early-stage companies typically lean on managed inference providers; growth-stage and later teams optimize serving in-house to protect margins.

Let’s Find Your Next Builder

If you’re hiring at the AI frontier, let’s talk.