
The 29 personas behind AI

We’ve organized every stage and persona in the AI supply chain, informed by real recruiting at frontier companies. Click any row to see matching profiles from our talent graph.

Shaped by Industry Experts
Kumar Chellapilla (VPE)
Jennifer Anderson (VPE, Stanford PhD)
Thuan Pham (CTO)
Akash Garg (CTO)
Linghao Zhang (Research Engineer)
Wayne Chang (Early FB Engineer)
Indrajit Khare (EM & Head of Product)

Inference Optimization

Makes models fast and cheap

Known as: Inference Engineer, GPU Kernel Engineer, Model Serving Engineer, Edge AI Engineer, Performance Engineer, On-Device ML Engineer, Model Optimization Engineer

Performance engineering for deployed models: model compression (quantization, pruning, distillation), GPU kernel optimization, serving runtime performance, and cost/latency tradeoffs. Turns model math into production speed — the difference between a demo and a viable product.

Specializations

Runtime Optimization: Batching strategies, KV-cache optimization, speculative decoding, and prefill/decode scheduling, implemented in serving runtimes like vLLM and SGLang, plus the runtime tuning that turns raw GPU capacity into low-latency, high-throughput inference. Long-context serving (1M+ tokens) is becoming a strategic bottleneck: KV-cache memory management at scale is the primary constraint on context-window expansion, which directly gates new classes of agent capability. (A serving sketch follows this list.)
Kernel & Hardware Optimization: GPU kernels, quantization (PTQ, quantization-aware training, INT8/INT4/mixed precision), pruning, structured sparsity, compilation, and hardware-specific optimization. Turns low-level performance gains into product leverage: milliseconds into margins. (A quantization sketch follows this list.)
On-Device / Edge ML: Model compilation for phones, wearables, and vehicles. Neural architecture search for constrained compute, hardware-software co-design, and the tradeoffs between quality, latency, power, and memory that define on-device deployment. (An export sketch follows this list.)
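
To make the runtime levers concrete, here is a minimal sketch of serving-side tuning, assuming vLLM's offline Python API. The model name and the specific values are illustrative, not recommendations.

```python
# Minimal sketch of runtime-level serving knobs via vLLM's offline API.
# Model name and tuning values are illustrative, not recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    gpu_memory_utilization=0.90,   # fraction of VRAM for weights + KV cache
    max_model_len=32768,           # context window; bounds per-request KV-cache growth
    enable_prefix_caching=True,    # reuse KV cache across requests sharing a prefix
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Summarize KV-cache paging in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Batching and prefill/decode scheduling happen inside the engine; the exposed knobs above are where KV-cache budget and throughput/latency tradeoffs get set.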
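On the compression side, a minimal sketch of post-training dynamic quantization in PyTorch. The toy model is a stand-in; real PTQ pipelines add calibration data and quality evaluation before and after quantizing.

```python
# Minimal sketch of post-training dynamic quantization with PyTorch.
# The tiny model is a stand-in for a real network.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Dynamic PTQ: weights stored as INT8, activations quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller weights, int8 matmuls
```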
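And for on-device work, a minimal sketch of exporting a model to ONNX as a first step toward edge runtimes. The architecture and input shape are illustrative; each target (Core ML, TFLite, ONNX Runtime Mobile) adds its own conversion and tuning steps.

```python
# Minimal sketch of exporting a model for edge runtimes via ONNX.
# Architecture and input shape are illustrative only.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
).eval()

dummy = torch.randn(1, 3, 224, 224)  # fixed shape keeps edge compilers happy
torch.onnx.export(model, dummy, "model.onnx", opset_version=17,
                  input_names=["image"], output_names=["logits"])
```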

Often unified with Serving Infrastructure and Model Operations under "Model Runtime" or "AI Production." Inference cost is becoming the binding unit-economics constraint for model providers — as usage scales, the difference between optimized and unoptimized serving is the difference between viable margins and subsidized access. This is shifting hiring weight toward performance engineering faster than most orgs anticipated.

[1] Substrate
[2] Compute: Primary

GPU kernel optimization, quantization, and hardware-specific performance engineering for deployed models.

[3] Intelligence: Secondary

Model compression decisions (e.g., quantization, pruning) directly affect quality and capability tradeoffs.

[4] Systems: Primary

Owns serving runtime performance: latency, throughput, and cost at production scale.

[5] Distribution
Philip Wagener (Fireworks): Kernel & quantization

Turns model math into fast kernels and low-bit deployments: post-training quantization, mixed precision, and hardware-specific tuning.

Jon Richards (NVIDIA): Serving runtime

Owns batching, KV-cache efficiency, prefill/decode scheduling, and the latency/throughput frontier.

Nate Walker (Together): Edge / on-device

Ships constrained inference where power, memory, and thermals are mandatory constraints, not optional tradeoffs.

Early-Stage: Occasional
Growth: Common
Enterprise: Primary

Critical as inference costs scale. Early-stage companies typically lean on managed inference providers; growth-stage and later teams optimize serving in-house to protect margins.

Let’s Find Your Next Builder

If you’re hiring at the AI frontier, let’s talk.