We’ve organized every stage and persona in the AI supply chain, informed by real recruiting at frontier companies.
Summary
Known as: Inference Engineer, GPU Kernel Engineer, Model Serving Engineer, Edge AI Engineer, Performance Engineer, On-Device ML Engineer, Model Optimization Engineer
Performance engineering for deployed models: model compression (quantization, pruning, distillation), GPU kernel optimization, serving runtime performance, and cost/latency tradeoffs. Turns model math into production speed — the difference between a demo and a viable product.
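To ground the compression work named above, here is a minimal sketch of symmetric per-channel post-training int8 quantization in NumPy. The layer shape and the per-output-channel scaling choice are illustrative assumptions, not a reference implementation of any production stack:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-output-channel int8 quantization of a 2D weight matrix."""
    # One scale per output row; 127 is the max magnitude of int8's symmetric range.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Illustrative 4096x4096 layer: measure the error the compression introduces.
w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).max()
print(f"int8 stores 4x less than fp32; max abs rounding error: {err:.5f}")
```

Production variants (group-wise scales, activation-aware calibration, GPTQ/AWQ-style solvers) are where the engineering depth lies; the quality/cost tradeoff in this ten-line version is the same one those methods refine.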
Specializations
GPU kernel optimization, quantization, and hardware-specific performance engineering for deployed models.
Model compression decisions (quantization, pruning, distillation) that trade model quality and capability against cost and speed.
Serving runtime performance: latency, throughput, and cost at production scale.
Where the Work Lives
Often unified with Serving Infrastructure and Model Operations under "Model Runtime" or "AI Production." Inference cost is becoming the binding unit-economics constraint for model providers: as usage scales, the gap between optimized and unoptimized serving is the gap between viable margins and subsidized access, and that is shifting hiring weight toward performance engineering faster than most orgs anticipated. A back-of-envelope cost sketch follows below.
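To make the unit-economics claim concrete, a back-of-envelope sketch. Every number below (GPU price, throughput, speedup) is an assumption chosen for illustration, not a measurement of any provider or accelerator:

```python
# Back-of-envelope serving economics. All numbers are illustrative assumptions.
GPU_COST_PER_HOUR = 2.50           # assumed all-in $/hour for one accelerator
TOKENS_PER_SECOND_BASELINE = 400   # assumed unoptimized decode throughput
SPEEDUP_FROM_OPTIMIZATION = 3.0    # assumed gain from batching/quantization/kernels

def cost_per_million_tokens(tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000

baseline = cost_per_million_tokens(TOKENS_PER_SECOND_BASELINE)
optimized = cost_per_million_tokens(TOKENS_PER_SECOND_BASELINE * SPEEDUP_FROM_OPTIMIZATION)
print(f"baseline:  ${baseline:.2f} / 1M tokens")
print(f"optimized: ${optimized:.2f} / 1M tokens")
# At fixed revenue per token, throughput gains drop straight to gross margin.
```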
Candidate Archetypes
Turns model math into fast kernels and low-bit deployments — post-training quantization, mixed precision, and hardware-specific tuning.
Owns batching, KV-cache efficiency, prefill/decode scheduling, and the latency/throughput frontier (a KV-cache sizing sketch follows this list).
Ships on-device inference where power, memory, and thermals are hard limits, not optional tradeoffs.
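The KV-cache work named above reduces to simple arithmetic. A minimal sizing sketch, where the layer count, head counts, and head dimension are assumed values loosely shaped like a mid-size grouped-query-attention transformer, not any specific model:

```python
# KV-cache memory per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# All dimensions below are illustrative assumptions, not a specific model's config.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES_FP16, BYTES_INT8 = 2, 1

def kv_cache_bytes(context_len: int, batch: int, bytes_per_elem: int) -> int:
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_elem
    return per_token * context_len * batch

gib = 1024 ** 3
for label, b in (("fp16", BYTES_FP16), ("int8", BYTES_INT8)):
    size = kv_cache_bytes(context_len=8192, batch=32, bytes_per_elem=b)
    print(f"{label} cache, batch 32 @ 8k ctx: {size / gib:.1f} GiB")
# Halving cache bytes roughly doubles the batch that fits on one device, i.e.
# throughput, which is why cache quantization and paging sit on the critical path.
```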
Company Scale
Critical as inference costs scale: early-stage companies lean on managed providers, while growth-stage and later bring optimization in-house to protect margins.
Featured Roles
If you’re hiring at the AI frontier, let’s talk.