Private Draft

The 29 personas behind AI

We’ve organized every stage and persona in the AI supply chain, informed by real recruiting at frontier companies. Click any row to see matching profiles from our talent graph.

Shaped by Industry Experts
Kumar Chellapilla, VPE
Jennifer Anderson, VPE / Stanford PhD
Thuan Pham, CTO
Akash Garg, CTO
Linghao Zhang, Research Engineer
Wayne Chang, Early FB Engineer
Indrajit Khare, EM & Head of Product

Training Infrastructure

Builds systems that train at scale

Known as: ML Systems Engineer, Distributed Training Engineer, GPU/TPU Kernel Engineer, Performance Engineer, Network Engineer (AI/HPC), Fabric Engineer

Builds, maintains, and improves the systems researchers use to train models: GPU clusters, distributed training parallelism, network communication, fault tolerance, and the operational layer that keeps multi-week training runs alive.

Specializations

Distributed Training Systems: GPU clusters, parallelism strategies (data/pipeline/tensor), fault tolerance, checkpoint management, and hardware correctness (GPU error detection, ECC monitoring, fault isolation). The systems layer that makes large-scale training possible (see the first sketch below).
Networking & Communication: InfiniBand, RoCE, NVLink/NVSwitch topologies, collective communication libraries (NCCL), congestion control, and fabric design for training clusters. Network performance is often the binding constraint on training throughput at scale (see the second sketch below).
Training Run Operations: 24/7 monitoring of active runs, anomaly response, crash recovery, hardware fault isolation, and checkpoint management. At frontier labs this operational responsibility is formalized and cross-functional; senior engineers rotate through on-call to keep multi-week runs alive.
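
The fault-tolerance and checkpoint work in the first and third specializations is easiest to see in code. Below is a minimal sketch of a resumable data-parallel training loop, assuming a PyTorch + torchrun setup, which this page does not prescribe; the model, batch, and checkpoint path are illustrative placeholders, not anything used at the companies named here.

```python
# Minimal sketch of a resumable data-parallel training loop.
# Assumes launch via torchrun (which sets RANK / WORLD_SIZE / LOCAL_RANK);
# the model, batch, and checkpoint path are illustrative placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT_PATH = "/shared/ckpt/latest.pt"  # hypothetical shared-filesystem path


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = DDP(torch.nn.Linear(4096, 4096).to(device), device_ids=[local_rank])
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Resume from the last checkpoint if one exists, so a crashed run
    # restarts near where it died instead of from step 0.
    start_step = 0
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location=device)
        model.module.load_state_dict(ckpt["model"])
        optim.load_state_dict(ckpt["optim"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, 10_000):
        batch = torch.randn(32, 4096, device=device)  # stand-in batch
        loss = model(batch).pow(2).mean()              # stand-in loss
        optim.zero_grad()
        loss.backward()   # DDP all-reduces gradients over NCCL here
        optim.step()

        # Periodic checkpointing: rank 0 writes, everyone synchronizes.
        if step % 500 == 0:
            if dist.get_rank() == 0:
                torch.save({"model": model.module.state_dict(),
                            "optim": optim.state_dict(),
                            "step": step}, CKPT_PATH)
            dist.barrier()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with torchrun (for example `torchrun --nproc_per_node=8 train.py`), each process drives one GPU; the operations specialization is largely about detecting when a loop like this dies and getting it resumed quickly.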
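For the networking specialization, collective performance is usually measured before it is tuned. This second sketch times a bare all-reduce to estimate bandwidth, again assuming PyTorch over NCCL; the 512 MB payload and the ring-all-reduce bus-bandwidth normalization are assumptions made for the sketch.

```python
# Minimal sketch of an all-reduce bandwidth probe over NCCL.
# Payload size and iteration counts are arbitrary; real fabric work sweeps
# message sizes and topologies and overlaps communication with compute.
import os
import time
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    numel = 256 * 1024 * 1024  # 256M fp16 elements, ~512 MB per rank
    buf = torch.ones(numel, dtype=torch.float16, device=f"cuda:{local_rank}")

    for _ in range(5):          # warm up NCCL communicators before timing
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    per_iter = (time.perf_counter() - t0) / iters

    if dist.get_rank() == 0:
        world = dist.get_world_size()
        payload = buf.element_size() * buf.numel()
        algo_bw = payload / per_iter / 1e9
        # Ring all-reduce moves ~2*(N-1)/N of the payload per rank,
        # which is the usual "bus bandwidth" normalization.
        bus_bw = algo_bw * 2 * (world - 1) / world
        print(f"all-reduce: {algo_bw:.1f} GB/s algorithm, ~{bus_bw:.1f} GB/s bus")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```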
Supply-chain stages
1. Substrate
2. Compute: Primary. Builds GPU clusters, distributed training systems, and the operational layer for multi-week runs.
3. Intelligence: Secondary. Directly enables training at scale through fault tolerance, throughput, and determinism for research teams.
4. Systems
5. Distribution
Ludvik Lenn (OpenAI, Distributed training)
Builds fault-tolerant training services, checkpoint lifecycles, determinism controls, and cluster-level throughput levers.

Liam Raynott (Anthropic, Network & comms)
Owns GPU-cluster communication paths and collective performance (InfiniBand, NVLink, NCCL) where comms become the binding constraint.

Shawn Williams (xAI, Training ops / ML SRE)
Runs the 24/7 operational layer: anomaly response, crash recovery, and keeping multi-week runs alive.

Demand by company stage
Early-Stage: Rare
Growth: Occasional
Enterprise: Primary

Only orgs training their own models need this role. Early-stage companies rely on cloud providers unless the company is founded specifically to build training infrastructure.

Let’s Find Your Next Builder

If you’re hiring at the AI frontier, let’s talk.