Private Draft

The 29 personas behind AI

We’ve organized every stage and persona in the AI supply chain, informed by real recruiting at frontier companies. Click any row to see matching profiles from our talent graph.

Shaped by Industry Experts
Kumar Chellapilla, VPE
Jennifer Anderson, VPE / Stanford PhD
Thuan Pham, CTO
Akash Garg, CTO
Linghao Zhang, Research Engineer
Wayne Chang, Early FB Engineer
Indrajit Khare, EM & Head of Product

Training Infrastructure

Builds systems that train at scale

Known as: ML Systems Engineer, Distributed Training Engineer, GPU/TPU Kernel Engineer, Performance Engineer, Network Engineer (AI/HPC), Fabric Engineer

Builds, maintains, and improves the systems researchers use to train models: GPU clusters, distributed training parallelism, network communication, fault tolerance, and the operational layer that keeps multi-week training runs alive.

Specializations

Distributed Training Systems: GPU clusters, parallelism strategies (data/pipeline/tensor), fault tolerance, checkpoint management, and hardware correctness (GPU error detection, ECC monitoring, fault isolation). The systems layer that makes large-scale training possible (see the first sketch below).
Networking & Communication: InfiniBand, RoCE, NVLink/NVSwitch topologies, collective communication libraries (NCCL), congestion control, and fabric design for training clusters. Network performance is often the binding constraint on training throughput at scale (see the second sketch below).
Training Run Operations: 24/7 monitoring of active runs, anomaly response, crash recovery, hardware fault isolation, and checkpoint management. At frontier labs this operational responsibility is formalized and cross-functional; senior engineers rotate through on-call to keep multi-week runs alive.
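
The fault-tolerance and checkpoint work in the first and third specializations is easiest to see in code. Below is a minimal sketch of a resumable data-parallel training loop, assuming a PyTorch + torchrun setup, which this page does not prescribe; the model, batch, and checkpoint path are illustrative placeholders, not anything used at the companies named here.

```python
# Minimal sketch of a resumable data-parallel training loop.
# Assumes launch via torchrun (which sets RANK / WORLD_SIZE / LOCAL_RANK);
# the model, batch, and checkpoint path are illustrative placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT_PATH = "/shared/ckpt/latest.pt"  # hypothetical shared-filesystem path


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = DDP(torch.nn.Linear(4096, 4096).to(device), device_ids=[local_rank])
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Resume from the last checkpoint if one exists, so a crashed run
    # restarts near where it died instead of from step 0.
    start_step = 0
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location=device)
        model.module.load_state_dict(ckpt["model"])
        optim.load_state_dict(ckpt["optim"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, 10_000):
        batch = torch.randn(32, 4096, device=device)  # stand-in batch
        loss = model(batch).pow(2).mean()              # stand-in loss
        optim.zero_grad()
        loss.backward()   # DDP all-reduces gradients over NCCL here
        optim.step()

        # Periodic checkpointing: rank 0 writes, everyone synchronizes.
        if step % 500 == 0:
            if dist.get_rank() == 0:
                torch.save({"model": model.module.state_dict(),
                            "optim": optim.state_dict(),
                            "step": step}, CKPT_PATH)
            dist.barrier()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with torchrun (for example `torchrun --nproc_per_node=8 train.py`), each process drives one GPU; the operations specialization is largely about detecting when a loop like this dies and getting it resumed quickly.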
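For the networking specialization, collective performance is usually measured before it is tuned. This second sketch times a bare all-reduce to estimate bandwidth, again assuming PyTorch over NCCL; the 512 MB payload and the ring-all-reduce bus-bandwidth normalization are assumptions made for the sketch.

```python
# Minimal sketch of an all-reduce bandwidth probe over NCCL.
# Payload size and iteration counts are arbitrary; real fabric work sweeps
# message sizes and topologies and overlaps communication with compute.
import os
import time
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    numel = 256 * 1024 * 1024  # 256M fp16 elements, ~512 MB per rank
    buf = torch.ones(numel, dtype=torch.float16, device=f"cuda:{local_rank}")

    for _ in range(5):          # warm up NCCL communicators before timing
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    per_iter = (time.perf_counter() - t0) / iters

    if dist.get_rank() == 0:
        world = dist.get_world_size()
        payload = buf.element_size() * buf.numel()
        algo_bw = payload / per_iter / 1e9
        # Ring all-reduce moves ~2*(N-1)/N of the payload per rank,
        # which is the usual "bus bandwidth" normalization.
        bus_bw = algo_bw * 2 * (world - 1) / world
        print(f"all-reduce: {algo_bw:.1f} GB/s algorithm, ~{bus_bw:.1f} GB/s bus")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```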
Supply-chain stages
1. Substrate
2. Compute: Primary. Builds GPU clusters, distributed training systems, and the operational layer for multi-week runs.
3. Intelligence: Secondary. Directly enables training at scale through fault tolerance, throughput, and determinism for research teams.
4. Systems
5. Distribution
Ludvik Lenn (OpenAI, Distributed training)
Builds fault-tolerant training services, checkpoint lifecycles, determinism controls, and cluster-level throughput levers.

Liam Raynott (Anthropic, Network & comms)
Owns GPU-cluster communication paths and collective performance (InfiniBand, NVLink, NCCL) where comms become the binding constraint.

Shawn Williams (xAI, Training ops / ML SRE)
Runs the 24/7 operational layer: anomaly response, crash recovery, and keeping multi-week runs alive.

Demand by company stage
Early-Stage: Rare
Growth: Occasional
Enterprise: Primary

Only orgs training their own models need this role. Early-stage companies rely on cloud providers unless the company is founded specifically to build training infrastructure.

Let’s Find Your Next Builder

If you’re hiring at the AI frontier, let’s talk.