We’ve organized every stage and persona in the AI supply chain, informed by real recruiting at frontier companies. Click any row to see matching profiles from our talent graph.
Summary
Known as: Member of Technical Staff, Research Engineer, RL Engineer
Modifies model weights to turn base models into deployment-ready systems: instruction following, stronger reasoning on multi-step tasks, steerable behavior, and safer, more reliable outputs. Uses fine-tuning and preference-based optimization (often with reinforcement learning) to shape behavior. This is where many of the changes users actually feel are made.
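To make the preference-based optimization mentioned above concrete, here is a minimal sketch of the Direct Preference Optimization (DPO) loss for a single preference pair. The function name and inputs are illustrative, not from any specific library; real training code operates on batched token log-probabilities from a policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under
    the policy being trained and under a frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen response (relative to the reference) than it favors
    # the rejected one.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Negative log-sigmoid: minimized by driving the margin up.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; the loss shrinks as the policy learns to prefer the chosen response.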
Specializations
How to split compute between pre-training and RL is a live frontier decision: labs are only beginning to scale RL compute and expect to increase it dramatically. That shift changes the hiring weight toward RL infrastructure and reward engineering and away from data-mixing optimization. The surface area of RL environments is also expanding fast: computer use (GUI navigation, web browsers, desktop applications) is now a distinct training domain alongside code, math, and tool use.
Where the Work Lives
Modifies model weights via RLHF, DPO, and other RL methods to shape behavior, reasoning, and safety.
Defines how models behave in deployment — safety tuning, instruction following, and behavioral guardrails.
Candidate Archetypes
Turns base models into compliant assistants via supervised fine-tuning, format hardening, and behavior shaping.
Owns preference datasets, reward signals, and the RLHF/DPO/RLVR optimization loop that shapes model behavior.
Drives refusal boundaries, policy adherence, and behavior regression gating via tight eval loops.
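The behavior regression gating named in the last archetype can be sketched as a simple release check: block any checkpoint whose eval scores drop beyond a tolerance versus the current baseline. The function, suite names, and threshold below are hypothetical illustrations of the pattern, not a specific lab's tooling.

```python
def regression_gate(baseline, candidate, max_drop=0.01):
    """Return True (safe to ship) only if no eval suite regresses
    by more than max_drop relative to the baseline checkpoint.

    baseline / candidate map suite name -> score in [0, 1],
    higher is better (e.g. a policy-adherence pass rate).
    """
    for suite, base_score in baseline.items():
        # A missing suite on the candidate counts as a full regression.
        if candidate.get(suite, 0.0) < base_score - max_drop:
            return False  # behavioral regression: block the release
    return True
```

In practice such a gate runs inside the "tight eval loop": every post-training checkpoint is scored on refusal, instruction-following, and safety suites before it can replace the deployed model.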
Company Scale
Frontier labs hire for RLHF; growth-stage companies occasionally hire for fine-tuning.
Featured Roles
If you’re hiring at the AI frontier, let’s talk.