ML Infra Engineer (Distributed Training)

Rhoda AI

Software Engineering, Data Science
Palo Alto, CA, USA
Posted on Mar 17, 2026

Location

Palo Alto

Employment Type

Full time

Department

Research

At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundation models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team drawn from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work. With over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.

We're hiring a Staff/Principal ML Systems Engineer to own training performance end-to-end and turn our training platform into a high-efficiency, high-reliability engine for research iteration.

What You'll Do

Own performance at scale

  • Diagnose and improve end-to-end training performance for large models trained on multimodal robotic data (vision, proprioception, actions, language, video)

  • Build a repeatable workflow for performance attribution: step-time breakdown (compute vs. collectives/communication vs. dataloader vs. checkpointing), scaling curves, and bottleneck identification at different GPU counts (see the sketch after this list)

  • Drive measurable gains in:

    • Distributed efficiency (overlap, bucket sizing, rank/topology mapping, parallelism strategy)

    • Compute efficiency (kernel hotspots, fusion, attention performance, framework overhead)

    • Memory efficiency (activation checkpointing, packing/bucketing, reduced padding waste)
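
As one concrete illustration of the attribution workflow above, the sketch below times the major phases of a single PyTorch training step (dataloader, forward, backward, optimizer) with coarse wall-clock timers. All names (model, loader, optimizer) are placeholders rather than anything specific to Rhoda AI, and because DDP/FSDP overlap gradient collectives with the backward pass, a trace-level profiler such as torch.profiler is needed to separate communication from compute precisely.

# A minimal sketch of per-phase step-time attribution for a PyTorch training
# loop; all names here (model, loader, optimizer) are illustrative placeholders.
import time
from collections import defaultdict
from contextlib import contextmanager

import torch

@contextmanager
def timed(phase, timings):
    # Synchronize so GPU work launched inside the phase is charged to it.
    torch.cuda.synchronize()
    start = time.perf_counter()
    yield
    torch.cuda.synchronize()
    timings[phase] += time.perf_counter() - start

def profile_steps(model, loader, optimizer, num_steps=20):
    timings = defaultdict(float)
    data_iter = iter(loader)
    for _ in range(num_steps):
        with timed("dataloader", timings):
            batch = next(data_iter)
        with timed("forward", timings):
            loss = model(batch).mean()
        with timed("backward", timings):    # DDP/FSDP collectives overlap here;
            loss.backward()                 # use torch.profiler to split comm
        with timed("optimizer", timings):   # from compute at finer granularity
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
    total = sum(timings.values())
    for phase, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{phase:>10}: {secs:7.3f}s  ({100 * secs / total:4.1f}%)")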

Make performance observable and durable

  • Create "source of truth" metrics and dashboards for both: per-job performance ("why is this run slow?") and fleet-wide performance ("where are we losing GPU-hours this week?")

  • Build automated performance regression detection: a microbenchmark suite per model family, CI perf gates or lightweight canary runs, and "golden configs" with standard launch templates
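
As an example of what a lightweight perf gate could look like, the sketch below compares a fresh microbenchmark measurement against a stored golden baseline and fails CI when step time regresses past a tolerance. The baseline file layout, the model-family key, and the 5% threshold are illustrative assumptions, not an existing Rhoda AI format.

# A minimal sketch of a CI perf gate: compare a measured step time against a
# "golden" baseline and return a nonzero exit code on regression.
import json
import sys

TOLERANCE = 0.05  # fail if step time regresses by more than 5% (illustrative)

def check_regression(baseline_path, model_family, measured_step_time_s):
    with open(baseline_path) as f:
        golden = json.load(f)  # e.g. {"world_model_7b": {"step_time_s": 1.42}}
    baseline = golden[model_family]["step_time_s"]
    regression = (measured_step_time_s - baseline) / baseline
    print(f"{model_family}: baseline={baseline:.3f}s "
          f"measured={measured_step_time_s:.3f}s delta={100 * regression:+.1f}%")
    return 1 if regression > TOLERANCE else 0

if __name__ == "__main__":
    # Usage: python perf_gate.py golden.json world_model_7b 1.51
    sys.exit(check_regression(sys.argv[1], sys.argv[2], float(sys.argv[3])))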

Partner deeply with researchers (no silos)

  • Work closely with researchers and research engineers to translate model changes into scalable implementations

  • Provide guidance on training strategy tradeoffs relevant to robotics world models (sequence lengths, rollout/eval cadence, variable-length multimodal data, etc.)

  • Reduce the operational burden on researchers so they can focus on model quality and robotic behavior

Collaborate on cluster efficiency (as part of the infra team)

Partner with infra/SRE to reduce wasted GPU-hours from:

  • Stragglers and degraded nodes

  • Network health issues

  • Checkpoint stalls and storage bottlenecks

  • Scheduler placement issues for large distributed jobs

What We're Looking For

  • Significant experience delivering distributed training performance improvements in production research environments (large-scale GPU training strongly preferred)

  • Strong hands-on experience with modern training stacks (e.g., PyTorch; familiarity with JAX a plus)

  • Deep understanding of distributed training concepts and tradeoffs: sharded training (FSDP/ZeRO-style), tensor/pipeline parallelism, gradient accumulation, comm/compute overlap, and the ability to diagnose and improve collective communication performance (see the sketch after this list)

  • Strong debugging and measurement instincts: you can turn ambiguous "it's slow" into a clear bottleneck + experiment plan + validated fix

  • Comfortable operating in a fast-moving startup environment with high ownership and minimal bureaucracy
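
For context on the sharded-training bullet above, here is a minimal sketch of FSDP-style (ZeRO-3-like) sharded training in PyTorch: parameters, gradients, and optimizer state are sharded across ranks, and the framework overlaps the all-gathers and reduce-scatters with compute. The model, batch, and hyperparameters are placeholders, and the script assumes one process per GPU launched via torchrun.

# A minimal sketch of FSDP sharded training; model and data are placeholders.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")  # expects torchrun-provided env variables
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Sequential(     # placeholder model, not a robotics model
        torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
    ).cuda()

    # Default FSDP fully shards params, grads, and optimizer state (ZeRO-3 style)
    # and overlaps collective communication with forward/backward compute.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).pow(2).mean()  # dummy objective for illustration
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched as, for example: torchrun --nproc_per_node=8 fsdp_sketch.py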

Nice to Have (But Not Required)

  • Experience with GPU kernel-level performance work (CUDA/Triton), fused ops, compiler/graph capture

  • Experience with multimodal/video training and variable-length sequence packing/bucketing

  • Experience building observability systems for ML training (metrics/logs/traces + dashboards + alerting)

  • Familiarity with large-cluster scheduling or topology-aware placement (Slurm/K8s/HPC environments)

Why This Role

  • Direct impact on model iteration speed — your work translates directly into faster research cycles and better robotic capability

  • Work at the frontier of large-scale training for real-world robotics, not toy benchmarks

  • Tight collaboration between systems, research, and infrastructure (no silos)

  • High ownership in a small, ambitious team building foundational technology

  • Meaningful leverage: improvements you make compound across every training run the research team executes