ML Infra Engineer - Platform
Rhoda AI
Location
Palo Alto
Employment Type
Full-time
Department
Software
At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots — from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work — and with over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.
We train large models on an NVIDIA B200 GPU cluster. The cluster runs at high capacity, and our next step is to make it predictable, reliable, and measurably efficient so researchers spend less time babysitting jobs and more time advancing model capability and real-world robot behavior. We're looking for a Cluster Reliability / SRE owner to help us build that operational foundation.
What You'll Do
Own fleet health and node reliability
Build and operate node health checks (GPU, CPU, memory, NIC, storage) and automated health scoring
Detect and mitigate stragglers (e.g., thermal/power throttling, ECC issues, network degradation)
Implement automatic quarantine/drain policies and safe reintegration workflows
Drive uptime improvements through preventative maintenance, root-cause analysis, and runbooks
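To make the health-scoring and quarantine idea above concrete, here is a minimal Python sketch. The telemetry fields, weights, and the 0.7 drain threshold are all hypothetical policy choices for illustration, not an actual implementation:

```python
from dataclasses import dataclass

@dataclass
class NodeTelemetry:
    """One snapshot of per-node health signals (field names are illustrative)."""
    gpu_ecc_uncorrectable: int   # uncorrectable ECC errors since last reset
    gpu_thermal_throttled: bool  # any GPU reporting thermal/power throttling
    nic_rx_drops_per_min: float  # NIC receive drops per minute
    disk_util_pct: float         # local scratch utilization, percent

def health_score(t: NodeTelemetry) -> float:
    """Score a node in [0, 1]; 1.0 is healthy. Weights are placeholder policy."""
    score = 1.0
    if t.gpu_ecc_uncorrectable > 0:
        score -= 0.5                              # uncorrectable ECC is near-disqualifying
    if t.gpu_thermal_throttled:
        score -= 0.2
    score -= min(0.2, t.nic_rx_drops_per_min / 1000.0)
    if t.disk_util_pct > 95.0:
        score -= 0.1
    return max(0.0, score)

def should_quarantine(t: NodeTelemetry, threshold: float = 0.7) -> bool:
    """Drain the node (stop scheduling new jobs onto it) below the threshold."""
    return health_score(t) < threshold
```

A real pipeline would feed these fields from something like DCGM, node exporters, and NIC counters, and pair the drain decision with a burn-in test before reintegration.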
Build observability and fast diagnosis
Establish "source of truth" telemetry for cluster + jobs:
GPU health and performance signals (clocks, throttling, error rates)
Network and storage performance indicators (latency, throughput, tail behavior)
Job-level health (retries, hangs, step-time anomalies)
Create dashboards and alerts that answer:
"Why did this job slow down or hang?"
"Where did our GPU-hours go this week?"
"Which nodes/racks are degrading performance?"
Standardize logging/metrics patterns for training jobs to make triage consistent and fast
Improve scheduling, placement, and utilization
Reduce wasted capacity caused by fragmentation and poor placement
Implement or tune policies for:
Topology-aware placement and constraints for large distributed runs
Backfilling and queue discipline
Safe preemption/requeue behaviors
Partner with researchers to ensure scheduling supports:
Long-running pretraining
Evaluation runs and ablations
Fast iteration loops
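Topology-aware placement can be sketched as a greedy heuristic that packs a job into as few racks as possible to keep collective traffic local. The node/rack model below is a toy assumption; real schedulers (e.g. Slurm's topology plugins) do far more:

```python
from collections import defaultdict
from typing import Optional

def place_job(nodes: dict, gpus_needed: int) -> Optional[list]:
    """Greedy topology-aware placement sketch.

    `nodes` maps node name -> (rack id, free GPU count). Returns the chosen
    node names, or None if total free capacity is insufficient.
    """
    by_rack = defaultdict(list)
    for name, (rack, free) in nodes.items():
        if free > 0:
            by_rack[rack].append((name, free))
    # Try racks with the most free GPUs first, so the job spans fewer racks.
    racks = sorted(by_rack.items(), key=lambda kv: -sum(f for _, f in kv[1]))
    chosen, remaining = [], gpus_needed
    for _, rack_nodes in racks:
        for name, free in sorted(rack_nodes, key=lambda nf: -nf[1]):
            if remaining <= 0:
                break
            chosen.append(name)
            remaining -= min(free, remaining)
        if remaining <= 0:
            break
    return chosen if remaining <= 0 else None
```

Even this toy version shows the trade-off that matters in practice: packing for locality versus leaving contiguous headroom for the next large run.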
Automation and operational excellence
Eliminate manual toil with automation:
Safe auto-retry / auto-resume patterns
Hang detection and automated triage signals
Templates/guardrails for job submission
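A safe auto-retry pattern of the kind described above might look like this minimal sketch. The `launch` callable is a placeholder for resuming a job from its last checkpoint, and a real version would also classify failures (a node fault is retryable; a job-caused OOM is not) and cap the total retry budget:

```python
import time

def run_with_retries(launch, max_retries: int = 3, base_delay: float = 1.0):
    """Relaunch a failed job with exponential backoff.

    `launch` raises on failure and returns a result on success; in practice
    it would resume from the latest checkpoint rather than restart cold.
    """
    for attempt in range(max_retries + 1):
        try:
            return launch()
        except RuntimeError:
            if attempt == max_retries:
                raise  # out of retry budget: escalate to a human
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```

Every automated retry should also emit a triage signal, so repeated retries on the same node feed back into the health-scoring and quarantine loop rather than silently burning GPU-hours.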
Own incident response practices:
Clear escalation paths
Postmortems with action items
Measurable reductions in repeat incidents
Participate in an on-call rotation (with a strong automation-first culture).
Partner closely with research and performance engineering
Work closely with researchers and ML systems/perf engineers to identify system-level causes of training inefficiency (I/O stalls, stragglers, NCCL hangs, etc.). Provide a stable, observable platform that makes deep performance optimization work effective and repeatable.
What We're Looking For
Strong experience operating production systems with an SRE/reliability mindset (automation-first, measurable outcomes, incident discipline)
Experience with large-scale compute environments (GPU clusters, HPC, distributed compute, or cloud ML platforms)
Solid Linux fundamentals and comfort debugging across layers: kernel/driver, networking, storage, runtime, and application
Experience building observability systems: metrics, logs, traces, alerting, dashboards, and meaningful SLOs
Ability to diagnose ambiguous issues and drive them to resolution with clear hypotheses and experiments
Strong ownership and communication — able to coordinate across research and engineering in a fast-moving environment
Nice to Have (But Not Required)
Experience with HPC schedulers and placement systems (e.g., Slurm or similar)
Familiarity with GPU fleet health: ECC, throttling, NVLink/PCIe behavior, driver issues, burn-in practices
Experience debugging distributed training failure modes (e.g., hangs, stragglers, network-related stalls)
Familiarity with Kubernetes-based ML platforms, Ray, or workflow orchestration systems
Experience with high-throughput storage systems and their performance failure modes (tail latency, hotspots)
Prior exposure to ML training environments (PyTorch/JAX, even if you're not writing model code day-to-day)
Why This Role
Reliability is throughput — your work directly increases model iteration speed and research velocity across the entire team
Leverage at scale: improvements to cluster reliability and placement translate directly into meaningful effective compute capacity
High ownership at the frontier — help establish the infrastructure foundation as we build out a dedicated cluster reliability team supporting foundational models for real robots