ML Infra Engineer - Platform

Rhoda AI

Software Engineering, Data Science
Palo Alto, CA, USA
Posted on Mar 17, 2026

Location: Palo Alto
Employment Type: Full time
Department: Software

At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots — from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work — and with over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.

We train large models on a sizable NVIDIA B200 GPU cluster. The cluster runs at high capacity, and our next step is to make it predictable, reliable, and measurably efficient so researchers spend less time babysitting jobs and more time advancing model capability and real-world robot behavior. We're looking for a Cluster Reliability / SRE owner to help us build that operational foundation.

What You'll Do

Own fleet health and node reliability

  • Build and operate node health checks (GPU, CPU, memory, NIC, storage) and automated health scoring (a minimal sketch follows this list)

  • Detect and mitigate stragglers (e.g., thermal/power throttling, ECC issues, network degradation)

  • Implement automatic quarantine/drain policies and safe reintegration workflows

  • Drive uptime improvements through preventative maintenance, root-cause analysis, and runbooks
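
As a concrete illustration of the check-score-quarantine loop above, here is a minimal sketch in Python. It assumes a Slurm-managed cluster with nvidia-smi available on every node; the query fields, thresholds, and drain policy are illustrative assumptions rather than Rhoda AI's actual tooling.

```python
"""Minimal per-node GPU health check sketch, assuming a Slurm-managed cluster
with nvidia-smi on each node. Field names and thresholds are illustrative."""
import socket
import subprocess

# A few health signals per GPU; exact field names vary by driver version.
QUERY_FIELDS = "index,temperature.gpu,ecc.errors.uncorrected.volatile.total"
TEMP_LIMIT_C = 85          # hypothetical thermal threshold
ECC_UNCORRECTED_LIMIT = 0  # any uncorrected volatile ECC error is suspect


def gpu_health_issues() -> list[str]:
    """Return human-readable issues found on this node's GPUs."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    issues = []
    for line in out.strip().splitlines():
        idx, temp, ecc = [field.strip() for field in line.split(",")]
        if int(temp) > TEMP_LIMIT_C:
            issues.append(f"gpu{idx}: temperature {temp}C over {TEMP_LIMIT_C}C")
        # ECC counters can report "[N/A]" when ECC is disabled; skip those.
        if ecc.isdigit() and int(ecc) > ECC_UNCORRECTED_LIMIT:
            issues.append(f"gpu{idx}: {ecc} uncorrected volatile ECC errors")
    return issues


def drain_node(reason: str) -> None:
    """Quarantine this node in Slurm so the scheduler stops placing jobs on it."""
    node = socket.gethostname()
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
         f"Reason={reason}"],
        check=True,
    )


if __name__ == "__main__":
    problems = gpu_health_issues()
    if problems:
        drain_node("; ".join(problems))
```

In practice a check like this would run periodically (for example via Slurm's HealthCheckProgram hook), feed a fleet-wide health score, and pair with a reviewed reintegration path before a drained node rejoins the pool.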

Build observability and fast diagnosis

Establish "source of truth" telemetry for the cluster and its jobs:

  • GPU health and performance signals (clocks, throttling, error rates)

  • Network and storage performance indicators (latency, throughput, tail behavior)

  • Job-level health (retries, hangs, step-time anomalies)

Create dashboards and alerts that answer:

  • "Why did this job slow down or hang?"

  • "Where did our GPU-hours go this week?"

  • "Which nodes/racks are degrading performance?"

Standardize logging/metrics patterns for training jobs to make triage consistent and fast.
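
As one example of such a standardized pattern, below is a minimal sketch of a job-level step-time check that could back the "why did this job slow down or hang?" question. It assumes each job emits a timestamp per optimizer step to a log or metrics store; the thresholds and input shape are illustrative assumptions, not an existing internal interface.

```python
"""Minimal job-level step-time check sketch; thresholds are illustrative."""
import statistics
import time

HANG_TIMEOUT_S = 900    # no step for 15 minutes -> treat the job as hung
SLOWDOWN_FACTOR = 1.5   # recent steps 50% slower than baseline -> flag as slow


def classify(step_timestamps: list[float], now: float | None = None) -> str:
    """Return 'ok', 'slow', or 'hung' from one job's recent step timestamps."""
    now = time.time() if now is None else now
    if not step_timestamps or now - step_timestamps[-1] > HANG_TIMEOUT_S:
        return "hung"
    durations = [b - a for a, b in zip(step_timestamps, step_timestamps[1:])]
    if len(durations) < 20:
        return "ok"  # not enough history to establish a baseline yet
    baseline = statistics.median(durations[:-5])  # all but the newest steps
    recent = statistics.median(durations[-5:])    # the newest few steps
    return "slow" if recent > SLOWDOWN_FACTOR * baseline else "ok"
```

Keyed by job ID and wired into alerting, a check like this answers "why did this job slow down or hang?" without anyone tailing logs by hand.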

Improve scheduling, placement, and utilization

  • Reduce wasted capacity caused by fragmentation and poor placement (a small fragmentation report is sketched after this list)

  • Implement or tune policies for topology-aware placement and constraints for large distributed runs, backfilling and queue discipline, and safe preemption/requeue behaviors

  • Partner with researchers to ensure scheduling supports long-running pretraining, evaluation runs and ablations, and fast iteration loops
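
To make the fragmentation point concrete, the sketch below estimates how much nominally free GPU capacity is stranded on partially occupied nodes and therefore unavailable to whole-node distributed runs. The 8-GPU-per-node figure and the node inventory are illustrative assumptions.

```python
"""Minimal fragmentation report sketch: free GPUs that can't serve
whole-node distributed runs. Inputs are illustrative."""
GPUS_PER_NODE = 8  # hypothetical node shape


def fragmentation_report(free_gpus_by_node: dict[str, int]) -> dict[str, int]:
    """free_gpus_by_node maps node name -> idle GPUs on that node."""
    total_free = sum(free_gpus_by_node.values())
    whole_nodes = sum(1 for n in free_gpus_by_node.values() if n == GPUS_PER_NODE)
    usable_for_large_runs = whole_nodes * GPUS_PER_NODE
    return {
        "free_gpus": total_free,
        "free_whole_nodes": whole_nodes,
        "stranded_gpus": total_free - usable_for_large_runs,
    }


# Example: 16 GPUs are free, but only one node is fully idle,
# so 8 GPUs are stranded for jobs that need whole nodes.
print(fragmentation_report({"node01": 8, "node02": 3, "node03": 5}))
```

Tracking the stranded-GPU count over time is one simple way to see whether placement and backfilling changes are actually reclaiming capacity.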

Automation and operational excellence

Eliminate manual toil with automation:

  • Safe auto-retry / auto-resume patterns (one possible wrapper is sketched after this list)

  • Hang detection and automated triage signals

  • Templates/guardrails for job submission
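
For the auto-retry / auto-resume item, here is a minimal sketch of one possible wrapper. It assumes the launch command returns a non-zero exit code on failure, that a small set of exit codes can be treated as transient, and that the job script itself resumes from its newest checkpoint on resubmission; the command, exit codes, and backoff values are illustrative assumptions.

```python
"""Minimal auto-retry / auto-resume wrapper sketch; values are illustrative."""
import subprocess
import time

MAX_RETRIES = 3
RETRY_BACKOFF_S = 60
TRANSIENT_EXIT_CODES = {1, 137}  # hypothetical: generic failure, SIGKILL/OOM


def run_with_resume(launch_cmd: list[str]) -> int:
    """Run a training command, resubmitting only on transient failures."""
    result = None
    for attempt in range(1, MAX_RETRIES + 1):
        result = subprocess.run(launch_cmd)
        if result.returncode == 0:
            return 0
        if result.returncode not in TRANSIENT_EXIT_CODES:
            # Non-transient failure (e.g., bad config): surface it, don't retry.
            return result.returncode
        if attempt < MAX_RETRIES:
            print(f"attempt {attempt} failed (exit {result.returncode}); retrying")
            time.sleep(RETRY_BACKOFF_S * attempt)
    return result.returncode


# Example: the job script finds its latest checkpoint on startup, so a bare
# resubmission resumes rather than restarts (e.g., via Slurm's blocking submit):
# run_with_resume(["sbatch", "--wait", "train_job.sh"])
```

The same pattern extends naturally to hang detection: a watchdog that observes no step progress can kill the job and feed it back through the same transient-failure path.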

Own incident response practices:

  • Clear escalation paths

  • Postmortems with action items

  • Measurable reductions in repeat incidents

Participate in an on-call rotation (with a strong automation-first culture).

Partner closely with research and performance engineering

Work closely with researchers and ML systems/performance engineers to identify system-level causes of training inefficiency (I/O stalls, stragglers, NCCL hangs, and the like). Provide a stable, observable platform that makes deep performance-optimization work effective and repeatable.

What We're Looking For

  • Strong experience operating production systems with an SRE/reliability mindset (automation-first, measurable outcomes, incident discipline)

  • Experience with large-scale compute environments (GPU clusters, HPC, distributed compute, or cloud ML platforms)

  • Solid Linux fundamentals and comfort debugging across layers: kernel/driver, networking, storage, runtime, and application

  • Experience building observability systems: metrics, logs, traces, alerting, dashboards, and meaningful SLOs

  • Ability to diagnose ambiguous issues and drive them to resolution with clear hypotheses and experiments

  • Strong ownership and communication — able to coordinate across research and engineering in a fast-moving environment

Nice to Have (But Not Required)

  • Experience with HPC schedulers and placement systems (e.g., Slurm or similar)

  • Familiarity with GPU fleet health: ECC, throttling, NVLink/PCIe behavior, driver issues, burn-in practices

  • Experience debugging distributed training failure modes (e.g., hangs, stragglers, network-related stalls)

  • Familiarity with Kubernetes-based ML platforms, Ray, or workflow orchestration systems

  • Experience with high-throughput storage systems and their performance failure modes (tail latency, hotspots)

  • Prior exposure to ML training environments (PyTorch/JAX, even if you're not writing model code day-to-day)

Why This Role

  • Reliability is throughput — your work directly increases model iteration speed and research velocity across the entire team

  • Leverage at scale: improvements to cluster reliability and placement translate directly into meaningful effective compute capacity

  • High ownership at the frontier — help establish the infrastructure foundation as we build out a dedicated cluster reliability team supporting foundational models for real robots