Job Details

Staff Site Reliability Engineer, Storage

  2025-05-27     Crusoe     San Francisco, CA
Description:

Crusoe is building the world's favorite AI-first cloud infrastructure company. We're pioneering purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Our mission is to align the future of computing with the future of the climate, with a platform recognized for reliability and performance, powered by clean energy.

About This Role

Our Site Reliability Engineering (SRE) team maintains the performance and reliability of our AI-optimized cloud infrastructure. The Storage-focused SRE ensures the availability, performance, and scalability of Crusoe's cloud storage products, supporting AI and HPC workloads. You will build and optimize distributed, fault-tolerant storage systems at scale to support our sustainable cloud platform.

Responsibilities

  1. Develop automation and self-healing tools for our distributed storage infrastructure, including block, file, and object storage systems.
  2. Drive reliability initiatives around data replication, encryption, backup, restore strategies, and failover mechanisms.
  3. Collaborate with storage engineers to implement high-performance NVMe and SSD-backed volumes supporting large-scale AI compute clusters.
  4. Support user-facing storage services focusing on availability, performance, and error budget adherence.
  5. Investigate and resolve storage incidents using telemetry, logs, and profiling; diagnose low-level I/O issues with hardware and kernel teams.
  6. Contribute to designing fault-tolerant, scalable storage architectures for AI cloud environments.

Qualifications

  • 8+ years of experience in Storage SRE, systems, or storage engineering.
  • Hands-on experience with distributed storage systems such as Ceph, GlusterFS, or OpenEBS.
  • Proficiency in programming languages such as Go, Python, Java, or C.
  • Experience with Infrastructure as Code tools like Terraform, Ansible, or Puppet.
  • Deep knowledge of Linux internals, especially I/O, memory management, and storage scheduling.
  • Familiarity with storage protocols such as NFS, SMB, iSCSI, and NVMe-oF.
  • Experience with containers and container orchestration, such as Docker and Kubernetes.
  • Strong troubleshooting, incident response, and documentation skills.
  • Experience managing storage services on cloud platforms (AWS, GCP, Azure).
  • Excellent communication skills and the ability to pass a background check.

Benefits

  • Hybrid work schedule
  • Competitive salary and Restricted Stock Units
  • Health insurance, HSA contributions, paid parental leave, life insurance, disability coverage
  • Additional perks: Teladoc, 401(k) match, paid time off, cell reimbursement, tuition reimbursement, wellness subscriptions, legal services, commuter benefits

Compensation

Up to $250,000/year plus bonus and RSUs, based on experience and internal equity.

Additional Information

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to legally protected statuses.

Job Details

  • Seniority level: Mid-Senior level
  • Employment type: Full-time
  • Job function: Engineering and IT