Job Details

Staff Site Reliability Engineer, Storage

2025-05-27 Crusoe San Francisco, CA

Description:

Crusoe is building the world's favorite AI-first cloud infrastructure company. We're pioneering purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Our mission is to align the future of computing with the future of the climate, with a platform recognized for reliability and performance, powered by clean energy.

About This Role

Our Site Reliability Engineering (SRE) team maintains the performance and reliability of our AI-optimized cloud infrastructure. The Storage-focused SRE ensures the availability, performance, and scalability of Crusoe's cloud storage products, supporting AI and HPC workloads. You will build and optimize distributed, fault-tolerant storage systems at scale to support our sustainable cloud platform.

Responsibilities

Develop automation and self-healing tools for our distributed storage infrastructure, including block, file, and object storage systems.
Drive reliability initiatives around data replication, encryption, backup and restore strategies, and failover mechanisms.
Collaborate with storage engineers to implement high-performance NVMe and SSD-backed volumes supporting large-scale AI compute clusters.
Support user-facing storage services, with a focus on availability, performance, and error budget adherence.
Investigate and resolve storage incidents using telemetry, logs, and profiling; work with hardware and kernel teams to diagnose low-level I/O issues.
Contribute to designing fault-tolerant, scalable storage architectures for AI cloud environments.

Qualifications

8+ years of experience in Storage SRE, systems, or storage engineering.
Hands-on experience with distributed storage systems such as Ceph, GlusterFS, or OpenEBS.
Proficiency in programming languages such as Go, Python, Java, or C.
Experience with Infrastructure as Code tools like Terraform, Ansible, or Puppet.
Deep knowledge of Linux internals, especially I/O, memory management, and storage scheduling.
Familiarity with storage protocols such as NFS, SMB, iSCSI, and NVMe-oF.
Experience with containers and container orchestration (Docker, Kubernetes).
Strong troubleshooting, incident response, and documentation skills.
Experience managing storage services on cloud platforms (AWS, GCP, Azure).
Excellent communication skills.
Ability to pass a background check.

Benefits

Hybrid work schedule
Competitive salary and Restricted Stock Units
Health insurance, HSA contributions, paid parental leave, life insurance, disability coverage
Additional perks: Teladoc, 401(k) match, paid time off, cell reimbursement, tuition reimbursement, wellness subscriptions, legal services, commuter benefits

Compensation

Up to $250,000/year plus bonus and RSUs, based on experience and internal equity.

Additional Information

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to legally protected statuses.

Job Details

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and IT
