Job Details

Senior Site Reliability Engineer

2026-06-14 Hyperbolic Labs San Francisco, CA

Description:

Site Reliability Engineer

We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. As an aggregator of compute resources from hundreds of global suppliers, our SLOs, trust, and economic efficiency are product-critical. You'll be responsible for defining and maintaining service level objectives for job success rates, building robust incident response systems, managing capacity across our distributed GPU network, and implementing secure rollout and rollback mechanisms that keep our platform running smoothly 24/7.

In this role, you'll establish the reliability standards that define customer trust in our platform, design monitoring and alerting systems that provide deep visibility into our infrastructure, build automation for capacity management and resource allocation, lead incident response and post-mortem processes, and work closely with engineering teams to improve system resilience. You'll also focus on security and infrastructure hardening, ensuring strong isolation between tenants and suppliers, implementing key management systems, and building compliance frameworks. This is a high-impact position where your work directly influences our ability to deliver on our promise of affordable, accessible AI compute at scale.

Who You Are

Experience architecting, deploying, and managing large-scale Kubernetes environments, including cluster administration, container orchestration, autoscaling, service discovery, and high-availability infrastructure, to ensure the reliability and scalability of mission-critical systems
Experience leading troubleshooting and performance optimization across Kubernetes-based production environments, including proactively identifying system bottlenecks, automating remediation workflows, and improving overall platform stability and uptime
Strong automation mindset with experience using infrastructure-as-code, configuration management, and CI/CD pipelines
Strong background in capacity planning and management, including forecasting, resource allocation, and cost optimization for distributed systems
Experienced in incident response, on-call rotations, and post-mortem processes with a track record of reducing MTTR and improving system resilience
Deep knowledge of deployment systems including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms
Proficient in observability tools and practices including metrics, logging, tracing, and alerting systems (Prometheus, Grafana, ELK stack, or similar)
Strong understanding of infrastructure security including tenant isolation, workload isolation, network segmentation, and security hardening
Experience with secrets management, key management systems (KMS), certificate management, and secure credential rotation
Expert in site reliability engineering with proven experience defining, monitoring, and maintaining SLOs and SLAs for production systems
Knowledge of compliance frameworks and security best practices for cloud platforms (SOC 2, ISO 27001, or similar)
Excellent problem-solving skills, with the ability to debug complex distributed-systems issues under pressure

Preferred Qualifications

Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale
Background in distributed systems, peer-to-peer networks, or decentralized infrastructure
Knowledge of multi-tenancy security patterns, container security, and runtime security tools
Experience with chaos engineering, fault injection, and resilience testing
Familiarity with cost optimization strategies for cloud infrastructure and GPU resources
Experience building and operating systems with demanding uptime requirements (99.9%+ SLAs)
Background at companies like AWS, Google Cloud, Azure, or fast-growing infrastructure startups
Contributions to open-source reliability, observability, or security tools

Hyperbolic is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
