Job Details

Senior SRE & Infra Engineer (GPU Cluster Platform Reliability & Infrastructure Engineer)

2025-11-04 Macpower Digital Assets Edge San Francisco, CA

Description:

This hybrid role spans platform reliability and infrastructure engineering. You'll be instrumental in ensuring high availability, fault tolerance, and performance across GPU cluster environments for internal research and external customers. Responsibilities include automating GPU cluster onboarding; enhancing monitoring, logging, and security systems; and developing new backend features.

Required Skills and Certifications:

Proven experience with monitoring tools (e.g., Prometheus, Grafana) and incident management practices.
Strong skills in infrastructure automation with Ansible, Terraform, or similar.
Deep understanding of logging frameworks, alerting systems, and proactive monitoring solutions.
Proficiency in Python for developing automation scripts, REST APIs, and backend support tools.
Hands-on experience with Kubernetes and cloud platforms (GCP preferred).
Knowledge of high-performance networking and real-time systems.
