This hybrid role spans platform reliability and infrastructure engineering. You'll be instrumental in ensuring high availability, fault tolerance, and performance across GPU cluster environments for both internal research and external customers. Responsibilities include automating GPU cluster onboarding; enhancing monitoring, logging, and security systems; and developing new backend features.
Required Skills and Certifications: