This role is open to candidates based in the San Francisco Bay Area, including San Francisco, the East Bay, the South Bay/Silicon Valley, and the Peninsula.
About the Role
We are building a GPU-native AI platform that provides model inference APIs, dedicated inference instances, and GPU infrastructure services for AI applications and agent workloads. Our platform supports multiple model categories, including:
We are looking for a Senior AI Inference Performance Engineer to help us optimize model serving performance across these workloads on our GPU infrastructure. This role sits at the intersection of machine learning systems, GPU architecture, inference engines, CUDA optimization, and production serving infrastructure.
You will be responsible for improving the throughput, latency, stability, and cost efficiency of model inference workloads running on our platform. This includes tuning model serving stacks, profiling and eliminating bottlenecks, optimizing GPU utilization, and working across both the software and system layers to achieve best-in-class inference performance.
Responsibilities
Required Qualifications