This role is open to candidates based in the San Francisco Bay Area, including San Francisco, the East Bay, the South Bay/Silicon Valley, and the Peninsula.
About the Role
We are building a GPU-native AI platform that provides model inference APIs, dedicated inference instances, and GPU infrastructure services for AI applications and agent workloads. Our platform supports multiple model categories, including:
We are looking for a Senior AI Inference Performance Engineer to help us optimize model serving performance across these workloads on our GPU infrastructure. This role sits at the intersection of machine learning systems, GPU architecture, inference engines, CUDA optimization, and production serving infrastructure.
You will be responsible for improving the throughput, latency, stability, and cost efficiency of model inference workloads running on our platform. This includes tuning model serving stacks, profiling and eliminating bottlenecks, optimizing GPU utilization, and working across both the software and system layers to achieve best-in-class inference performance.
Responsibilities
Required Qualifications