Overview
We are building a distributed LLM inference network that combines idle GPU capacity from around the world into a single cohesive plane of compute for running large language models like DeepSeek and Llama 4. At any given moment, over 5,000 GPUs and hundreds of terabytes of VRAM are connected to the network. We are a small, well-funded team working on difficult, high-impact problems at the intersection of AI and distributed systems. We primarily work in person from our office in downtown San Francisco.
Responsibilities
- Design and implement optimization techniques to increase model throughput and reduce latency across our suite of models
- Deploy and maintain large language models at scale in production environments
- Deploy new models as they are released by frontier labs
- Implement techniques like quantization, speculative decoding, and KV cache reuse
- Contribute regularly to open-source projects such as SGLang and vLLM
- Dive deep into the underlying codebases of TensorRT, TensorRT-LLM, PyTorch, vLLM, SGLang, CUDA, and other libraries to debug ML performance issues
- Collaborate with the engineering team to bring new features and capabilities to our inference platform
- Develop robust and scalable infrastructure for AI model serving
- Create and maintain technical documentation for inference systems
Requirements
- 3+ years of experience writing high-performance, production-quality code
- Strong proficiency with Python and deep learning frameworks, particularly PyTorch
- Demonstrated experience with LLM inference optimization techniques
- Hands-on experience with SGLang and vLLM, with contributions to these projects strongly preferred
- Familiarity with Docker and Kubernetes for containerized deployments
- Experience with CUDA programming and GPU optimization
- Strong understanding of distributed systems and scalability challenges
- Proven track record of optimizing AI models for production environments
Nice to Have
- Familiarity with TensorRT and TensorRT-LLM
- Knowledge of vision models and multimodal AI systems
- Experience implementing techniques like quantization and speculative decoding
- Contributions to open-source machine learning projects
- Experience with large-scale distributed computing
Compensation
We offer competitive compensation, equity in a high-growth startup, and comprehensive benefits. The base salary range for this role is $180,000 to $250,000. Benefits include:
- Full healthcare coverage
- Quarterly offsites
- Flexible PTO
Skills: pytorch, gpu optimization, deep learning frameworks, sglang, vllm, cuda programming, machine learning, python, llm