I'm partnered with an AI infrastructure startup building a new foundation for how large-scale AI systems run.
They're addressing fundamental limits in power, cost, and hardware by decoupling workloads from infrastructure and enabling heterogeneous compute across CPUs, GPUs, and emerging accelerators.
Strong early traction:
• $80M Series A
• Deployments with Fortune 500 + AI-native companies
• Working directly with foundation labs and hyperscalers
The Role
This is a core distributed systems role focused on building the platform that runs AI workloads at scale.
You will build systems that schedule, route, and operate workloads across thousands of nodes in production.
Typical problems:
• Distributed scheduling and orchestration
• Resource allocation across large-scale systems
• Reliability, fault tolerance, and failure handling
You'll work across the stack, from compilers and runtimes down to hardware, to ensure performance and correctness.
What They're Looking For
• Proven ownership of distributed systems in production
• Strong Kubernetes experience
• Deep understanding of concurrency, failure modes, and system tradeoffs
• Strong programming in Go, C++, or Python
Ideal Additional Experience
• Experience with ML inference systems or performance-critical workloads
• Familiarity with scheduling, queues, or resource management systems
Does this pique your interest? Let's chat -