Software Engineer, ML Infrastructure - Training Platform

2025-05-08 | Scale AI | San Francisco, CA
Description:

Scale is seeking an AI/ML Infrastructure Engineer to join our Machine Learning Infrastructure team to develop our Training Platform. In this role, you will collaborate closely with Machine Learning researchers to understand their needs and leverage your expertise and our compute resources to enhance experimentation throughput.


The ideal candidate has strong fundamentals in machine learning and backend system design, along with prior experience in ML infrastructure. Comfort with infrastructure, large-scale system design, and diagnosing both model performance issues and system failures is essential.


You will:

  • Build highly available, observable, performant, and cost-effective APIs for model training.
  • Participate in our on-call process to ensure service availability.
  • Manage projects end-to-end, from requirements and scoping to design and implementation, within a collaborative, cross-functional environment.
  • Exercise good judgment in system and tool building, balancing build vs. buy decisions with cost considerations.

Ideally you'd have:

  • 4+ years of experience with machine learning training pipelines or inference services in production.
  • Experience with distributed training frameworks and techniques such as DeepSpeed and FSDP.
  • Experience developing, deploying, and monitoring complex microservice architectures.
  • Proficiency in Python, Docker, Kubernetes, and Infrastructure as Code (e.g., Terraform).

Nice to haves:

  • Experience with LLM inference latency optimization techniques such as kernel fusion, quantization, and dynamic batching.
  • Experience working with cloud platforms such as AWS or GCP.

Compensation packages include base salary, equity, and benefits. The salary range varies by location and other factors. Benefits include health, dental, vision, retirement, learning stipends, and generous PTO. Additional benefits may include commuter stipends.


Location-specific salary range in San Francisco, New York, and Seattle: $160,000 to $225,600 USD.


Note: Our policy requires a 90-day waiting period before reconsidering candidates for the same role.


About Us:

At Scale, we aim to accelerate the transition to AI across industries. Our products power advanced LLMs, generative models, and computer vision models, trusted by leading AI companies and organizations worldwide. We promote an inclusive workplace and are committed to equal opportunity employment. For accommodations during the application process, contact ...@scale.com.


We adhere to the US Department of Labor's Pay Transparency and privacy policies. Personal data collected is used solely for employment-related purposes and managed according to our privacy policy.
