Job Details


Staff ML Ops Engineer

2026-05-11 | Albert Invent | Oakland, CA

Description:

About the role

As our Backend & Infrastructure Engineer, you will architect and build the core systems that power everything our AI/ML team delivers: the APIs, infrastructure, and distributed systems that make intelligent capabilities possible at scale. This is a foundational role: you'll shape how AI gets built and shipped here.

We are seeking a highly motivated and talented individual with deep expertise in Python backend development, Kubernetes, and distributed systems. You'll be embedded with ML engineers and researchers, building robust systems that turn ambitious AI ideas into production realities, whether that's powering agent-based workflows, scaling inference, or enabling scientific computing pipelines. The infrastructure you build will directly enable researchers at the world's largest chemical and materials companies to leverage AI in ways that weren't possible before: accelerating discovery, enabling inverse design of novel materials, and transforming how science gets done.

What you'll do

- Design, deploy, and maintain Kubernetes infrastructure supporting AI/ML workloads
- Manage containerized services, autoscaling, networking, and resource optimization
- Design and build high-performance Python APIs and services using FastAPI or similar frameworks
- Architect backend systems for scalability, reliability, and low latency
- Build integrations between AI/ML systems and the broader Albert platform

Distributed Systems:
- Build and operate distributed systems that handle compute-intensive and high-throughput workloads
- Design for fault tolerance, graceful degradation, and horizontal scalability
- Implement async workflows, job queues, and task orchestration as needed

Data Infrastructure:
- Architect and maintain data pipelines and storage systems supporting AI/ML workflows
- Work with vector databases, caches, and other data stores as required by ML systems
- Ensure efficient data access patterns for training and inference workloads

Reliability & Operations:
- Implement observability including logging, metrics, tracing, and alerting
- Own system reliability: troubleshoot issues, conduct post-mortems, and continuously improve
- Design CI/CD pipelines and promote automation best practices
- Implement infrastructure-as-code practices using Terraform, Helm, Argo CD, Pulumi, or similar tools
- Partner closely with ML engineers to understand requirements and deliver production-ready infrastructure
- Translate ML prototypes and research code into scalable, maintainable systems
- Contribute to technical decisions that shape the team's architecture

You will have

- Strong Kubernetes and cloud infrastructure experience
- A builder's mindset: you want to create foundational systems that others build on
- Genuine interest in science and technology; curiosity about how your work enables scientific discovery
- A commitment to building systems that are reliable, maintainable, and scalable
- A degree in Computer Science or a related field with 7+ years of industry experience (Bachelor's) or 5+ years (Master's or PhD) in software engineering
- Experience supporting AI/ML teams or deploying ML systems in production
- Experience with GPU workloads and scheduling
- Advanced proficiency in Python including async programming and performance optimization
- Deep experience with Kubernetes: cluster management, networking, autoscaling, and troubleshooting
- Strong background in distributed systems and microservices architecture
- Experience with cloud platforms (AWS, GCP, or Azure) and infrastructure-as-code
- Proficiency in REST API development using FastAPI, Flask, or similar
- Experience with containerization and CI/CD pipelines
- Track record of operating production systems at scale

Preferred/Bonus Points

- Familiarity with scientific computing or research environments
- Background in or curiosity about chemistry, materials science, or related fields
- Familiarity with data engineering tools (Airflow, Dagster, or similar)
- Experience with vector databases or search infrastructure
- Expertise in observability tools (Prometheus, Grafana, Datadog)
- Experience with message queues and event-driven architectures (Kafka, Redis, RabbitMQ)
- Contributions to open-source projects
- Experience mentoring engineers

Why Albert?

We have a huge impact. Albert is a growing team with a big reach. Our Platform facilitates the invention of materials for tens of thousands of companies and hundreds of thousands of applications, from coatings used on rockets to adhesives used in electric vehicles to 3D printed medical devices.

We love distributed teams. Albert's home base is in the California Bay Area, but we have multiple offices and employees sprinkled around the globe. In fact, over 50% of our employees work outside of California! An international remote culture is in our DNA.

We care about you. Albert works hard to create a positive environment for our employees, and we think your life outside of work is important too. We work hard and we play hard.

We value diversity. Growing and maintaining our inclusive and diverse team matters to us. We are committed to being a company where our employees feel comfortable bringing their authentic selves to work and have the ability to be successful, every day.

We're always looking for humble, sharp, and creative folks to join the Albert team. If you think you might be a fit, please apply!
