Job Details

Senior Platform & Reliability Engineer

2026-06-17 OpenArt AI San Francisco, CA

Description:

OpenArt is an AI Storytelling and Visual Creation Platform used by millions worldwide. We're building the next generation of creative tools powered by cutting-edge AI, enabling anyone to create videos, visuals, characters, and stories with unprecedented speed and imagination. We believe the future of creativity is AI-native, and we're shaping that future.

About the Role

We're looking for a Senior Platform & Reliability Engineer to help design, scale, and improve the reliability of our infrastructure, from architectural decisions to hands-on implementation, observability, and cost optimization. This is not a traditional ops or DevOps role. You'll work across cloud infrastructure, distributed systems, backend services, and developer tooling, making pragmatic decisions that balance product velocity, system reliability, and cost efficiency in a fast-moving, AI-native environment. You'll partner closely with product engineers to evolve the platform that powers OpenArt, contributing to key decisions around infrastructure architecture, improving multi-provider AI reliability, and helping us scale systems to millions of users while raising the overall engineering bar.

What You'll Do

Define and operationalize SLOs/SLIs across critical user journeys (generation, editing, payments/credits, uploads), and use them to guide prioritization and tradeoffs.
Participate in an on-call rotation and improve incident response (alert quality, runbooks, escalation paths), including leading blameless postmortems and driving follow-through on action items.
Improve system resilience at external boundaries (AI providers, storage, etc.), including timeouts, retries, circuit breakers, and fallback strategies.
Build and maintain end-to-end observability (logs, metrics, traces, dashboards) so engineers can quickly understand "what broke" and "why."
Strengthen deploy safety through CI/CD improvements, automated rollbacks, canary releases, and feature flag patterns.
Contribute to the evolution of our infrastructure architecture, helping evaluate when to extend serverless patterns vs. adopt containerized or more managed approaches as we scale.
Improve cost visibility and efficiency, including per-request cost attribution, caching strategies, and capacity planning.
Act as a strong technical contributor, helping improve engineering practices, tooling, and system design decisions across the team.

What We're Looking For

Core Requirements

5+ years building and operating production systems where reliability and scaling are important.
Strong software engineering skills: you can build and ship production code, not just configure infrastructure.
Experience with cloud-native systems (AWS or GCP), including serverless/event-driven architectures and at least one container-based approach (e.g., ECS/Fargate, Cloud Run, Kubernetes).
Solid understanding of observability and reliability practices: metrics, alerting, tracing, and incident response.
Experience designing resilient systems with external dependencies (timeouts, retries/backoff, idempotency, circuit breakers).
Ability to communicate technical tradeoffs clearly to engineers across different domains.
Comfortable operating in ambiguous, fast-moving environments and taking ownership of problems.

Nice to Have

Experience building internal platform abstractions (e.g., job orchestration, API layers, workflow systems) that improve team velocity.
Track record of improving reliability metrics (e.g., MTTR, SLO attainment, latency) or reducing infrastructure cost.
Experience working in a startup or high-growth environment, with broad ownership across systems.

Tech Stack You'll Work With

GCP, Cloud Run, Modal, Upstash, Sentry, Amplitude, Firebase, Redis, React/Next.js, Node.js, TypeScript, Python, etc.

Compensation

Competitive base salary and bonus program
Equity - meaningful ownership in what you build
High-autonomy, high-growth environment

Work Setup

Bay Area preferred (hybrid allowed)
Visa sponsorship available
We'll consider remote
