Join the Frontier: Research Engineer, AI Benchmarking
Are you passionate about shaping how the world measures and trusts AI? We're seeking exceptional AI researchers and engineers to architect the next generation of LLM benchmarks—impacting how foundation models evolve and are adopted globally.
Your work will define the standards by which LLMs are judged—from everyday applications to breakthroughs in finance, healthcare, and beyond. You'll design, build, and analyze cutting‑edge evaluation pipelines, collaborating with leading model labs and enterprises. If you thrive at the intersection of deep research and real‑world impact, this is your stage.
What You'll Do
- Invent and build new benchmarks that test the boundaries of LLMs in real‑world scenarios
- Conduct rigorous research to ensure benchmarks are robust, valid, and actionable
- Collaborate with AI labs and enterprise partners to identify emerging evaluation needs
- Analyze and interpret model performance, communicating insights to diverse audiences
- Publish and present research findings in top venues, contributing to the evaluation community
- Work closely with infra engineers to scale your benchmark designs
- Stay ahead of the curve on LLM capabilities and evaluation methodologies
Your Background
- Advanced research experience: MS/PhD in CS, NLP, ML, or a related field (exceptional undergrads considered)
- Publication record: Papers at venues such as NeurIPS, ICML, ACL, or EMNLP, especially on NLP, ML evaluation, or benchmarking
- Python proficiency for prototyping and experimentation
- Excellent communicator, able to synthesize complex ideas for all audiences
- Collaborative spirit: Experience working in research teams, open to feedback
- Portfolio: Evidence of impactful research
Location: In‑person in San Francisco. Relocation/transportation support provided.
Bonus Points
- Experience with LLM evaluation, benchmarking, or foundation models
- Collaboration with industry or applied research partners
- Background in HCI, psychology, or domain‑specific evaluation
- Startup or early‑stage lab experience
- Contributions to open‑source evaluation tools/datasets
What's in It for You?
- Competitive salary & meaningful equity
- Relocation and transit support
- Unlimited PTO
- Opportunities to publish, present, and shape the field
Who We Are
Our founding team brings together leading experience from top research institutions and industry giants. The platform's core is rooted in advanced NLP evaluation research and is backed by premier investors. Our collective work is highly cited, and we're committed to setting the gold standard for AI benchmarking. Tech stack: React (TSX) frontend, Django backend, AWS infra.
What Matters Most
- Raw intelligence and research ability trump pedigree. We care about what you can build and discover.
- Ownership: We move fast and expect initiative. You'll have autonomy and a chance to make a visible impact.
- Intensity: The LLM landscape evolves at breakneck speed. We need researchers who thrive in a dynamic, high‑execution environment.
- Solution focus: Every evaluation challenge is an opportunity to innovate.
Seniority level: Mid‑Senior level
Employment type: Full‑time
Job function: Information Technology
Industries: Technology, Information & Media and Research Services