Overview
Machine Learning Engineer — Post-Training, Evaluation & Continuous Improvement
Location: Bay Area preferred (remote-first; ~10% in-office for workshops/on-sites)
Employment: Full-time (new grads welcome) • Internship/Co-op options available
About AlphaX
AlphaX builds financial reasoning models and agent workflows for professional investment research. Our stack spans RLHF/DPO post-training, reasoning workflow generation, agent tooling & model routing, and a financial data lake (prices, filings, transcripts, news).
The Role
Own the complete post-training life cycle for AlphaX's financial reasoning model—from data curation through human-in-the-loop feedback, evaluation, regression gating, and continuous improvement. You'll stand up and maintain the training/eval pipelines that turn research ideas into measurable production gains across our analyst-style task suite (e.g., earnings analysis, risk scoring, forecast memos).
What You'll Do
- Post-training & alignment
  - Run and refine SFT and preference-based training (RLHF, DPO, KTO, or similar) on finance-specific datasets; a minimal DPO sketch follows this list.
  - Train and version reward models and rubric-based scorers for reasoning quality, factuality, and safety.
- Data & human-in-the-loop feedback
  - Build ingestion and cleaning pipelines for filings, transcripts, market data, and analyst workflows; deduplicate, redact, and split with leakage controls (see the split sketch below).
  - Operate human-in-the-loop feedback loops (experts, students, crowd workers) with clear rubrics and QA.
- Evaluation & regression gating
  - Design a multi-layer eval harness: unit tests for tools/prompts (example below), scenario suites for research tasks, red-team probes, and latency/cost tracking.
  - Implement automated A/B and canary gating with statistically sound decision rules and regression alerts; one candidate rule is sketched below.
  - Instrument chain-of-thought-free scoring proxies, tool-use success rates, and multi-step task completion.
- Serving & infrastructure
  - Tune prompt policies, tool-calling strategies, and model routing (OpenAI/Claude/Gemini, etc.) behind a consistent interface (see the router sketch below).
  - Ship pipelines on GPUs, schedule jobs, track experiments, and maintain reproducible artifacts and datasets.
  - Add observability for drift, outliers, and PII/financial-compliance checks (a drift sketch follows).
- Continuous improvement
  - Close the loop: mine failures from production, generate counter-examples, synthesize new training/eval data, and retrain on tight feedback cycles.
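To ground the preference-training bullet, here is a minimal PyTorch sketch of the DPO objective. It assumes summed per-completion token log-probs under the policy and a frozen reference model are already computed; the function and argument names are illustrative, not AlphaX code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratio of policy vs. reference for each completion.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Widen the margin between chosen and rejected, scaled by beta.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```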
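The leakage-control bullet could look like the sketch below: exact dedup by content hash plus a time-based train/eval split, assuming each document carries `text` and an ISO-format `date` field (so string comparison orders dates correctly). One simple approach, not the team's actual pipeline.

```python
import hashlib

def dedup_and_split(docs: list[dict], cutoff_date: str):
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:  # drop exact duplicates
            seen.add(digest)
            unique.append(doc)
    # Temporal split: everything on/after the cutoff is held out for eval,
    # so eval documents never leak into (or predate) the training set.
    train = [d for d in unique if d["date"] < cutoff_date]
    held_out = [d for d in unique if d["date"] >= cutoff_date]
    return train, held_out
```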
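The unit-test layer of the eval harness might be plain pytest around deterministic tools, run before any model-level scoring; `parse_eps` below is a hypothetical helper used only for illustration.

```python
import re
import pytest

def parse_eps(text: str) -> float:
    """Toy stand-in for a transcript tool: extract an EPS figure."""
    match = re.search(r"EPS of \$([0-9]+\.[0-9]+)", text)
    if match is None:
        raise ValueError("no EPS figure found")
    return float(match.group(1))

def test_parse_eps_happy_path():
    assert parse_eps("Q3 diluted EPS of $1.42, up 8% YoY") == 1.42

def test_parse_eps_rejects_missing_figure():
    with pytest.raises(ValueError):
        parse_eps("Revenue grew 12% on strong demand")
```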
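One example of a "statistically sound decision rule" for canary gating (an assumption, not the team's actual gate): block promotion if the candidate's pass rate on the eval suite is significantly worse than the baseline's under a one-sided two-proportion z-test.

```python
import math

def allow_promotion(base_pass: int, base_n: int,
                    cand_pass: int, cand_n: int,
                    alpha: float = 0.05) -> bool:
    # Pooled standard error under H0: equal pass rates (assumes
    # non-degenerate rates, i.e., pooled not exactly 0 or 1).
    p_base, p_cand = base_pass / base_n, cand_pass / cand_n
    pooled = (base_pass + cand_pass) / (base_n + cand_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cand_n))
    z = (p_cand - p_base) / se          # negative => candidate looks worse
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z <= z)
    return p_value > alpha              # block on a significant regression
```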
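Model routing "behind a consistent interface" could be as simple as the Protocol-plus-router sketch below; the class and task names are made up, and real OpenAI/Claude/Gemini clients would sit behind the adapters.

```python
from typing import Protocol

class ChatBackend(Protocol):
    def complete(self, prompt: str) -> str: ...

class ModelRouter:
    """Pick a backend per task label; fall back to a default."""

    def __init__(self, backends: dict[str, ChatBackend], default: str):
        self.backends = backends
        self.default = default

    def complete(self, prompt: str, task: str | None = None) -> str:
        # e.g., task="earnings_analysis" -> a long-context backend.
        backend = self.backends.get(task or "", self.backends[self.default])
        return backend.complete(prompt)
```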
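For the drift-observability bullet, one common choice (assumed here, not prescribed) is the population stability index between a reference window and live traffic for a numeric feature; values above roughly 0.2 are a conventional rule of thumb for actionable drift.

```python
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    # Bin edges from reference quantiles; open ends catch out-of-range values.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    live_frac = np.histogram(live, edges)[0] / len(live)
    eps = 1e-6  # avoid log(0) on empty bins
    ref_frac, live_frac = ref_frac + eps, live_frac + eps
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))
```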
Qualifications
- BS/MS in CS/EE/Math or equivalent (rising seniors welcome for the internship/co-op track), with hands-on experience shipping post-training pipelines.
- Practical experience with one or more: RLHF/DPO/KTO, reward modeling, or structured preference data.
- Experience building training and evaluation pipelines end-to-end (data → train → eval → release), including experiment tracking (e.g., MLflow/W&B) and artifact/version control (e.g., DVC, Git-LFS).
- Comfort reading financial text and reasoning about factuality & compliance (you don't need to be an investor, just curious and precise).
Tech You'll Touch
Python, PyTorch, Ray, MLflow/W&B, Airflow/Prefect, GPUs, Postgres/BigQuery, vector DBs, Docker/K8s, and major model APIs (OpenAI/Claude/Gemini).
Work Setup
Remote-first with ~10% in-office (Bay Area) for collaboration sprints, model & agent jam sessions, and eval workshops.