Staff Product Manager - AI Eval Platform

Dropbox

Full-time · USA · $184k-$310k per year
product management · assistant · leadership · communication · analytics

Role Description

Dropbox is seeking a Staff Product Manager to lead AI Evaluations (Evals) — the systems, metrics, and processes that measure the quality and reliability of AI-powered features across Dropbox. As a Staff Product Manager within the Dash organization, you will play a crucial role in shaping how we measure and evaluate our AI-powered assistant and features. In this role, you’ll define how we evaluate model performance, accuracy, and user satisfaction across diverse AI surfaces such as Dash, search, summarization, and intelligent organization. You will be responsible for a core platform that enables every product team at Dropbox to launch new AI features with confidence, armed with the tools to measure their success both online and offline.

You’ll collaborate closely with Applied AI, Data Science, and Research to design frameworks that ensure our AI features are helpful, safe, and high-quality. This includes everything from defining success metrics for model improvements, to building scalable pipelines that assess qualitative and quantitative signals at scale.

This role sits at the intersection of AI systems, data rigor, and product judgment — ideal for a PM who loves turning ambiguity into measurable progress and ensuring that every AI interaction meets a bar of excellence.

Responsibilities

  • Define and drive the roadmap for Dropbox’s AI Evaluation Framework, covering both quantitative metrics and human-in-the-loop systems.

  • Define the strategic vision and north-star framework for how Dropbox measures AI performance, setting unified principles for quality, correctness, relevance, and reliability across Dash and other AI features.

  • Take end-to-end ownership of offline scoring pipelines, online instrumentation, dashboards, APIs, and LLM-as-Judge components used by all product teams.

  • Build and scale a self-serve measurement platform that enables any Dropbox team to launch features, run experiments, and measure performance with minimal friction.

  • Collaborate cross-functionally with ML, product, engineering, research, and data science to operationalize evaluation pipelines, design rubrics, and ensure metrics are valid, reproducible, and reliable.

  • Establish and maintain company-wide evaluation standards by defining rubrics, extending scorer taxonomies, and setting guidelines that become the foundation for AI quality measurement and benchmarking.

  • Integrate measurement systems into the product lifecycle by partnering with PMs and engineering to ensure evaluation and feedback loops are embedded from ideation through launch and iteration.

  • Communicate results, insights, and trade-offs to senior leadership, influencing product decisions and roadmap prioritization through clear storytelling backed by rigorous data.

Requirements

  • 10+ years of experience building measurement, analytics, or evaluation platforms, ideally in an ML/AI context (e.g., experimentation platforms, metrics infrastructure, evaluation pipelines), with an understanding of the end-to-end AI development lifecycle, from model training to deployment and monitoring.

  • BS/MS in Computer Science, Engineering, Business, Information Systems, Applied Math or Statistics, or relevant experience.

  • Experience designing and deploying evaluation frameworks and pipelines, e.g., offline vs. online evaluation, metric definition and calibration, and human and model adjudication where needed.

  • Deep understanding of ML evaluation, metrics, and statistics, e.g., AUC, precision/recall, calibration, bias detection, variance, and error analysis.

  • Technical fluency and the ability to partner with software engineers and data scientists: comfortable reasoning about pipelines, APIs, performance, scale, latency, and system trade-offs; able to engage in deep technical discussions and translate complex technical concepts into clear product requirements.

  • Strong cross-functional collaboration skills. You will need to work with PMs, researchers, engineers, data teams, labeling teams, and senior leaders.

  • Exceptional written and verbal communication skills, with a demonstrated ability to create clear, structured product documents and effectively communicate vision, trade-offs, and progress to stakeholders at all levels, including executives.

  • A bias, fairness, and robustness mindset: experience (or sensitivity) in designing evaluations with fairness, adversarial robustness, and edge cases in mind.

Preferred Qualifications

  • Experience developing or implementing LLM-based evaluation frameworks in a RAG (Retrieval-Augmented Generation) context, leveraging LLM-as-a-Judge for online evaluations.

  • Hands-on experience with prompt evaluation, rubric design, human-in-the-loop evaluation, and adversarial test design.

  • Familiarity with experimentation at scale, including test design and measurement, e.g., A/B testing systems, causal inference, and counterfactual measurement.

  • 5+ years of experience building self-service internal platforms, ML infrastructure, SDKs, or APIs.

  • Experience building platforms or internal tools for technical users, such as developers, and non-technical audiences alike.

  • PhD or advanced degree in a quantitative field (CS, ML, statistics, etc.).

Compensation

  • US Zone 1: $229,500–$310,500 USD

  • US Zone 2: $206,600–$279,500 USD

  • US Zone 3: $183,600–$248,400 USD
