Evaluation Engineer

Elicit

Full-time
USA
$140k-$200k per year
engineer
python
statistics
software engineering
processing

About Elicit

Elicit is an AI research platform that uses language models to help researchers figure out what's true and make better decisions, starting with common research tasks like literature review.

What we're aiming for:

  • Elicit radically increases the amount of good reasoning in the world.

    • For experts, Elicit pushes the frontier forward.

    • For non-experts, Elicit makes good reasoning more affordable. People who don't have the tools, expertise, time, or mental energy to make well-reasoned decisions on their own can do so with Elicit.

  • Elicit is a scalable ML system based on human-understandable task decompositions, with supervision of process, not outcomes. This expands our collective understanding of safe AGI architectures.

Visit our Twitter to learn more about how Elicit is helping researchers and making progress on our mission.

The mission of Elicit evals

Some orgs build evals to warn us about dangerous capabilities. Others build evals to understand trends and predict future developments. Yet others build evals to hill-climb towards models that users will like more.

At Elicit, we're focused on something different—we want to understand, and hill-climb towards, models that help us make better decisions.

This is tougher than 'what will users like better'—it's hard to evaluate decision support, and users' knee-jerk reactions may not align with what actually helps for decision-making. Because it's hard, and because the sales pitch is more complicated, there aren't many doing this well. If we nail this, we have a unique opportunity to push AI toward helping us make better decisions, both within Elicit and beyond.

Why we're hiring for this role

We need someone to own the technical foundation of our auto-evaluation systems. Our evals are currently much slower than they need to be, and our interfaces aren't optimized for the diverse set of people who need to use them—ML engineers iterating on models, product managers monitoring quality, and customers assessing trust in results.

The right person for this role won't just build infrastructure. You'll think deeply about what it actually means for Elicit to help with decision-making in pharma and encode that understanding into our evaluation systems.

What you'll own

The core auto-eval platform

You'll build a comprehensive system that runs fast, is easy to use, and supports quickly building new evals:

  • Speed: You’ll build lightning-fast basic evals infrastructure that schedules tasks with practically no added latency. Then you’ll figure out clever ways to reduce the fundamental sources of latency (building a version of Elicit, running it on a query, and evaluating it using LMs).

  • Interfaces: ML engineers need evals to kick off automatically on relevant commits, with results they can see at a glance and drill into. Product managers need dashboards showing performance over time and what's going wrong in production.

  • Architecture: Your code must be well-architected so other team members and ML engineers can understand and build on it. An engineer starting on a new feature should be able to quickly add examples and run an eval.

Ensuring evaluations are accurate and reliable

  • We need to evaluate how well Elicit actually helps with decision-making in pharma, not just measure what's easy to measure. This requires encoding real knowledge about how pharma customers make decisions (for example, choosing appropriate gold standards).

  • You'll provide appropriate statistical tests and confidence intervals so we can trust our results.

A month in your life

In a typical month, expect to spend:

  • 60% working on the core eval platform

  • 15% working closely with the evals team to build and improve specific evals (e.g., an eval of our paper search within our systematic review flow)

  • 10% mentoring our evals engineering intern

  • The rest on learning how people interact with the eval system so you can make it work better for them, and understanding what our users want from Elicit so evals measure what matters

What you bring to the role

Requirements

  • At least 3 years of experience as a professional software engineer, with demonstrated experience building complex backend systems (e.g., backend for a complex website, data pipelines, etc.)

  • Aptitude and interest in evaluating how Elicit helps with pharma decision-making. There's no particular experience you must have, but we'll evaluate your aptitude.

Will make you more competitive for the role

  • Knowledge of statistics (e.g., for calculating power and credence intervals for evals)

  • Experience with advanced Python (asyncio/trio and parallel processing strategies)

  • Front-end experience and strong UX sensibility (you'll be building dashboards). TypeScript experience is a plus.

  • Experience building developer tools (ML engineers are one of your most important clients)

  • Previous experience as a data engineer or working on AI infrastructure

  • Knowledge of pharma/biomed

  • Experience evaluating ML systems

  • Experience building language-model-based systems (helps with understanding Elicit and how to evaluate it)

This is a diverse list of nice-to-haves. We expect the candidate we select to have some, but not all, of these. Other team members can fill in for skills you lack.

Location and travel

We have a lovely office in Oakland, CA, but we’re flexible about where you work. You’re welcome to work remotely, from our Oakland headquarters, or in a hybrid setup. The only in-person requirement is attending our quarterly team retreats, typically held on the West Coast.

Compensation, benefits, and perks

In addition to working on important problems as part of a productive and positive team, we also offer great benefits (with some variation based on location):

  • Flexible work environment: work from our office in Oakland or remotely with time zone overlap (between GMT and GMT-8), as long as you can travel for in-person retreats and coworking events

  • Fully covered health, dental, vision, and life insurance for you, and generous coverage for the rest of your family

  • Flexible vacation policy, with a minimum recommendation of 20 days/year + company holidays

  • 401(k) with a 6% employer match

  • A new Mac + $1,000 budget to set up your workstation or home office in your first year, then $500 every year thereafter

  • $1,000 quarterly AI Experimentation & Learning budget, so you can freely experiment with new AI tools, take courses, purchase educational resources, or attend AI-focused conferences and events

  • A team administrative assistant who can help you with personal and work tasks

For all roles at Elicit, we use a data-backed compensation framework to keep salaries market-competitive, equitable, and simple to understand. For this role, we target starting ranges of:

  • Career (L3): $140-170k + equity

  • Senior (L4): $165-200k + equity


About the job

Full-time
USA
Mid Level
$140k-$200k per year
Posted 3 weeks ago

Working Nomads curates remote digital jobs from around the web.

© 2026 Working Nomads.