Member of Technical Staff: AI Performance

DevRev

Full-time
Philippines
Python
Kubernetes

What You’ll Do

  • Design and implement agent evaluation pipelines that benchmark AI capabilities across real-world enterprise use cases (see the sketch after this list).

  • Build domain-specific benchmarks for product support, engineering ops, GTM insights, and other verticals relevant to modern SaaS.

  • Develop performance benchmarks that measure and optimize for latency, safety, cost-efficiency, and user-perceived quality.

  • Create search- and retrieval-oriented benchmarks, including multilingual query handling, annotation-aware scoring, and context relevance.

  • Partner with AI and infra teams to instrument models and agents with detailed telemetry for outcome-based evaluation.

  • Drive human-in-the-loop and programmatic testing methodologies for fuzzy metrics like helpfulness, intent alignment, and resolution effectiveness.

  • Contribute to DevRev’s open evaluation tooling and benchmarking frameworks, shaping how the broader ecosystem thinks about SaaS AI performance.
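
As a rough illustration only, and not DevRev’s actual tooling, here is a minimal sketch in Python of what such an agent evaluation pipeline can look like. The task cases, the stub agent, and the pass/fail graders are all hypothetical placeholders:

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class EvalCase:
    """One benchmark case: an input task plus a programmatic grader."""
    task: str
    grade: Callable[[str], bool]  # returns True if the agent's output passes

@dataclass
class EvalReport:
    total: int = 0
    passed: int = 0
    failures: List[Tuple[str, str]] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0

def run_pipeline(agent: Callable[[str], str], cases: List[EvalCase]) -> EvalReport:
    """Run every case through the agent and grade each output."""
    report = EvalReport()
    for case in cases:
        output = agent(case.task)
        report.total += 1
        if case.grade(output):
            report.passed += 1
        else:
            report.failures.append((case.task, output))
    return report

# Hypothetical usage: a stub "agent" and two support-domain cases.
cases = [
    EvalCase("Summarize ticket #123", lambda out: "summary" in out.lower()),
    EvalCase("Route a billing question", lambda out: "billing" in out.lower()),
]
stub_agent = lambda task: f"Summary: routed '{task}' to the billing queue"
report = run_pipeline(stub_agent, cases)
print(f"pass rate: {report.pass_rate:.0%} ({report.passed}/{report.total})")

In practice the graders would be richer (LLM-as-judge, human review, latency and cost counters), but the shape stays the same: cases in, graded outputs and an aggregate report out.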

 

What We’re Looking For

  • 3–7 years of experience in systems, infra, or performance engineering roles with strong ownership of metrics and benchmarking.

  • Fluency in Python and comfort working across full-stack and backend services.

  • Experience building or using LLMs, vector-based search, or agentic frameworks in production environments.

  • Familiarity with LLM model serving infrastructure (e.g., vLLM, Triton, Ray, or custom Kubernetes-based deployments), including observability, autoscaling, and token streaming.

  • Experience working with model tuning workflows, including prompt engineering, fine-tuning (e.g., LoRA, DPO), or evaluation loops for post-training optimization.

  • Deep appreciation for measuring what matters — whether it’s latency under load, degradation in retrieval precision, or regression in AI output quality.

  • Familiarity with evaluation techniques in NLP, information retrieval, or human-centered AI (e.g., RAGAS, Recall@K, BLEU); a worked Recall@K example follows this list.

  • Strong product and user intuition — you care about what the benchmark represents, not just what it measures.

Bonus: experience contributing to academic or open-source benchmarking projects.
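
Since Recall@K is named above, here is a short worked example in Python; the document IDs and relevance labels are invented for illustration:

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Hypothetical run: 2 of the 3 relevant docs land in the top 5.
retrieved = ["d7", "d2", "d9", "d1", "d4", "d3"]
relevant = {"d1", "d2", "d3"}
print(recall_at_k(retrieved, relevant, k=5))  # 2/3 ≈ 0.67

Metrics like BLEU or the RAGAS suite involve more machinery, but this is the level of precision the role's benchmarks would pin down.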

 

Why This Role Matters

  • Agents are not APIs — they reason, adapt, and learn. But with that power comes ambiguity in how we measure success. At DevRev, we believe the benchmarks of the past aren’t enough for the software of the future.

  • This role is your opportunity to design the KPIs of the AI-native enterprise — to bring rigor to systems that reason, and structure to software that thinks.

  • Join us to shape how intelligence is measured in SaaS 2.0.
