Senior Site Reliability Engineer (SRE)

Finite State

Full-time
USA, Canada
$215k-$250k per year
engineer
aws
architecture
saas
cloud

About the Role

We are seeking a Senior Site Reliability Engineer (SRE) / Infrastructure Engineering leader to define, architect, and drive a modern observability and reliability strategy for an AI-first development organization. This is a highly impactful technical leadership role responsible for establishing best-in-class operational practices, reliability standards, and AI-enabled infrastructure automation across our product ecosystem.

This individual will bring deep experience in reliability engineering, distributed systems, and production operations, along with a forward-thinking approach to AI-assisted development and infrastructure-as-code.

If you are passionate about building resilient systems, defining SLOs that actually matter, and leveraging AI tooling to accelerate operational excellence, this role is for you.

What You’ll Do

Observability & Reliability Leadership

  • Leverage AI tools and agentic processes to drive observability, quality, responsiveness, and operational clarity.

  • Design modern telemetry pipelines (metrics, logs, traces, events) for distributed systems and AI-driven workloads.

  • Define and implement a comprehensive observability framework across applications and infrastructure.

  • Establish and operationalize meaningful SLIs, SLOs, and SLAs aligned with business objectives.

  • Lead the adoption and optimization of observability tooling including Honeycomb, Grafana, and related telemetry platforms.

  • Drive best practices in error budgeting, alert design, and production health monitoring.

Operational Excellence

  • Define and evolve incident management processes, including:

    • On-call structures and escalation models

    • Postmortems and blameless retrospectives

    • Runbooks and operational playbooks

  • Improve system reliability, performance, scalability, and cost efficiency.

  • Establish operational KPIs and reliability dashboards for engineering and leadership visibility.

  • Lead reliability reviews for new architecture and product initiatives.

Infrastructure Engineering

  • Architect and implement scalable cloud infrastructure primarily within AWS.

  • Work closely with modern application platforms such as Vercel and Supabase.

  • Implement and improve Infrastructure-as-Code practices.

  • Leverage AI-assisted tooling to accelerate infrastructure design, validation, and automation.

  • Ensure production-grade security, compliance, and resilience standards.

AI-First Enablement

  • Champion the use of AI tools to:

    • Accelerate infrastructure provisioning

    • Improve operational workflows

    • Enhance observability signal quality

    • Automate incident response and remediation

  • Partner with AI-focused product teams to ensure observability supports model performance, experimentation, and reliability.

Technical Leadership

  • Serve as a senior technical authority for reliability and infrastructure decisions.

  • Mentor engineers on production best practices.

  • Influence architectural decisions to improve system resilience and maintainability.

  • Drive a culture of reliability, accountability, and continuous improvement.

What You Bring

Experience

  • 10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Production Engineering.

  • Proven experience defining and implementing SLOs, SLAs, SLIs, and error budget frameworks at scale.

  • Deep experience building and managing on-call rotations and incident management processes.

  • Strong background in distributed systems and cloud-native architectures.

Technical Expertise

  • Hands-on experience with:

    • Honeycomb

    • Grafana

    • AWS

    • Vercel

    • Supabase

  • Strong experience with observability instrumentation and telemetry design.

  • Infrastructure-as-Code experience (e.g., Terraform, Pulumi, or similar).

  • Experience designing resilient CI/CD pipelines.

  • Deep understanding of high-availability, scalability, and performance engineering principles.

AI & Automation

  • Demonstrated experience leveraging AI tools (Cursor, Claude, Codex, etc.) in development or infrastructure workflows.

  • Experience using AI-assisted tooling to generate, validate, or optimize infrastructure configurations.

  • Strong interest in building AI-native operational practices.

Leadership & Communication

  • Ability to operate as both strategic architect and hands-on implementer.

  • Strong written and verbal communication skills.

  • Experience influencing cross-functional teams.

  • Comfort working in fast-paced, high-growth environments.

Nice to Have

  • Experience supporting AI/ML workloads in production.

  • Experience building internal developer platforms (IDP).

  • Experience with cost observability and FinOps practices.

  • Experience scaling observability in high-growth SaaS environments.

What Success Looks Like in the First 6 Months

  • Clear SLO framework implemented across core services.

  • Observability tooling standardized and adopted organization-wide.

  • On-call and incident management processes running smoothly with measurable improvements.

  • AI-driven infrastructure workflows reducing operational toil.

  • Increased system reliability and reduced mean time to detection (MTTD) and recovery (MTTR).

Compensation

Our salary ranges are categorized into two tiers based on geographic location:

  • Tier 1 (San Francisco, New York, Seattle): $230,000 - $250,000

  • Tier 2 (All Other Locations): $215,000 - $240,000

The final base salary will be determined by experience, skill set, and specific location. In addition to base pay, this role is eligible for equity and benefits.

Working Nomads curates remote digital jobs from around the web.

© 2026 Working Nomads.