Principal Site Reliability Engineer

Upwork

Full-time
USA
engineer
devops
python
javascript
cloud
The job listing has expired. Unfortunately, the hiring company is no longer accepting new applications.

Upwork ($UPWK) is the world’s work marketplace. We serve everyone from one-person startups to over 30% of the Fortune 100 with a powerful, trust-driven platform that enables companies and talent to work together in new ways that unlock their potential. Last year, more than $3.8 billion of work was done through Upwork by skilled professionals who are gaining more control by finding work they are passionate about and innovating their careers.

This is an engagement through Upwork’s Hybrid Workforce Solutions (HWS) Team, a global group of professionals that supports Upwork’s business. HWS team members are located all over the world.

This is an opportunity to work on a major revenue-producing website with millions of users. In addition to making sure everything works, you are expected to contribute to the continuous improvement of our environment. This is a full-time position (~40 hours per week, Monday-Friday). The role participates in our production on-call rotation during your daytime hours and on some weekends (once every 2-3 weeks).

Work/Project Scope:

  • Serve as a technical leader in modern SRE practices with a focus on zero-trust infrastructure, platform observability, and cloud-native scalability.

  • Guide the architectural evolution of reliability systems, including multi-cluster Kubernetes environments, GitOps workflows, and service mesh integration.

  • Champion SLO-driven engineering across teams and establish frameworks for defining, tracking, and enforcing reliability standards.

  • Partner with platform and security teams to enable service-to-service authentication, policy enforcement, and resilient control planes.

  • Develop AI-assisted tools and workflows (e.g., for incident triage, RCA generation, auto-remediation) to reduce operational burden and accelerate resolution.

  • Define and maintain end-to-end observability strategies including distributed tracing, metrics pipelines, and log enrichment.

  • Drive infrastructure automation efforts using IaC best practices, with an emphasis on policy-as-code, workload identity, and platform governance.

  • Lead post-incident reviews and reliability audits to surface systemic gaps and drive continuous improvement.

  • Mentor engineers across infrastructure and application teams on designing and operating reliable, scalable systems.

Must Haves (Required Skills):

  • 10+ years in SRE, DevOps, or production engineering roles, including experience operating large-scale distributed systems in production

  • Deep expertise in Kubernetes operations, including multi-cluster orchestration, service mesh (Istio or equivalent), and workload policy management (e.g., OPA, Kyverno)

  • Proven experience building and maintaining GitOps pipelines using tools like ArgoCD or Flux

  • Strong fluency in observability tooling (e.g., Prometheus, OpenTelemetry, Grafana, or Datadog), with a focus on SLO-based alerting and incident detection

  • Familiarity with reliability-as-code practices and automation using scripting languages (Python, Go, or Bash) and AI-enhanced workflows (e.g., Cursor, incident bots, PR-generating agents)

  • Experience designing and enforcing zero-trust service-to-service authentication, workload identity, and mTLS policies

  • Track record of leading incident review programs, standardizing postmortems, and driving systemic reliability improvements

  • Ability to work cross-functionally with platform, security, and developer enablement teams to embed resilience across the SDLC

Upwork is proudly committed to fostering a diverse and inclusive workforce. We never discriminate based on race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical condition), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics.   

About the job

Posted 4 weeks ago
Working Nomads curates remote digital jobs from around the web.

© 2025 Working Nomads.