Senior Site Reliability Engineer

Roadie

Full-time

USA

engineer

ruby on rails

devops

python

docker

Roadie, a UPS company, is a leading logistics and delivery platform that helps businesses tackle the complexities of modern retail with unmatched delivery coverage, flexibility and visibility. Reaching 97% of U.S. households across more than 30,000 zip codes — from urban hubs to rural communities — Roadie provides seamless, scalable solutions that meet a variety of delivery needs.

With a network of more than 310,000 independent drivers nationwide, Roadie offers flexible delivery solutions that make complex logistics challenges easy, including solutions for local same-day delivery, delivery of big and bulky items, ship-from-store and DC-to-door.

Roadie is seeking a Senior Site Reliability Engineer to join our growing Technical Operations Team. We are looking for a candidate who has experience implementing site reliability principles, as well as production-level Kubernetes experience. The ideal candidate is a skilled problem solver with in-depth knowledge of site reliability practices, standard DevOps principles, AWS, scripting languages and Kubernetes.

What You'll Do

Build systems that optimize the uptime and reliability of our platform, and support the management and optimization of our software delivery pipeline, observability and infrastructure operations
Maintain, support, and engineer production and non-production Kubernetes Clusters (EKS) as well as ES, MSK, RDS, and EC (Redis) clusters
Deploy and maintain monitoring and logging solutions based on Prometheus, Loki, Thanos, Grafana, OpenTelemetry and New Relic
Collaborate with cross-functional teams to identify and address potential bottlenecks, optimize resource utilization, and proactively prevent system failures
Define and manage SLOs, SLIs and error budgets
Develop processes, tools and automation to reduce toil across engineering teams
Plan and forecast service capacity and demand, assess cost optimization, and tune systems and software
Debug production/non-production issues
Take part in 24/7 on-call rotation

Technology We're Using Now

Python, Ruby on Rails, Golang
React/Redux, Objective-C and Swift, Android
Postgres, Redshift, Redis, Kafka
AWS/GCP
Docker/Kubernetes
OpenTelemetry/Prometheus/Thanos/Loki/Grafana/New Relic/Sentry
Git/CircleCI
ArgoCD

What You Bring

6+ Years in various SRE roles
6+ Years in various DevOps/System Engineering roles
6+ Years of experience building and managing production Kubernetes infrastructure
6+ Years of experience with popular scripting languages (Python, Ruby, Bash, etc.)
Experience with Infrastructure as Code tools such as Terraform or Crossplane
Experience with CI/CD development tools (CircleCI, etc.)
Experience with GitOps tools (ArgoCD)
Experience using a broad range of AWS technologies (RDS, Elasticsearch, VPC, EKS, S3, CloudFront, MSK, ElastiCache, CloudWatch, etc.)
Experience developing and maintaining YAML templating systems (Helm charts, Kustomize, etc.)
Must be able to work independently, be self-motivated and handle multiple priorities
Comfortable working in a fast-paced agile environment

Finally, a willingness to admit what you don’t know, and learn what you need to learn quickly.

Why Roadie?

Competitive compensation packages
100% covered health insurance premiums for yourself
401k with company match
Tuition and student loan repayment assistance (that’s right - Roadie will contribute directly to your existing student loans!)
Flexible work schedule with unlimited PTO
Monthly 3-day weekends
Monthly WFH stipend
Paid sabbatical leave: tenured team members are given time to rest, relax, and explore
The technology you need to get the job done
