Senior Site Reliability Engineer
How You'll Make an Impact:
As a Sr. Site Reliability Engineer, you'll be the guardian of our platform's reliability and performance, ensuring millions of hospitality transactions flow seamlessly across the globe. You'll architect and implement scalable AWS cloud solutions that keep the most ambitious hotels running 24/7, while fostering a culture of automation, resilience, and continuous improvement across our engineering teams.
Our SRE Team:
We're a bottom-up, collaborative team that thrives on healthy debate and shared ownership of our infrastructure. You'll have endless opportunities to influence architecture decisions while working with cutting-edge cloud technologies at scale. We believe the best solutions come from engineers who are empowered to innovate, experiment, and challenge the status quo.
What You Bring to the Team:
Design and implement reliable and scalable AWS architecture to meet the needs of the organization.
Maintain and support highly loaded Kubernetes (EKS) clusters and infrastructure-related components.
Support the CI/CD process with ArgoCD and GitOps.
Automate the platform deployments with Terraform infrastructure-as-code.
Develop and continuously improve product Observability and Monitoring systems based on Grafana, Prometheus, Datadog, and CloudWatch.
Respond to incidents and participate in Incident Management and Root Cause Analysis, ensuring minimal impact on services.
Optimize system performance and troubleshoot issues as they arise.
Collaborate with development teams to establish monitoring best practices and ensure systems meet reliability targets.
Collaborate with security teams to implement and maintain security best practices.
Participate in the infrastructure support rotation, providing guidance to other engineering teams.
What Sets You Up for Success:
5+ years of experience as a DevOps engineer or SRE working within the AWS ecosystem.
5+ years of experience with Kubernetes (EKS) and Helm charts.
Experience designing, building, and supporting CI/CD pipelines with ArgoCD and GitHub Actions.
Experience with infrastructure-as-code methodologies using Terraform.
Experience with Observability and Monitoring using Grafana, Prometheus, Datadog, and CloudWatch.
Experience with Incident Management, full stack troubleshooting, performance analysis and root cause analysis (RCA).
Experience with Web application systems such as Nginx, Ingress controllers, load balancing and Content Delivery Networks.
Experience with Databases (MySQL, PostgreSQL, Aurora) and Middleware technologies (Redis, Memcached, and SQS).
Good networking skills with VPC, Security Groups and Network ACLs.
Ability to work remotely and manage your own time in a global team.
Good written and verbal communication in English.
Bachelor’s degree in Computer Science or equivalent experience.
Bonus Skills to Stand Out:
Advanced experience with Database Administration (Aurora, MySQL, PostgreSQL).
Experience working in a PCI-compliant environment.
Experience working with Kong API Gateway.
