Cloud Operations Engineer
Job Title: Cloud Operations Engineer
How You'll Make an Impact:
As a Cloud Operations Engineer, you’ll be the frontline support for our global infrastructure, playing a key role in ensuring 24/7 operational stability across our AWS-based environment. Your core responsibilities will include monitoring critical systems through platforms such as Datadog, PagerDuty, and CloudWatch, rapidly validating alerts, and escalating verified incidents based on clearly defined protocols.
You’ll execute operational tasks, follow documented procedures for common issues, and manage standard maintenance activities. You'll also have opportunities to collaborate directly with senior engineers across SRE, DevOps, and Infrastructure teams, contributing to the resolution of a wide range of technical challenges and gaining exposure to complex, real-world systems.
Acting as the central communication point during incidents, you’ll provide clear, timely updates to stakeholders and facilitate smooth handoffs between engineering and support teams.
Our Network Operations Team:
You’ll be joining a brand-new team at the ground level, helping shape the future of SaaS operations for a company undergoing exciting growth. Working closely with SRE, DevOps, Security, and various Workload teams, you’ll be at the heart of collaborative problem-solving and operational innovation. It’s a rare chance to build, influence, and grow in a highly visible and impactful role.
This role lets you gain deep, hands-on experience in cloud operations and incident management while working alongside high-performing engineering teams. You'll build the foundation for growth into specialized areas like SRE, DevOps, or Infrastructure Engineering, with direct exposure to real-world systems at scale.
What You Bring to the Team:
Support Kubernetes (EKS) environments by performing operational checks, validating pod health, reviewing logs, and assisting with incident triage during deployments and scaling events
Assist with CI/CD pipeline operations by supporting deployments, rollbacks, and release verification in collaboration with DevOps and platform engineering teams using ArgoCD and GitHub
Execute Infrastructure as Code changes and standard operating procedures using Terraform across cloud infrastructure and application services
Monitor, triage, and validate incidents using observability and alerting tools such as PagerDuty, Datadog, Amazon CloudWatch, Prometheus, and Grafana, escalating to SRE, DevOps, or application teams as appropriate
Execute documented runbooks and SOPs to resolve common operational issues, including basic AWS troubleshooting, infrastructure access requests (SSO, VPN, IAM), and deployment support
Perform routine operational tasks such as configuration changes, maintenance activities, and standard change requests across cloud infrastructure and application services
Contribute to operational excellence by maintaining and improving runbooks, updating documentation, and participating in post-incident reviews (RCA) to drive reliability improvements
What Sets You Up for Success:
3-4 years of hands-on experience in DevOps, Site Reliability Engineering (SRE), or related operational roles with a focus on cloud infrastructure
Practical experience with Amazon EKS (Elastic Kubernetes Service) or other managed Kubernetes platforms, including container orchestration and operational management
Hands-on experience with CI/CD and GitOps deployment tools, particularly ArgoCD, Flux, or similar automation platforms
Experience using Infrastructure as Code tools, specifically Terraform, for managing and automating cloud infrastructure
Foundational understanding of the AWS service ecosystem, including core infrastructure services (EC2, S3, RDS, IAM, VPC)
Strong written and verbal communication skills in English, with the ability to provide clear, timely updates during high-pressure incidents
Detail-oriented, with strong documentation skills and the ability to collaborate effectively across multiple teams in a fully remote, global environment
Bonus Skills to Stand Out (Optional):
Experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, or Amazon CloudWatch
Prior experience working in a 24/7 operations environment with hands-on use of PagerDuty or similar on-call and alerting systems
Ability to write (not just read) Bash or Python scripts for automation tasks
