Site Reliability Engineer

Twilio

Full-time

USA

$152k-$224k per year

engineer

software engineering

devops

java

python

See yourself at Twilio

Join the team as Twilio’s next Site Reliability Engineer on our Data Infrastructure Platform.

About the job

We are looking for a talented and experienced Software Engineer to join our Data Platform team. In this role, you will play a crucial part in designing, building, and optimizing our platform to support a wide range of data-driven initiatives. You will work closely with cross-functional teams to understand business requirements, architect scalable solutions, and implement data solutions and infrastructure for our Data Platform. The ideal candidate will have a passion for leveraging data to drive business impact, strong technical skills, and experience with modern data technologies.

Responsibilities

In this role, you’ll:

Design, build, and maintain infrastructure and scalable frameworks to support data ingestion, processing, and analysis.
Collaborate with stakeholders, analysts, and product teams to understand business requirements and translate them into technical solutions.
Architect and implement data streaming solutions using modern data technologies such as Kafka, AWS MSK, Terraform, Hive, Hudi, Presto, Airflow, and cloud-based services like AWS EKS, Lakeformation, Glue and Athena.
Design and implement frameworks and solutions for performance, reliability, and cost-efficiency.
Ensure data quality, integrity, and security throughout the data lifecycle.
Stay current with emerging big data technologies and best practices.
Mentor early-career engineers and contribute to a culture of continuous learning and improvement.

Qualifications

Twilio values diverse experiences from all kinds of industries, and we encourage everyone who meets the required qualifications to apply. If your career is just starting or hasn't followed a traditional path, don't let that stop you from considering Twilio. We are always looking for people who will bring something new to the table!

Required:

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
8+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering roles with a focus on infrastructure or backend systems.
Strong production experience, including operational management, scaling, partitioning strategies, and tuning for performance and reliability.
Hands-on experience with Kubernetes (preferably EKS), including deploying and managing stateful services and operators in Kubernetes environments.
Deep understanding of AWS cloud services, particularly those relevant to data infrastructure (e.g., EC2, EBS, S3, IAM, MSK, CloudWatch, VPC, ALB/NLB).
Proficiency in infrastructure-as-code tools, such as Terraform or CloudFormation, for managing and automating infrastructure.
Expertise in observability tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog) to monitor distributed systems and set up alerting for reliability and latency.
Proficient in at least one programming language (e.g., Go, Python, Java, or similar) for building automation, tooling, and contributing to platform services.
Experience designing and implementing incident response processes, SLOs/SLIs, runbooks, and participating in on-call rotations.
Strong understanding of distributed systems principles, including consensus, durability, throughput, and availability tradeoffs.
Proven track record of driving reliability improvements in high-scale, data-intensive systems and collaborating with platform and data engineering teams.
Excellent problem-solving and analytical skills.
Strong verbal & written communication skills, with the ability to work effectively in a cross-functional team environment.

Desired:

Experience with data technologies such as Apache Kafka, AWS MSK, Flink, and ClickHouse.
Bias to action and the ability to iterate and ship rapidly.
Passion for building data products, with prior projects in this area.

Location

This role will be remote, but is not eligible to be hired in CA, CT, NJ, NY, PA, WA.

Travel

We prioritize connection and opportunities to build relationships with our customers and each other. For this role, you may be required to travel occasionally to participate in project or team in-person meetings.

What We Offer

Working at Twilio offers many benefits, including competitive pay, generous time off, ample parental and wellness leave, healthcare, a retirement savings program, and much more. Offerings vary by location.

Compensation

*Please note this role is open to candidates outside of California, Colorado, Hawaii, Illinois, Maryland, Massachusetts, Minnesota, New Jersey, New York, Vermont, Washington D.C., and Washington State. The information below is provided for candidates hired in those locations only.

The estimated pay ranges for this role are as follows:

Based in Colorado, Hawaii, Illinois, Maryland, Massachusetts, Minnesota, Vermont or Washington D.C.: $152,500 - $190,600.
Based in New York, New Jersey, Washington State, or California (outside of the San Francisco Bay area): $161,500 - $201,800.
Based in the San Francisco Bay area, California: $179,400 - $224,200.
This role may be eligible to participate in Twilio’s equity plan and corporate bonus plan. All roles are generally eligible for the following benefits: health care insurance, 401(k) retirement account, paid sick time, paid personal time off, paid parental leave.

The successful candidate’s starting salary will be determined based on permissible, non-discriminatory factors such as skills, experience, and geographic location.

Applications for this role are intended to be accepted until Sep 30th, 2025, but this date may change based on business needs.
