Reliability Architect

Twilio

Full-time

USA, Canada

devops

aws

architecture

saas

cloud

See yourself at Twilio

Join the team as Twilio’s next Reliability Architect.

About the job

As an Architect in SRE, you will drive the technical strategy, vision and outcomes for Twilio’s Reliability Engineering organization. You will define and lead solutions and initiatives that ensure Twilio products are reliable worldwide, and you will define standards and guide engineering teams on best practices for designing, building, and operating resilient systems. This role is pivotal to Twilio’s commitment to operational excellence, scalability, and pragmatic, large-scale systems design in the cloud.

Responsibilities

In this role, you’ll:

Partner with senior technical leaders across Twilio to set and communicate the reliability strategy, translating business goals into measurable outcomes.
Influence company-wide architectural decisions while balancing long-term vision with near-term and compliance needs.
Lead the design, implementation, and operation of scalable solutions and paved roads that enable reliable, high-traffic services.
Influence company-wide architectural decisions to focus on availability, performance, resilience, and cost efficiency using Kubernetes, AWS, Terraform, and modern observability.
Ensure integrity and quality across the service lifecycle; design fault-tolerant architectures, incident response, disaster recovery, and capacity/cost management.
Collaborate with product and cross-functional teams to identify reliability risks and convert them into actionable designs, programs, and tooling.
Establish and champion reliability practices and drive systemic improvements.
Mentor and grow engineers and technical leaders.
Track and apply emerging SRE, cloud, and large-scale systems best practices; introduce pragmatic innovations that improve reliability at scale.

Qualifications

Twilio values diverse experiences from all kinds of industries, and we encourage everyone who meets the required qualifications to apply. If your career is just starting or hasn't followed a traditional path, don't let that stop you from considering Twilio. We are always looking for people who will bring something new to the table!

Required:

Excellent generalist knowledge of software engineering and delivery.
In-depth understanding of the role of Reliability Engineering in a large and diverse SaaS organization.
Previous experience driving cross-org technical architecture outcomes.
Knowledge of cloud architecture, DevOps practices, and large-scale systems design with microservices.
Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience).
10+ years of experience in Reliability Engineering, DevOps, or Software Engineering roles with a focus on infrastructure, backend systems, and reliability.
Strong production experience, including operational management, scaling, partitioning strategies, and tuning for performance and reliability in high-scale environments.
Hands-on experience with Kubernetes (e.g., EKS), deploying and managing stateful services, and cloud services like AWS.
Proficiency in infrastructure-as-code tools such as Terraform or CloudFormation for automating infrastructure.
Expertise in observability tools (e.g., Prometheus, Grafana, Datadog) for monitoring distributed systems and setting up alerting.
Proficiency in at least one programming language (e.g., Go, Python, Java) for building automation and tooling.
Experience designing incident response processes, SLOs/SLIs, runbooks, and participating in on-call rotations.
Experience running cross-functional post-incident reviews and driving improvements.
Strong understanding of distributed systems principles, including consensus, durability, throughput, and availability tradeoffs.
Proven track record of leading reliability improvements in data-intensive or mission-critical systems and collaborating with engineering teams.
Excellent problem-solving, analytical, verbal, and written communication skills, with the ability to work in cross-functional and distributed environments.
Demonstrated leadership in mentoring teams, influencing decisions, and balancing long-term objectives with short-term needs.
Excellent written and verbal communication skills.
Ability to influence and build effective working relationships with all levels of the organization.

Desired:

Specific experience owning and operating large AWS footprints.
Knowledge of Kubernetes architecture and concepts.
Experience with data technologies like Apache Kafka, AWS MSK, or similar for reliable streaming.
Passion for building reliable products, with prior projects in high-availability systems.

Location

This role will be remote and based in Ireland.

Travel

We prioritize connection and opportunities to build relationships with our customers and each other. For this role, you may be required to travel occasionally to participate in project or team in-person meetings.

What We Offer

Working at Twilio offers many benefits, including competitive pay, generous time off, ample parental and wellness leave, healthcare, a retirement savings program, and much more. Offerings vary by location.
