Tech Lead - Site Reliability Engineer
About the role
Ditto is at an inflection point. As we scale to meet the growing demands of our enterprise customers, we need experienced SRE Leads to drive and mature our Site Reliability Engineering practice.
This is a unique opportunity to play a leading role in shaping enterprise-grade reliability, observability and incident management to ensure Ditto's systems meet the high standards our customers expect.
As a Lead SRE for one of our three globally distributed SRE squads, you'll set the standard for best-in-class reliability engineering, while leading and mentoring your squad members. You'll partner closely with product engineering teams to improve system resilience and operational excellence.
As a Lead Site Reliability Engineer, you will:
Line manage your regional squad of SREs, providing leadership and setting the standard for enterprise-ready reliability
Develop a high-performing team through mentoring, coaching, and creating growth opportunities for engineers
Engage with incident management and escalations, ensuring your squad sees continual improvement in incident response and actively owns follow-ups
Architect enterprise-grade observability solutions across complex distributed systems
Actively lead and manage SRE initiatives, co-ordinating across teams where needed
Guide the implementation of SLIs, SLOs and SLAs that align with business objectives
Establish best practices for documentation, runbooks, and knowledge sharing across engineering
Play an active role in on-call, and manage your squad’s rotation
What you'll need:
7+ years of experience in Site Reliability Engineering or similar DevOps roles with a focus on system reliability and incident management
3+ years of experience leading and mentoring technical teams
Strong experience with modern monitoring stacks including Prometheus, Grafana, and Datadog
Proficiency in at least one systems programming language, such as Go, Rust, C or Java
Experience with Infrastructure as Code tools, like Terraform and Helm
Hands-on experience architecting applications for Kubernetes, and managing Kubernetes infrastructure
Experience with AWS and at least one other major cloud service provider (GCP, Azure)
Excellent communication skills: you’ll set the standard for clear and succinct communication in incidents, hand-offs and project updates
Experience maintaining on-call rotations and incident response procedures
A high degree of agency, taking ownership of problems and identifying initiatives and improvements
Proven project management skills and the ability to balance competing priorities and interrupts
Understanding of security best practices in cloud environments
Nice to have:
Experience directly line managing SREs
Experience building or operating multi-tenant, multi-cloud SaaS/DBaaS platforms
Familiarity with edge computing or mesh networking
Experience instrumenting advanced observability practices (tracing, profiling) in distributed systems
Experience working with globally distributed teams across EMEA and APAC regions