Director - Reliability Engineering
POS-31619
Director, Reliability Engineering
Role Summary
Our mission at HubSpot is to help millions of organizations grow better. HubSpot’s engineering organization has grown to more than 2,000 engineers shipping across thousands of services and deploying thousands of times per day. As HubSpot has become core infrastructure for over 200,000 customers worldwide, reliability isn’t just a priority — it’s foundational to customer trust and business growth.
Our Reliability Engineering team has matured from an early SRE function into a strategic pillar within Platform Infrastructure. The team has driven a 76% reduction in critical incidents while the platform scaled 19x in deployables, established company-wide SLO frameworks, and built the incident management practices that keep HubSpot running.
Now we’re entering the next phase: leveraging AI and agentic approaches to fundamentally transform how we detect, respond to, and prevent outages. As Director of Reliability Engineering, you’ll lead this evolution — deepening our reliability capabilities, pioneering AI-assisted operations, and ensuring HubSpot remains a platform customers can confidently bet their business on.
What You’ll Do
Lead and Develop the Team
Lead a team of ~20 reliability engineers, fostering a culture of operational excellence, continuous learning, and customer obsession
Attract, develop, and retain top talent; build career paths that keep engineers engaged and growing
Own Reliability Strategy
Define and drive HubSpot's reliability roadmap, balancing proactive resilience investments with reactive incident reduction
Partner with Infrastructure leadership to prioritize reliability initiatives alongside cost, performance, and platform evolution
Set and evolve SLO standards that align engineering effort with customer experience
Pioneer AI-Driven Operations
Lead the strategy for integrating AI and agentic approaches into incident detection, diagnosis, and mitigation, reducing time-to-resolution and human toil
Explore and implement AI-assisted tooling for pattern recognition across incidents, automated runbook execution, and predictive reliability insights
Build intelligent systems that learn from our operational history, proactively surface risks, and recommend (or execute) mitigation actions
Balance automation with human judgment, designing systems where AI augments engineers rather than creating blind spots
Drive Company-Wide Impact
Own incident management end-to-end: response coordination, executive communication during major incidents, and blameless post-incident reviews that drive systemic improvement
Influence engineering culture across 100+ product teams, evangelizing reliability practices without compromising team autonomy
Identify systemic risks across the platform and drive cross-functional mitigation efforts
Represent Reliability at the Executive Level
Serve as the voice of reliability in leadership forums, translating technical risk into business terms
Communicate transparently with customers and stakeholders during and after operational incidents
Partner with peer directors across Infrastructure, Product Engineering, and Security to align on shared priorities
What You’ll Bring
Required Qualifications
10+ years of experience in software engineering, SRE, or infrastructure, with 5+ years leading teams
Track record of building and scaling reliability functions at companies with significant operational complexity
Deep technical fluency: you can dive into architecture discussions, incident analysis, and system design with credibility
Curiosity and vision for how AI/ML can transform operations; experience with or strong interest in AIOps, agentic automation, or ML-driven observability is a plus
Proven ability to drive cultural and process change across a large engineering organization without top-down mandates
Strong executive communication skills; comfortable leading incident bridges, presenting to leadership, and representing reliability externally
Experience with modern cloud infrastructure (AWS preferred), observability tooling, and incident management practices
A philosophy that balances reliability with velocity: you understand that the goal is sustainable speed, not gates
Why This Role
This is a high-visibility, high-impact leadership role at an inflection point. You'll own one of Infrastructure's four core pillars at a company where platform stability directly enables customer growth. You'll have the mandate to shape how AI transforms operational practices, not just at
