Senior Site Reliability Engineer - Database Operations
An overview of this role
You'll join the Database Operations team as a Senior Site Reliability Engineer focused on keeping PostgreSQL—the backbone of GitLab.com—running smoothly at massive scale. GitLab.com is one of the largest single-tenancy open source SaaS sites in the world, which means you'll tackle genuinely novel challenges around reliability, performance, and scalability that few engineers get to solve. You'll own the full lifecycle of our database infrastructure: automating operational tasks, responding to production incidents, building observability systems that predict capacity needs, and working with engineering teams to optimize how they use our databases. Your work directly impacts millions of developers using GitLab.com and feeds back into products for customers running Dedicated, Self-Managed, and future Cells installations.
Some examples of work you'll do:
Build mature infrastructure automation using Terraform, Chef, and Ansible to manage database operations at scale with minimal manual effort.
Respond to production incidents, debug complex database issues, and implement solutions that prevent them from happening again.
Design and implement self-service tools that help engineering teams succeed with their database needs, reducing friction across the organization.
What you’ll do
Automate operational tasks across all environments, from package updates and configuration changes to provisioning user-facing services, reducing manual overhead and improving system reliability.
Build and evolve the observability stack for PostgreSQL at GitLab.com, ensuring we monitor the right metrics and can predict capacity needs based on usage patterns.
Respond to platform emergencies, alerts, and escalations from Customer Support, working with database reliability engineers and peer SREs to resolve production incidents.
Plan and execute database infrastructure growth, capacity expansions, and service rollouts while working with engineering teams to optimize resource consumption.
Develop and maintain automation tools for database infrastructure that enable self-service capabilities for engineering teams, reducing operational friction.
Analyze and implement best practices for PostgreSQL database clusters and their components, focusing on reliability, performance, and security.
Participate in the on-call rotation supporting GitLab.com's production database systems, taking ownership of triaging, mitigating, and learning from incidents.
Document all operational actions and learnings so they become repeatable processes and eventually automated solutions, improving institutional knowledge across the team.
What you’ll bring
Experience working as an SRE supporting database operations, with demonstrated expertise managing production database systems.
Hands-on experience running PostgreSQL at scale in large production environments, including understanding of performance tuning, replication, and backup strategies.
Proficiency with infrastructure automation and configuration management tools such as Terraform, Chef, Ansible, or Puppet.
Solid understanding of SQL and PL/pgSQL, with strong data modeling and data structure design skills.
Experience working in large-scale distributed systems production environments, particularly in SaaS settings, with comfort responding to production incidents and on-call responsibilities.
A proactive mindset focused on automation and documentation—you see problems as opportunities to build sustainable solutions rather than temporary fixes.
Comfort working asynchronously across distributed teams and a commitment to GitLab's values of collaboration, transparency, and iteration.
Experience with backend programming languages such as Ruby or Go is valuable but not required; we welcome candidates with diverse technical backgrounds and transferable skills.
About the team
We are responsible for building, running, and evolving the entire lifecycle of the PostgreSQL database engine that powers GitLab.com. You’ll be part of a team that owns the reliability, scalability, performance, and security of our database infrastructure and supporting services. GitLab.com is one of the largest single-tenancy open source SaaS sites on the internet, which means your work directly impacts hundreds of thousands of concurrent users worldwide. We operate in a fully distributed, asynchronous environment across multiple regions, collaborating on everything from database automation and infrastructure design to incident response and capacity planning. You’ll be solving novel challenges at scale, from implementing observability stacks that predict capacity needs to designing the infrastructure components that allow GitLab to scale reliably. We continuously seek to reduce complexity and improve efficiency by leveraging cloud vendor managed products and services where appropriate, ensuring GitLab.com remains a best-in-class production environment. For more on how we operate, see the Database Operations Team Handbook page.
