Intermediate Site Reliability Engineer - Tenant Scale: Tenant Services
An overview of this role
As a Site Reliability Engineer (SRE) at GitLab, you keep GitLab.com and other production systems running smoothly for millions of users by combining pragmatic operations with strong software engineering practices. You focus on the systems layer (operating systems, storage, networking), edge services, and Kubernetes workloads, designing and operating highly scalable, reliable, and secure infrastructure that supports one of the largest single-tenancy open source SaaS sites on the Internet. You’ll work across the Infrastructure organization to automate away toil, improve availability and performance, and respond to incidents during your local daytime hours as part of a globally distributed on-call rotation. In this role, you’ll help Tenant Services safeguard and scale customer data while increasing automation so GitLab can continue to grow with enterprise-level expectations for reliability and availability.
What you’ll do
Design and implement highly scalable infrastructure for GitLab.com to support current and future growth.
Collaborate with cross-functional teams across the Infrastructure organization to plan and deliver projects that shape GitLab’s platform direction.
Operate and improve edge services and Kubernetes workloads, acting as a subject matter expert within the Infrastructure organization.
Participate in a global on-call rotation during your local daytime hours, respond to production incidents, and contribute to clear, constructive incident reviews.
Reduce toil by automating operational tasks and building tools that improve reliability, availability, and scalability.
Apply infrastructure as code and configuration management practices to manage cloud resources and environments consistently.
Write and maintain production-quality code, preferably in Go or Ruby, to enhance our systems and automation toolchain.
What you’ll bring
Background working with the Kubernetes ecosystem, including tools such as Helm, and running production workloads.
Experience operating cloud infrastructure on platforms like Google Cloud Platform or Amazon Web Services, especially networking, hosted Kubernetes services, and scaling.
Hands-on practice with infrastructure as code and configuration management tools such as Ansible or Chef.
Strong programming skills in a modern language, preferably Go or Ruby, applied to automation and reliability problems.
Ability to clearly define problems, think beyond short-term fixes, and design solutions that improve systems over time.
Consistent focus on reducing toil through automation and thoughtful system design.
Independent, proactive working style with a bias for action and comfort operating as a “manager of one” in a distributed, asynchronous environment.
Clear written and verbal communication skills, with openness to candidates who bring transferable experience from related reliability, infrastructure, or platform roles.
About the team
Tenant Services is the team responsible for safeguarding and securing customer data stored by the GitLab application and for setting clear guidelines for how that data is accessed. The team runs the largest GitLab instance in existence, and one of the largest single-tenancy open source SaaS sites on the Internet, which means you’ll work on unique scale and reliability challenges that impact users every day. As an all-remote, globally distributed group, Tenant Services collaborates asynchronously across time zones and leans heavily on automation to meet enterprise expectations for reliability, availability, and data protection while continuing to scale. For more on how this team works, see our Team Handbook page.
