Intermediate Site Reliability Engineer - Database Operations

GitLab

Full-time
Canada, New Zealand
operations
engineer
devops
sql
architecture

Int. Site Reliability Engineer: Database Operations

An overview of this role

Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other GitLab production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople that apply sound engineering principles, operational discipline, and mature automation to our environments and the GitLab codebase. We specialize in systems, whether it be networking, the Linux kernel, or some more specific interest in scaling, algorithms, or distributed systems.

The Database Operations team’s mission is to build, run, own and evolve the entire lifecycle of the PostgreSQL database engine for GitLab.com. The team is focused on owning the reliability, scalability, evolution, performance and security of the database engine and its supporting services. Where appropriate, the team should build its services on top of Reliability::Foundations services and cloud vendor managed products to reduce complexity, improve efficiency and deliver new capabilities more quickly.

GitLab.com is a unique site and it brings unique challenges: it’s the biggest GitLab instance in existence. In fact, it’s one of the largest single-tenancy open-source SaaS sites on the internet. The experience of our team feeds back into other engineering groups within the company, as well as to GitLab customers running self-managed installations.

Responsibilities

  • Automate every operational task; this is a core requirement for this role. Examples include package updates, configuration changes across all environments, and tools for automatic provisioning of user-facing services.

  • Respond to platform emergencies, alerts, and escalations from Customer Support.

  • Ensure systems exist to manage software life-cycles (e.g. Operating Systems) with a minimum of manual effort.

  • Develop a fully automated multi-environment observability stack based on the existing SaaS system, and extend it to predict capacity needs based on the usage patterns.

  • Plan for new service roll-outs, expansion and capacity management of existing services, and work with users to optimize their resource consumption.

As an SRE you will:

  • Work on database reliability and performance aspects for GitLab.com from within the SRE team as well as work on shipping solutions with the product.

  • Analyze solutions and implement best practices for our PostgreSQL database clusters and their components.

  • Work on observability of relevant database metrics and make sure we reach our database objectives.

  • Work with peer SREs to roll out changes to our production environment and help mitigate database-related production incidents.

  • Provide on-call support in rotation with the team.

  • Provide database expertise to engineering teams (for example through reviews of database migrations, queries and performance optimizations).

  • Work on automation of database infrastructure and help engineering succeed by providing self-service tools.

  • Use the GitLab product to run GitLab.com as a first resort and improve the product as much as possible.

  • Plan the growth of GitLab's database infrastructure.

  • Design, build and maintain core database infrastructure components that allow GitLab to scale to support hundreds of thousands of concurrent users.

  • Support and debug database production issues across services and levels of the stack.

  • Make monitoring and alerting trigger on symptoms and not on outages.

  • Document every action so your learnings turn into repeatable actions and then into automation.

You may be a fit for this role if you:

  • Have primary experience running PostgreSQL in high-growth, large production environments using both self-managed deployments (VM, Kubernetes with modern PostgreSQL Operators) and DBaaS services.

  • Have hands-on experience using data from PostgreSQL internals to design, build and troubleshoot systems.

  • Have primary experience with infrastructure automation, orchestration and configuration management (Chef, Ansible, Puppet, Terraform).

  • Have a solid understanding of SQL and PL/pgSQL.

  • Have significant experience working in a large SaaS distributed-systems production environment.

  • Share our values, and work in accordance with those values.

  • Have excellent written and verbal English communication skills, with an urge to collaborate and communicate asynchronously.

  • Have an urge to document all the things so you don't need to learn the same thing twice, and an urge for delivering quickly and iterating fast.

  • Have a proactive, go-for-it attitude. When you see something broken, you can't help but fix it.

  • Have solid data modeling and data structure design skills.

  • Bonus: Solid programming skills as a (former) backend engineer, preferably with Ruby and/or Go.

  • Bonus: Experience with ClickHouse or another modern OLAP database.

Projects you could work on:

  • Review, analyze and implement solutions regarding database administration (e.g., backups, performance tuning).

  • Work with Ansible, Terraform, Chef and other tools to build mature automation (automate setup of new replicas or testing and monitoring of backups).

  • Implement self-service tools for our engineers using GitLab ChatOps.

  • Provide technical assistance and support to other teams on database and database-related application design methodologies, system resources, and application tuning.

  • Review database-related changes from engineering teams (e.g., database migrations).

  • Recommend query and schema changes to optimize the performance of database queries.

  • Jump on a production incident to mitigate database-related issues on GitLab.com.

  • Participate actively in the infrastructure design and scalability considerations focusing on data storage aspects.

  • Make sure we know how to take the next step to scale the database.

  • Design and develop specifications for future database requirements including enhancements, upgrades, and capacity planning; evaluate alternatives; and make appropriate recommendations.

Intermediate Site Reliability Engineer Criteria

Technical:

  • Expertise in at least 1 area of SRE work, with general knowledge of all areas.

  • Capable of mentoring Junior team members.

  • Contributes small improvements to the GitLab codebase to resolve issues.

Execution:

  • Identifies projects that result in substantial cost savings or revenue.

  • Identifies changes for the product architecture from the reliability, performance and availability perspective with a data-driven approach.

  • Proactively works on efficiency and capacity planning to set clear requirements and reduce system resource usage, making GitLab cheaper to run for all our customers.

  • Identifies parts of the system that do not scale, provides immediate palliative measures, and drives long-term resolution of these incidents.

  • Identifies Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.

Collaboration and Communication:

  • Ability to thrive in a fully remote, asynchronous work environment that places a high emphasis on documentation and written communication.

  • Develop expertise in a domain and radiate that knowledge.

  • Participate in blameless RCAs on incidents and outages, looking for answers that will prevent the incident from ever happening again.

Influence and Maturity:

  • Lead Junior SREs by setting the example.

  • Develop ownership of a major part of the infrastructure.

  • Trusted to de-escalate conflicts inside the team.

Performance Indicators

Site Reliability Engineers have the following job-family performance indicators:

  • GitLab.com Availability

  • GitLab.com Performance

  • Apdex and Error SLO per Service

  • Mean Time to Detection

  • Mean Time to Resolution

  • Mean Time Between Failure

  • Mean Time to Production

  • Disaster Recovery Time to Recovery


About the job

Full-time
Canada, New Zealand
Posted 3 weeks ago


Working Nomads curates remote digital jobs from around the web.

© 2025 Working Nomads.