Sr. Site Reliability Engineer I
Position Summary:
The Sr. Site Reliability Engineer I is a proactive, disciplined, and collaborative individual with a focus on ensuring the reliability and performance of Pax8 services. They work on enabling teams with observability solutions, supporting the development lifecycle, and maintaining robust cloud infrastructure. The Engineer is involved in developing tools and processes to enhance service stability and availability. They bring a development mindset, excellent debugging skills, and a strong focus on automation and operational excellence. Their goal is to simplify system complexity by standardizing and streamlining technical solutions, maintaining consistency in common patterns, and minimizing redundant sprawl.
Essential Responsibilities:
Increase developer velocity and system reliability by applying software development expertise, collaborating with engineering teams to address reliability concerns, and analyzing the sources of issues and their impact on cloud infrastructure, so that the engineering community can work in a reliable, scalable environment (25%)
Standardize and implement baseline visibility across systems. Leverage programmatic monitoring to proactively address visibility gaps. Collaborate with teams to embed observability in the design phase, ensuring resilient and dependable systems. (20%)
Collaborate with Architecture and Platform teams to design automated solutions that eliminate repetitive tasks, enhance self-healing capabilities, improve service reliability, and enable developers to focus on delivering product features using proven, predictable frameworks (15%)
Prioritize security by collaborating with the engineering community to implement secure solutions, address issues proactively and reactively, and use lessons learned to establish best practices that minimize disruptions to product development (15%)
Elevate team capabilities through mentorship, project work assistance, design guidance, and participation in support and on-call rotations (15%)
Participate in incident response and post-incident analysis to drive improvements in system reliability by contributing to rapid recovery, conducting root cause analysis, and implementing changes based on post-mortem findings. (10%)
Ideal Skills, Experience, and Competencies:
Five (5) to eight (8) years of experience developing and supporting applications, preferably Kotlin-based microservices
Healthy attitude that prioritizes positive interactions that motivate and inspire creativity
Substantial, proven software development experience
Solid Site Reliability Engineering experience
Ability to show advanced proficiency in a relevant programming language (Kotlin, Groovy, Python)
Advanced experience with one or more of the following frameworks and technologies (Spring Boot, Spring, Kafka, Elasticsearch, RDS, OAuth)
Experience using AI within the SDLC to quickly deliver reliable solutions
Strong experience with observability platforms, such as Sumo Logic, Honeycomb, and similar tools, to track performance and detect issues
Solid understanding of core AWS services, including EKS, RDS, and MSK (Azure knowledge is a plus)
Extensive experience with container technologies such as Docker and Kubernetes, with an emphasis on operational reliability
Proven experience in debugging and troubleshooting applications
Database and SQL development experience
Understanding of IaC and configuration management using Terraform and Git
Understanding of CI/CD pipelines using GitHub Actions and ArgoCD
Experience working in a Lean/Agile environment
Focus on meeting project commitments with predictability and urgency
Strong presentation, written, and verbal communication skills
Strong desire for automation
Ability to build strong customer relationships and deliver customer-centric solutions
Ability to take on new opportunities and tough challenges with a sense of urgency, high energy, and enthusiasm
Ability to gain the confidence and trust of others through honesty, integrity, and authenticity
Ability to maneuver comfortably through complex policy, process, and people-related organizational dynamics
Ability to anticipate and adopt innovations in business-building digital and technology applications
Required Education & Certifications:
B.A./B.S. in related field or equivalent work experience
Compensation:
Qualified candidates can expect a compensation range of $125,000 to $180,000 or more depending on experience.
Expected Closing Date: 6/15/2025
#LI-Remote #LI-DS1