Sr Site Reliability Engineer

Knack

Full-time

USA

devops

javascript

nodejs

golang

customer experience

The job listing has expired. Unfortunately, the hiring company is no longer accepting new applications.

Headquarters: US
URL: https://knack.com

We’re looking for someone to help improve our reliability and performance through deep analysis and remediation of our AWS infrastructure, monitors, alerts, and code.

Key Responsibilities

Refactor our existing monitors and alerts to be actionable and reliable, recommending and implementing diagnostic techniques and monitoring tools
Perform deep-dive analysis of RDS (Aurora PostgreSQL) performance, using that data to inform scaling policies and automation
Help discover correlations between customer experience and performance indicators to determine what is noticeable by customers, and suggest and implement improvements based on findings
Help us develop SLIs, SLOs, and SLAs that are impactful as they relate to our customers’ experience
Help triage outages and issues across multiple teams, services, and codebases as they arise, leading root cause analysis and creating stories to prevent and/or detect those issues in the future
Serve as technical lead for deep dives to identify solutions to prevent future incidents
Introduce chaos engineering, promoting experimentation in production to discover and remediate systemic weaknesses and improve performance and reliability

Skills, Knowledge, and Expertise

Expertise in AWS
Expertise with RDS, preferably Aurora PostgreSQL engine
Expertise with containerization
Experience with open source monitoring and visualization systems and tools, e.g. Prometheus (monitoring + tracing), Grafana/Kibana (dashboards), Graylog (logging)
Experience implementing, maintaining, and troubleshooting continuous integration/continuous delivery (CI/CD) tooling
Experience with implementing improvements in areas such as maintainability, scalability, availability, extensibility and security
Ability to work with many teams across disciplines (cloud, platform, development, QA, and security) to resolve issues as they arise and implement improvements
Experience with distributed tracing, diagnostic tooling, application performance monitoring, and the golden signals

Our Stack

Our stack is evolving over the next year and we’d love you to be a part of that!
Currently we’re using:

Back-end: JavaScript/TypeScript, Node.js, ES6, Go
Data: Aurora PostgreSQL, Redis, Elasticsearch
DevOps & Deployment: All things AWS, Terraform (and Terraform Cloud), Jenkins, GitHub, Grafana, Graylog
Testing: Playwright, Mocha, Jest
Front-end: Vue.js, Webpack, SCSS
