Lead Site Reliability Engineer

Kontakt.io

Full-time
USA
$190k-$230k per year
engineer
docker
aws
cloud
security

Kontakt.io is building the platform that care operations run on.

We reduce waste, cut costs, and increase revenue by improving throughput, asset utilization, and staff productivity. Our platform uses AI, RTLS, and EHR data to enable self-learning agents to automate workflows, adapt in real time, and orchestrate all care delivery operations.

Easy to deploy and scale, it gives a clear picture of spaces, equipment, and people, eliminating inefficiencies and enhancing the patient experience. With a measurable 10X ROI and more than 20 use cases, Kontakt.io is the go-to platform for better and faster care delivery operations.

We’re looking for a Lead Site Reliability Engineer to own the reliability, performance, and automation of our cloud-based, real-time platform. This role will focus on keeping our platform running smoothly 24/7 by minimizing downtime and improving observability, incident response, and self-healing automation. You will lead and scale the SRE team to ensure our infrastructure stays ahead of demand, operates efficiently, and meets the needs of our growing healthcare customers.

Responsibilities

  • Ensure 99.99% uptime across our cloud platform, meeting strict SLAs for healthcare customers.

  • Leverage your software engineering expertise to write high-quality, maintainable code that improves system reliability and operational efficiency.

  • Design and implement self-healing, fault-tolerant systems to prevent failures before they happen.

  • Define SLIs, SLOs, and SLAs, ensuring proactive performance monitoring and incident resolution.

  • Architect and manage scalable cloud infrastructure (AWS) for massive real-time data processing.

  • Optimize containerized environments (Kubernetes, Docker) to support multi-region deployments.

  • Lead the adoption of infrastructure as code (Terraform) to fully automate infrastructure management.

  • Build and refine a world-class monitoring, alerting, and logging system using Prometheus, Grafana, OpenTelemetry, and Datadog.

  • Lead incident response and on-call operations, reducing mean time to detection (MTTD) and mean time to resolution (MTTR).

  • Conduct blameless postmortems and continuously improve system resilience.

  • Reduce manual intervention through automated deployment, scaling, and failover mechanisms.

  • Partner with Security & Compliance teams to ensure infrastructure meets HIPAA and SOC 2 standards.

  • Lead disaster recovery and business continuity planning to ensure critical healthcare services are always available.

  • Drive technical strategy and roadmap for scalability, monitoring, and reliability engineering.

  • Collaborate with Product, Engineering, and Infrastructure teams to align SRE initiatives with business priorities.

What You Bring

  • 10+ years of experience in Site Reliability Engineering or Cloud Infrastructure.

  • 2+ years of software engineering experience.

  • Proven success scaling high-traffic, mission-critical platforms in SaaS, IoT, or healthcare.

  • Deep expertise in cloud platforms (AWS), Kubernetes, and distributed systems.

  • Strong background in monitoring, logging, and observability with Prometheus, OpenTelemetry, or similar tools.

  • Hands-on experience with incident management, postmortems, and building resilient systems.

  • Deep knowledge of CI/CD automation, GitOps, and infrastructure as code (Terraform, etc.).

  • A mature leadership approach, with the ability to drive technical strategy while growing and mentoring a high-performance SRE team.

  • Strong understanding of network security, access management, and compliance frameworks (HIPAA, SOC 2).

Bonus Points If You Have:

  • Experience with healthcare IT, including EHR data, FHIR, and HL7 interoperability.

  • Expertise in real-time distributed systems, event-driven architectures, or large-scale data pipelines.

  • Prior experience leading on-call rotations and major incident management processes.

Why You'll Love It Here

  • Own Mission-Critical Reliability – Ensure hospitals and care facilities always stay online with a 99.99% uptime healthcare platform.

  • Scale AI-Powered Infrastructure – Work on real-time automation and self-healing cloud systems that orchestrate care delivery.

  • Drive Big Impact in Healthcare – Help reduce waste, optimize resources, and improve patient care with technology that delivers 10X ROI.

  • Automation-First Culture – Minimize manual ops with cutting-edge automation, observability, and incident response strategies.

  • Join a High-Performing Team – Work with top engineers, AI experts, and healthcare innovators solving real-world challenges.

$190,000 - $230,000 a year

Ready to Build the Future of Healthcare?

Apply now and help scale the platform that care operations run on. 🚀

About the job

2 Applicants
Posted 5 days ago

Working Nomads curates remote digital jobs from around the web.

© 2025 Working Nomads.