Senior Manager - DevOps

Full-time
USA
$150k-$220k per year
Senior Level
Posted 2 hours ago

TrueML Products is seeking a highly experienced and strategic Sr. Manager, DevOps to lead our infrastructure and platform engineering efforts. This role is critical in driving our cloud architecture strategy, establishing elite CI/CD standards, and ensuring the scalability and reliability of our machine learning-driven products.

Reporting to the Sr. Director, Program & Operations, you will lead the evolution of our internal developer platform and infrastructure-as-code (IaC) architecture. The ideal candidate is a hands-on leader with a 'systems-thinking' mindset. We are looking for a visionary who thrives on solving complex distributed systems challenges and for whom leveraging GenAI and AIOps tooling to optimize system performance and automation is second nature.

What You'll Do (Technical Leadership & Strategy):

  • Define and execute the long-term strategic vision for Infrastructure as Code (IaC), CI/CD evolution, and cloud-native architecture to support TrueML’s scaling needs.

  • Lead the design and implementation of self-service internal platforms to reduce developer cognitive load, enabling feature teams to deploy and manage services with minimal friction at increased velocity.

  • Act as the primary stakeholder for cloud spend (AWS); drive cost-optimization initiatives and lead contract negotiations for the DevOps toolstack and third-party vendors.

  • Ensure the infrastructure architecture supports strict High Availability (HA) requirements and robust Disaster Recovery (DR) protocols, maintaining system integrity across multiple regions.

  • Oversee the implementation and evolution of comprehensive monitoring, logging, and distributed tracing systems, leveraging AIOps to move from reactive to predictive system maintenance.

  • Champion security by design by integrating automated vulnerability scanning, secret management, and compliance checks directly into the automated build pipelines.

  • Serve as the ultimate escalation point for major production outages, facilitating blameless post-mortem reviews that focus on systemic improvements rather than individual error.

  • Maintain deep technical currency in container orchestration (Kubernetes), serverless patterns, and modern automation frameworks to provide meaningful mentorship and architectural guidance to senior engineering staff.

What You'll Do (Hands-On Engineering & Technical Execution):

  • Maintain the ability to write and review high-quality code in languages like Python, Go, or Bash to automate complex operational tasks and system integrations.

  • Develop Terraform Infrastructure as Code hands-on for resource provisioning.

  • Directly architect and troubleshoot complex CI/CD workflows (GitHub Actions, ArgoCD, Atlantis), ensuring build-and-deploy cycles are optimized for speed and reliability.

  • Proactively manage and tune container orchestration environments, including hands-on configuration of Ingress controllers, declarative GitOps workflows, and cluster autoscaling.

  • Lead from the front during critical incidents by conducting deep-dive technical analysis across the EKS stack, troubleshooting node-level kernel panics, VPC CNI networking bottlenecks, and RDS performance constraints to minimize MTTR.

  • Conduct hands-on audits of cloud configurations and IAM policies, implementing 'least privilege' access controls and automated remediation scripts.

  • Directly manage the integration and API configurations between various tools in the DevOps stack (e.g., connecting Jira, VictorOps, Slack, and Observe for seamless incident flow).

What You'll Do (People Leadership & Engineering Collaboration):

  • Recruit, hire, and develop a world-class team of DevOps Engineers; provide career pathing and technical mentorship to foster a culture of continuous learning.

  • Partner closely with Engineering Managers to align infrastructure deliverables with the product roadmap, ensuring DevOps is an accelerator rather than a bottleneck.

  • Collaborate with the Quality Engineering and Security leadership to define and enforce 'Definition of Done' standards that include automated testing and security gates.

  • Set clear, measurable goals (KPIs and OKRs) for the team, conducting regular performance reviews and providing feedback to drive individual and collective excellence.

  • Lead internal Brunch & Learns to educate the broader engineering organization on modern cloud-native patterns and self-service capabilities.

Who You Are (Qualifications):

  • Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.

  • 10+ years of experience in DevOps, Site Reliability Engineering (SRE), or Software Engineering; 5+ years of experience managing engineers.

  • Expert-level mastery of AWS and experience managing multi-region, high-availability deployments.

  • Advanced experience with Kubernetes (K8s) and Docker, including cluster management, networking, and scaling in a production environment.

  • Proficiency in Terraform to drive consistency and automation across all infrastructure layers. Experience with Atlantis is a plus.

  • Deep experience designing and maintaining complex pipelines (GitHub Actions, GitLab CI, or Jenkins) and mastery of scripting languages like Python, Go, or Bash.

  • Hands-on experience with modern monitoring, observability, and tracing stacks (Datadog, Observe) and a firm grasp of SRE principles (SLIs/SLOs/Error Budgets).

  • Experience acting as an Incident Commander for high-severity outages and fostering a 'blameless' post-mortem culture.

  • Demonstrated ability to influence executive leadership and collaborate cross-functionally with Product, Engineering, and Security teams.

  • Experience integrating AI-assisted productivity tools (Cline, GitHub Copilot) into the engineering workflow to accelerate delivery.

Ways to 'Stand Out':

  • Experience leading organizational platform migration, including the development of rollback strategies, stakeholder communication plans, and post-migration validation

  • Prior experience working with high-velocity, product-driven early-to-mid stage technology companies where reliability, extensibility, and availability were mission-critical to success

  • AWS or Kubernetes certifications are a plus, but not in lieu of hands-on experience with the same in production environments

  • Notable contributions to Open Source projects or communities

$150,000 - $220,000 a year
