Senior Site Reliability Engineer II
General Summary:
Site Reliability Engineers enhance the reliability, scalability, and performance of production systems across the organization. They bridge the gap between development and operations teams, collaborating with software engineers, product managers, and technical stakeholders to design and implement robust infrastructure solutions that support business-critical applications and services. SREs apply software engineering principles to operations problems, creating automated solutions that address system availability, latency, performance, and capacity.
Essential Duties and Responsibilities:
Design, build, implement, maintain, and support system infrastructure.
Define and improve software delivery, system configuration, security, performance, and operational mechanisms of varied Cloud infrastructures in use by different projects and company efforts.
Identify impact, present options, plan delivery activities, mitigate downtime risks, recommend strategies, and estimate level of effort for creating new or modifying existing Cloud infrastructures for projects.
Develop and leverage automation and monitoring capabilities for complex cloud-based solutions.
Consistently apply and enforce Cloud Engineering standards and best practices.
Assist in creating test, demonstration, and proof-of-concept environments.
Find and recommend technical improvements and cost-saving measures.
Participate in troubleshooting efforts and incident resolution activities.
Keep stakeholders abreast of the status of their requests.
Write technical documentation and keep process-related information current.
Knowledge, Skills, and/or Abilities Required:
5+ years of hands-on experience managing Windows servers, including DNS, networking, and IIS.
5+ years of experience programming in Python, PowerShell, and Bash.
3+ years of experience working with AWS offerings. Proficient in the use of EC2/ECS, RDS, S3, Route53, SWF, ELB, VPC networking, Redis, and OpenSearch.
Ability to write basic SQL queries and analyze data for troubleshooting purposes.
Practical experience building deployment pipelines on GitLab CI.
Practical experience using observability platforms like Dynatrace, Datadog, and CloudWatch.
Practical experience using .NET command line debugger mdbg or similar.
Knowledge of package-management tools like Artifactory.
Solid understanding of the DevOps mindset and the Infrastructure as Code (IaC) philosophy using CloudFormation or Terraform.
Possess strong analytical and problem-solving skills to resolve or coordinate the resolution of complex Cloud infrastructure issues in a flexible and effective manner.
On-call incident management with PagerDuty or similar tools.
Practical experience working in Agile, distributed environments.
Ability to work on multiple priorities and/or projects simultaneously, independently or with a team.
Demonstrated ability to learn quickly.
Excellent verbal and written communication skills.
Educational/Vocational/Previous Experience Recommendations:
5+ years of experience in areas related to Information Technology.
Bachelor’s degree preferred.
AWS certification preferred.
Working Conditions:
Professional, fast-paced remote/office environment.
Availability during off hours to assist in the resolution of production incidents.
Some travel may be required.