SRE (Site Reliability Engineer)

Job Description

Site Reliability Engineer

The Site Reliability Engineer (SRE) is a technician who uses a broad range of skills to improve the reliability of critical customer-facing digital assets. The SRE is responsible for maintaining the availability and performance of relevant systems by supporting, building, and enhancing applications and tools, and by engaging with infrastructure teams. The SRE designs and configures systems that monitor and alert on critical applications and automate issue resolution. The role focuses on providing solutions that are robust, scalable, and highly available. For internal processes and technologies, the SRE builds systems to streamline operations and reduce friction. The SRE is part of an on-call rotation to troubleshoot production issues, with the specific goal of building resilient mitigation processes.

 

  • Experience with deployments and operations of 24x7, high-volume, highly available systems.
  • Experience with cloud scaling and the ability to drive automation and modernization initiatives.
  • Enjoys working with a large variety of services and new technologies.
  • Solid understanding of development, debugging, administration, and automation frameworks: C#/.NET, PowerShell, Python, Ansible, etc.
  • Experience with logging platforms and application performance metrics: Datadog, New Relic, Splunk, ELK, Dynatrace, App Insights Analytics, etc.

 

  • Certification(s) specific to the Architecture discipline.
  • 5+ years of experience working with technical teams.
  • Strong emphasis on SRE as an engineering discipline with a focus on automation.

 

  • Ability to write detailed solution specifications, diagrams, best practices/standards documentation, operating procedures, test plans/test reports, etc.
  • Experience supporting infrastructure and services in public cloud environments (AWS, GCP, etc.).
  • Experience building and supporting containerized application technologies, including Docker and Kubernetes.
  • Experience with public cloud cost management.
  • Experience in performance engineering and capacity planning.
  • Prior success in automating a real-world production environment.
  • Knowledge of IP networking, VPNs, DNS, load balancing, and firewalls.
  • Expertise with monitoring tools such as Splunk, AppDynamics, Nagios, or New Relic.
  • Experience with software development and testing processes in an agile environment.

 

  • Applications are written in .NET (Python or other scripting experience is a plus); we need more of a development background than an operations one.
  • Automation experience: Ansible preferred, but Terraform is fine as well.
  • Doesn’t need to come from a 24x7 environment but must be comfortable working in one.
  • AWS preferred, but any cloud experience is good.
  • Must have experience with Kubernetes and Docker.