Location : ,
Role: Site Reliability Engineer
Job Summary:
The SRE is responsible for maintaining the availability and performance of relevant systems by supporting, building, and enhancing applications and tools, and by engaging with infrastructure teams.
The SRE designs and configures systems that monitor and alert on critical applications and that automate issue resolution.
This role includes a focus on providing solutions that are robust, scalable, and highly available. For internal processes and technologies, the SRE will build systems to streamline operations and reduce friction.
The SRE will be part of an on-call rotation to troubleshoot production issues with the specific goal of building resilient mitigation processes.
Experience with deployment and operation of 24x7, high-volume, highly available systems.
Demonstrate a solid understanding of development, debugging, administration, and automation frameworks: C#/.NET, PowerShell, Python, Ansible, etc.
Experience with logging platforms and application performance metrics: Datadog, New Relic, Splunk, ELK, Dynatrace, App Insights Analytics, etc.
Ability to write detailed solution specifications, diagrams, best practices/standards documentation, operating procedures, test plans/test reports, etc.
Experience supporting infrastructure and services in public cloud environments (AWS, GCP, etc.).
Experience building and supporting containerized application technologies, including Docker, Kubernetes.
Experience with public cloud cost management. Experience in performance engineering and capacity planning. Prior success in automating a real-world production environment.
Knowledge of IP networking, VPNs, DNS, load balancing, and firewalls.
Expertise with monitoring tools such as Splunk, AppDynamics, Nagios, or New Relic. Experience with software development and testing processes in an agile environment.
Certification(s) specific to the architecture discipline.