
SRE (Site Reliability Engineer)

Location : ,

Job Description

Title: Data Center SRE

 

  • Programming and scripting knowledge in Python and shell
  • Strong understanding and expertise in working with Linux and Windows OS
  • Strong troubleshooting and debugging skills.
  • Experience with SQL
  • Experience with a monitoring and observability stack, e.g., Zabbix, Katana, Elasticsearch, Prometheus, Grafana
  • Prior experience working with hardware devices such as Tegra boards (e.g., Jetson) is a plus

Tasks

  • Manage machine configuration using a configuration management tool, e.g., Puppet.
  • Manage machine status, pool, and OS using automation or the UI.
  • Prepare the targets to be ready for physical migration.
  • Validate the targets post physical migration by submitting tests to individual targets.
  • Debug test failures and identify whether each is related to the migration, a bad target, or an ongoing failure on ToT.
  • Enable the targets for production workloads.
  • Monitor job success on targets running production workloads.
  • Collaborate and coordinate closely with platform teams and lab techs on the migration.
  • Maintain and update the migration status and publish it weekly.