Requirements
- A strong background in systems administration (Linux/Windows) in a large-scale environment
- Proficiency in at least one programming language (e.g., Python, Go, Java)
- Hands-on experience with a major cloud platform (AWS, GCP, or Azure), with a strong preference for AWS
- Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (e.g., Terraform, Ansible)
- Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace)
- Proven analytical and problem‑solving abilities with experience in a high‑pressure environment
- Excellent communication skills and the ability to foster a collaborative team environment
- (Desirable) Bachelor's degree in an IT‑related field
- (Desirable) Experience managing large‑scale, distributed systems for a global organization
- (Desirable) Familiarity with IT governance standards like ITIL
- (Desirable) Direct experience with ServiceNow for IT service management
- (Desirable) Knowledge of chaos engineering, resilience testing, and advanced capacity planning
What the job involves
As a Site Reliability Engineer, you won't just be supporting systems; you'll be ensuring the services that connect artists and fans around the globe are always on.
- System Reliability & Performance:
  - Design, build, and maintain the availability, scalability, and performance of critical services
  - Develop and maintain robust monitoring, alerting, and observability systems (e.g., using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution
  - Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement
- Automation & Efficiency:
  - Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments, and scaling
  - Create and maintain scripts and custom code to support and enhance our operational toolset
  - Support and optimize CI/CD pipelines to improve deployment speed and reliability
- Incident Management & Collaboration:
  - Participate in an on‑call rotation to troubleshoot and mitigate production incidents
  - Lead post‑incident reviews and root cause analyses to implement lasting solutions
  - Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle
