Company: Universal Music Group

Apply for the Service Reliability Engineer role

Location: Greater London

Job Description:

Role Overview

As a Site Reliability Engineer, you won’t just be supporting systems; you’ll be ensuring the services that connect artists and fans around the globe are always on.

Responsibilities

Design, build, and maintain the availability, scalability, and performance of critical services.
Develop and maintain robust monitoring, alerting, and observability systems (e.g., using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution.
Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement.
Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments, and scaling.
Create and maintain scripts and custom code to support and enhance our operational toolset.
Support and optimise CI/CD pipelines to improve deployment speed and reliability.
Participate in an on‑call rotation to troubleshoot and mitigate production incidents.
Lead post‑incident reviews and root cause analyses to implement lasting solutions.
Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle.
Act as the final escalation point for SRE operations and coordinate cross‑functional teams during high‑severity events.
Develop and refine escalation frameworks for the Global Technical Operations Centre.
Conduct deep‑dive root cause analysis for recurring, complex problems and develop long‑term solutions.
Mentor and elevate the team, serving as a technical leader and providing training on advanced security concepts, threat landscapes, and best practices.
Collaborate with DevOps and application architects to enforce standards and promote IaC and toil reduction.
Identify opportunities for network automation, scripting, and tool development to streamline operational tasks.
Create and maintain comprehensive documentation for configurations, SOPs, and incident response protocols.
Communicate effectively with technical and non‑technical stakeholders, including senior management, regarding incident status, resolution plans, and security issues.
Foster a culture of continuous learning and operational excellence within the team.
Work outside standard business hours will occasionally be required.

Qualifications

Strong background in systems administration (Linux/Windows) in a large‑scale environment.
Proficiency in at least one programming language (e.g., Python, Go, Java).
Hands‑on experience with a major cloud platform (AWS, GCP, or Azure), with a strong preference for AWS.
Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (e.g., Terraform, Ansible).
Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace).
Proven analytical and problem‑solving abilities in a high‑pressure environment.
Excellent communication skills and the ability to foster a collaborative team environment.

Preferred Experience & Skills

Bachelor’s degree in an IT‑related field.
Experience managing large‑scale, distributed systems for a global organisation.
Familiarity with IT governance standards like ITIL.
Direct experience with ServiceNow for IT service management.
Knowledge of chaos engineering, resilience testing, and advanced capacity planning.

Everyone is welcome to apply for our roles, and we are determined to ensure that no applicant or employee receives less favourable treatment because of gender, race, disability, sexual orientation, religion, belief, age, marital status, background, pregnancy, or caring responsibilities. We also recognise the importance of diversity of thought within our teams and are fully committed to embracing the talents of people with autism, dyslexia, ADHD, and other forms of neurocognitive variation.

Posted: April 21st, 2026
