Site Reliability Engineering Manager | London (2 days Hybrid)
We’re partnering with one of the UK’s most recognised and high‑traffic consumer tech platforms to find an Engineering Manager to lead their Site Reliability function.
The Role
This is a blended people leadership and technical role, responsible for operational excellence, observability, and reliability at scale across a platform that serves millions of users. You’ll own incident management processes, drive reliability engineering standards, and ensure the business meets its exceptionally high availability targets.
Key Responsibilities
- Own monitoring, alerting and observability strategy, so that product teams have high confidence in reliability and can detect and resolve incidents quickly
- Lead and standardise incident management processes, maintaining a culture of accountability, transparency and continuous learning
- Define reliability patterns and standards to reduce cascading failures across distributed systems
- Own and manage the reliability roadmap, OKR delivery and alignment with wider business goals
- Lead, develop and grow a team of engineers: set objectives, build growth plans and foster a psychologically safe, inclusive environment
What You’ll Need
- Proven experience managing SRE teams in production environments, covering observability, monitoring and service delivery
- Strong understanding of reliability in distributed microservices and cloud‑based architectures
- Experience with modern SRE tooling, incident management workflows and SLO/SLI frameworks
- Familiarity with platform engineering concepts and reducing friction for product teams
- Strong leadership, communication and stakeholder management skills
