Site Reliability Engineering Lead – London

Company: SGI
Location: Greater London
Job Description:

We’re looking for a true SRE leader with a strong software engineering background. This isn’t an “on-call only” DevOps role: you’ll need to be comfortable reading and writing production code, understanding application behaviour in depth, and working alongside developers as a technical peer.

You’ll lead and mentor the SRE team, setting direction and raising the bar for reliability across our systems. You’ll take end-to-end ownership of production, ensuring availability, performance, and effective incident response, while defining SLIs and partnering with Product on meaningful SLOs and error budgets.

In practice, that means you’ll:

  • Own production systems (availability, performance, incident response)
  • Define SLIs/SLOs and use error budgets to guide decisions
  • Run incident management, on-call, and blameless postmortems
  • Get hands-on with code (PHP, Java/.NET) to troubleshoot and improve reliability
  • Drive automation and reduce operational toil
  • Build observability that gives real insight into system health
  • Partner with engineers to embed reliability into the SDLC

A big part of the role is shaping culture — creating a blameless environment, improving how we respond to incidents, and driving continuous, systemic improvements. You’ll also lead on capacity planning, performance optimisation, and cost efficiency as the platform scales.

We’re looking for someone who brings strong technical leadership, communicates clearly (especially during incidents), and takes real ownership of problems through to resolution. You should be comfortable operating at scale, have deep experience with SLIs/SLOs, incident management, and observability tooling, and be at home working with Linux, databases, cloud platforms (ideally Azure), Kubernetes, and Infrastructure as Code. Just as importantly, you should enjoy tackling complex, imperfect systems — and turning them into something reliable, scalable, and well-understood.

Posted: March 31st, 2026