As a Site Reliability Engineering Lead, you’ll lead and partner with cross-functional teams to keep our platforms reliable, resilient, secure, and continuously improving. If you’re passionate about operational excellence and helping others succeed, we’d love to hear from you.

In this role, you will lead a team of Site Reliability Engineers focused on improving the reliability, resilience, and operational readiness of the platforms your team supports. You’ll partner closely with engineering, product, and security teams to reduce operational risk, strengthen incident response, and drive meaningful automation that improves service health and customer outcomes.

- Lead and develop a team of SREs – set direction, manage conflicting priorities and trade-offs, remove blockers, support on-call wellbeing, and keep work focused on the highest reliability risks and opportunities.
- People management: hire and onboard talent, provide regular coaching and feedback, support career development, and contribute to performance and progression processes.
- Own service reliability for the platforms your team supports: define and evolve operational metrics, and uphold standards for observability, monitoring, alerting, and operational readiness.
- Work closely with Security and Engineering to embed secure-by-default operations (e.g., patching, access controls, secrets management) and support audit and compliance needs.
- Participate in the on-call rota (including escalation/incident leadership as needed) and continuously improve runbooks, alerts, and operational readiness.
- Act as a senior escalation point during incidents, providing calm, structured coordination to restore service quickly and safely, and ensuring clear stakeholder communications.
- Lead blameless post-incident reviews and Root Cause Analyses (RCAs), ensuring actions are prioritised, tracked, and shared across teams.
- Partner with product and engineering teams to design for resilience, capacity, and recovery – systems that fail gracefully, recover quickly, and meet customer reliability expectations.
- Drive automation and reduce toil by improving platform tooling, CI/CD, standards, and self-service capabilities.

- Experience leading, mentoring, or managing engineers.
- Strong grasp of SRE/platform engineering practices, including infrastructure as code, observability, incident management, on-call operations, and post-incident reviews.
- Confidence working with cloud platforms, and a pragmatic approach to automation and reducing operational toil.
- Clear, structured communication with both engineers and stakeholders, especially when handling operational risk or coordinating incidents.
- A collaborative, learning-focused approach that builds psychological safety and values curiosity over blame.
