Lead Cloud Site Reliability Engineer, Leadership (Azure or GCP), SLOs, SLIs, Automation
A leading financial services client is seeking a strong technical leader to help drive and support a large group of SREs across multiple locations. The role is split 50/50 between hands-on engineering and team management. This is an engineering role, not an operations role.
The role:
- Lead and mentor a team of up to 15 SREs, championing continuous improvement and engineering excellence.
- Partner with application teams as they migrate services to the cloud.
- Work with Product Owners and Engineering Leads to balance feature delivery with system reliability, performance and health.
- Use observability tooling, performance metrics and SRE principles to proactively identify issues and reduce operational toil.
- Implement incident and problem management practices, ensuring strong root cause analysis, a longer MTTF and a shorter MTTR.
- Champion SLOs, SLIs, error budgets and reliability‑first thinking.
- Influence platform direction and engineering standards to help shape resilient cloud services at scale.
Technical Skills required:
- Strong team management experience (day-to-day, mentoring and coaching).
- Strong cloud engineering background, ideally across Azure and GCP.
- Experience building or operating large‑scale, resilient cloud platforms.
- Deep understanding of observability tooling (metrics, logs, traces).
- Hands‑on experience with modern SRE practices:
  - SLOs / SLIs
  - Automation to reduce toil
  - Production readiness and robust post‑mortems
- Solid understanding of GitHub pipelines and Terraform modules.
- Proven experience leading high‑performing engineering teams.
- Ability to communicate complex technical topics in a clear, accessible way.
- Comfortable working with diverse stakeholder groups.