Role Purpose
Lead Level 3 support for critical digital services, ensuring high availability, fast incident recovery, and long-term resilience. Drive root cause elimination, design supportable architectures, oversee major changes, and mentor support teams. Ensure alignment with DDaT, DevOps, and Home Office service expectations.
Key Outcomes & Responsibilities
- Major Incident Leadership: Act as technical lead for P1/P2 incidents, coordinating recovery and communication.
- Root Cause Ownership: Lead formal RCAs, define corrective actions, and ensure follow-through via sprints/releases.
- Change & Release Governance: Review technical change plans, lead high-risk deployments, and support hotfix releases.
- Availability & Performance: Improve reliability through proactive monitoring, self-healing automation, and architectural enhancements.
- Environment Strategy: Maintain stable non-production environments and collaborate with environment management teams.
- Service Performance: Drive SLA achievement, service reviews, metrics analysis, and proactive improvements.
- Shift Left & Knowledge Management: Develop high-quality runbooks, automate manual tasks, and train L1/L2 teams.
- Transition Support: Provide documentation, knowledge transfer (KT), and pairing during onboarding and offboarding of support teams.
- Technical Leadership: Mentor engineers and collaborate with product, DevOps, and development teams.
Essential Skills (Must Have)
- Deep expertise in distributed systems, Java, JavaScript, microservices, APIs, and cloud platforms.
- Strong debugging skills using logs, metrics, traces, and profiling tools.
- Experience with CI/CD tooling and release management.
- Strong scripting and automation capabilities.
- Ability to lead technical bridges under pressure.
Desirable Skills (Nice To Have)
- Advanced cloud knowledge (AWS professional level).
- Experience with container orchestration (Kubernetes, ECS, AKS).
- Knowledge of reliability engineering practices (SRE).
- Experience improving infrastructure via IaC.
- Ability to contribute to architecture decisions.
Experience Profile
- 5 to 10 or more years in Level 3 support, DevOps engineering, or SRE roles.
- Significant experience managing critical systems with high availability requirements.
- Proven leadership in major incidents and change governance.
Ways of Working
- Operates within Agile product teams with DevOps principles.
- Leads service reviews, problem boards, and continual improvement cycles.
- Coaches and mentors engineering teams.
Location & Security
UK-based, with hybrid working as agreed with the client. Eligibility for Security Check (SC) clearance is required.
Certification (Preferred)
- AWS or Azure Professional-level certification
- SRE or DevOps Practitioner certifications
