Overview
Location: London (hybrid working)
The Role: This is a specialist resilience engineering role focused on ensuring the organisation can effectively respond to and recover from major technology disruptions — including cyber incidents, infrastructure failures, cloud outages, and data integrity issues.
You’ll be responsible for evolving Business Continuity (BC) and Disaster Recovery (DR) from largely document-driven processes into robust, engineered, and continuously tested capabilities.
Working closely with Technology and Security teams, you will design and implement recovery strategies that ensure critical services can be restored within defined recovery objectives, minimising operational and business impact.
Key Responsibilities
Business Continuity & Disaster Recovery
- Own and maintain the organisation’s BC/DR strategy across infrastructure and critical services
- Develop and continuously improve recovery plans aligned to business priorities
- Drive the shift from static documentation to practical, testable recovery capabilities
Resilient Architecture & Engineering
- Design and support resilient solutions across cloud and on-prem environments
- Embed recovery, redundancy, and fault tolerance into system design
- Implement backup, replication, and failover strategies to enable rapid service restoration
Risk & Recovery Management
- Define and manage Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO)
- Validate backup integrity, restoration processes, and recovery sequencing
- Identify gaps in resilience and drive remediation initiatives
Testing & Simulation
- Plan and execute disaster recovery testing, failover exercises, and scenario simulations
- Lead large-scale outage and cyber recovery exercises with technical teams
- Capture insights and continuously improve recovery readiness
Incident Response & Recovery
- Act as a technical SME during major incidents and recovery events
- Contribute to post-incident reviews and ensure corrective actions are implemented
- Partner with Security teams to align DR with cyber incident response
Stakeholder Engagement & Governance
- Work with technical and business stakeholders to embed resilience requirements
- Ensure recovery considerations are integrated into architecture and change processes
Documentation & Capability Development
- Develop clear recovery playbooks for key failure scenarios
- Maintain structured, actionable documentation for recovery processes
- Build internal capability and reduce reliance on external support
Skills & Experience
- 5+ years in infrastructure, cloud, or platform engineering
- Proven experience delivering Disaster Recovery and Business Continuity solutions
- Strong understanding of cyber-related disruption scenarios (e.g. ransomware, identity compromise)
- Experience supporting incident response and recovery activities
- Deep knowledge of backup, replication, high availability, and failover design
- Experience defining and managing RTO and RPO in enterprise environments
- Strong analytical and problem-solving skills
- Ability to remain calm and structured during major incidents
- Excellent communication and documentation skills
Desirable
- Cloud, cybersecurity, or resilience certifications
- Experience in complex, multi-system enterprise environments
- Exposure to structured DR testing or cyber simulation exercises
- Background in high-availability or mission-critical systems
