Site Reliability Engineering Manager

Company: Deepstreamtech
Location: London
Job Description:

Requirements

  • This role demands a proactive and hands‑on leader with deep technical expertise and strong critical thinking
  • Degree‑educated, or equivalent work experience
  • Experience in Production Support / SRE roles, including at least 3 years in a leadership capacity
  • Deep technical expertise in Oracle database – troubleshooting, scalability, performance tuning and optimization
  • Demonstrated experience implementing SRE frameworks – including SLOs, SLIs, incident management, and chaos engineering
  • Experience leading teams supporting systems deployed across mixed infrastructure (Cloud and On‑Premise, AWS preferred)
  • Solid understanding of change management, risk posture, and production readiness
  • Strong track record of delivering automation at scale, reducing toil, and eliminating manual operational tasks
  • Excellent communication and stakeholder management skills, particularly under pressure
  • Expertise in automation scripting (Python, Shell, PowerShell, etc.)
  • Familiarity with observability tools and practices (metrics, logging, tracing)
  • Ability to lead capacity planning and scalability strategies to support growth
  • Knowledge of clearing and settlement processes in financial markets
  • Familiarity with regulatory requirements and governance frameworks in financial services
  • Demonstrated ability to build, mentor, and retain high‑performing SRE teams
  • Demonstrable experience managing SRE or Production Support teams in a critically important financial services environment
  • Experience managing teams located across multiple locations and time zones
  • Excellent analytical skills, attention to detail, and problem‑solving abilities
  • Solid technical background in the core technologies with several years of experience
  • Ability to communicate clearly and concisely to IT and business teams and to senior management
  • Ability to break down complex technical issues into an easy‑to‑digest format
  • Familiarity with financial products and terminology

What the job involves

  • We are looking for a Manager – Site Reliability Engineering to strengthen the Production Management leadership team of Clearing Technology Service
  • You will be responsible for ensuring stability, resilience, and performance of our production systems while driving continuous improvement and SRE best practices across the platform
  • Assume end‑to‑end accountability for the Clearing production environment, ensuring high availability, optimal performance, and robust resilience of business‑critical systems
  • Act as Incident Commander during major incidents, leading resolution efforts, managing stakeholder communications, and driving root cause analysis and remediation
  • Build and mentor a high‑performing SRE team. Promote a culture of accountability, continuous improvement, and blameless postmortems to enhance operational excellence
  • Ensure consistent adherence to response and resolution SLAs. Oversee efficient ticket management and escalation processes through ServiceNow, removing blockers promptly
  • Develop strong partnerships across LCH and LSEG teams. Ensure timely delivery of business‑critical activities and transparent communication of risks and challenges
  • Monitor and analyse technical processes to identify improvement opportunities. Implement enhancements to minimise business disruption and improve operational efficiency
  • Ensure compliance with regulatory standards and internal governance. Proactively identify and mitigate operational risks
  • Establish and maintain robust observability practices, employing metrics, logging, and tracing to drive data‑driven decisions and improve system health
  • Provide out‑of‑hours and on‑call support
  • Be available for overnight support of production services to ensure successful completion of processing
  • Respond to overnight calls and resolve issues
  • Participate in Disaster Recovery exercises

Posted: May 18th, 2026