Site Reliability Engineering Manager

Company: Deepstreamtech

Location: London

Job Description:

Requirements

This role demands a proactive and hands‑on leader with deep technical expertise and strong critical thinking
Degree educated or equivalent work experience
A number of years in Production Support / SRE roles, with at least 3 years in a leadership capacity
Deep technical expertise in Oracle database – troubleshooting, scalability, performance tuning and optimisation
Demonstrated experience implementing SRE frameworks – including SLOs, SLIs, incident management, and chaos engineering
Experience leading teams supporting systems deployed across mixed infrastructure (Cloud and On‑Premise, AWS preferred)
Solid understanding of change management, risk posture, and production readiness
Strong track record of delivering automation at scale, reducing toil, and eliminating manual operational tasks
Excellent communication and stakeholder management skills, particularly under pressure
Expertise in automation (Python, Shell, PowerShell etc.)
Familiarity with observability tools and practices (metrics, logging, tracing)
Ability to lead capacity planning and scalability strategies to support growth
Knowledge of clearing and settlement processes in financial markets
Familiarity with regulatory requirements and governance frameworks in financial services
Demonstrated ability to build, mentor, and retain high‑performing SRE teams
Demonstrable experience managing SRE or Production Support teams in a critically important financial services environment
Experience managing teams located across multiple locations and time zones
Excellent analytical skills, attention to detail, and problem‑solving abilities
Solid technical background in the core technologies with several years of experience
Ability to communicate clearly and concisely to IT and business teams and to senior management
Ability to break down complex technical issues into an easy‑to‑digest format
Familiarity with financial products and terminology

What the job involves

We are looking for a Manager – Site Reliability Engineering to strengthen the Production Management leadership team of Clearing Technology Service
You will be responsible for ensuring stability, resilience, and performance of our production systems while driving continuous improvement and SRE best practices across the platform
Assume end‑to‑end accountability for the Clearing production environment, ensuring high availability, optimal performance, and robust resilience of business‑critical systems
Act as Incident Commander during major incidents, leading resolution efforts, managing stakeholder communications, and driving root cause analysis and remediation
Build and mentor a high‑performing SRE team. Promote a culture of accountability, continuous improvement, and blameless postmortems to enhance operational excellence
Ensure consistent adherence to response and resolution SLAs. Oversee efficient ticket management and escalation processes through ServiceNow, removing blockers promptly
Develop strong partnerships across LCH and LSEG teams. Ensure timely delivery of business‑critical activities and transparent communication of risks and challenges
Monitor and analyse technical processes to identify improvement opportunities. Implement enhancements to minimise business disruption and improve operational efficiency
Ensure compliance with regulatory standards and internal governance. Proactively identify and mitigate operational risks
Establish and maintain robust observability practices, employing metrics, logging, and tracing to drive data‑driven decisions and improve system health
Out of hours support / On‑call support
Be available for overnight support of production services to ensure successful completion of processing
Respond to overnight calls and deal with issues
Participate in Disaster Recovery exercises

Posted: May 18th, 2026