Step into the role of Lead Site Reliability Engineer (SRE) at Barclays, where you will be a senior technical expert responsible for driving end‑to‑end resilience, reliability, and scalability across our mission‑critical virtual platform. This role focuses on ensuring systems are designed for fault tolerance, observability, and operational excellence.
You will perform deep technical reviews, troubleshoot complex issues, and define patterns for resiliency by design. As a hands‑on engineer, you will collaborate with development and production support teams, advocate chaos engineering, and build a culture of designing for failure. This position requires strong technical breadth across infrastructure, applications, networks, databases, and integrations, combined with expertise in modern reliability engineering practices.
Key Responsibilities
- Reliability Engineering: Drive strategies to improve reliability, maintainability, and scalability across platform components.
- Architecture and Design Review: Conduct deep technical assessments of system architectures, identifying risks and recommending improvements for fault tolerance and disaster recovery.
- Observability & Monitoring: Design and implement full‑stack observability solutions, including metrics, logging, distributed tracing, and alerting.
- Incident Management & Root Cause Analysis: Act as a senior escalation point for production incidents, lead RCA, and implement permanent fixes to prevent recurrence.
- Chaos Engineering & Failure Testing: Advocate and implement chaos engineering principles to validate system resilience under real‑world failure scenarios.
- Automation & Tooling: Develop automation for failover, capacity management, and self‑healing mechanisms to reduce operational risk.
- Continuous Improvement: Analyse service risk assessments and production incidents to identify systemic issues and drive long‑term improvements.
To be successful as a Lead Site Reliability Engineer (SRE), you should have experience with:
- Technical Expertise: Proven experience building and operating fault‑tolerant, highly available systems at scale.
- Architecture & Design: Strong knowledge of distributed systems, resiliency patterns (circuit breakers, retries, failover), and disaster recovery strategies.
- Problem‑Solving: Ability to troubleshoot complex technical issues across distributed systems and perform deep root cause analysis.
- Collaboration & Influence: Skilled at working with development, operations, and architecture teams to embed reliability into design and delivery.
Additional Highly Valued Skills May Include
- Understanding of cloud solutions, preferably VMware products.
- Exposure to coding in Python.
You may be assessed on the key critical skills relevant for success in the role, such as risk and controls, change and transformation, business acumen, strategic thinking, and digital and technology, as well as job‑specific technical skills.
This role is based in Knutsford, with a hybrid working model requiring at least 2‑3 days per week in the office.
Purpose of the role
To apply software engineering techniques, automation, and incident response best practices to ensure the reliability, availability, and scalability of systems, platforms, and technology.
Accountabilities
- Availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
- Resolution and analysis of, and response to, system outages and disruptions, and implementation of measures to prevent similar incidents from recurring.
- Development of tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
- Monitoring and optimisation of system performance and resource usage, identification and resolution of bottlenecks, and implementation of best practices for performance tuning.
- Collaboration with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and close work with other teams to ensure smooth and efficient operations.
- Staying informed of industry technology trends and innovations, and actively contributing to the organisation’s technology communities to foster a culture of technical excellence and growth.
All colleagues will be expected to demonstrate the Barclays Values of Respect, Integrity, Service, Excellence and Stewardship – our moral compass, helping us do what we believe is right. They will also be expected to demonstrate the Barclays Mindset – to Empower, Challenge and Drive – the operating manual for how we behave.