Senior Site Reliability Engineer

High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility.

Company: Realm

Location: London

Job Description:

Role Overview

Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform.
Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.
Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.

Responsibilities

Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements.
Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors.
Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost.
Troubleshooting across the full stack, including hardware, networking, and distributed systems.
Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency.

Participation in an on-call rotation required (approximately one week per month).

Key Attributes

Strong ownership mindset with focus on delivery and accountability.
Experience building maintainable, well-documented systems in complex environments.
Ability to operate effectively in ambiguous and rapidly evolving contexts.
Clear and effective communication skills with collaborative, low-ego approach.

Requirements

5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing.
Strong written and verbal communication skills in English.
Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes or similar).
Programming or scripting experience in Go, Python, or Bash.
Familiarity with infrastructure automation and infrastructure-as-code tools.
Strong technical foundation in computing or related discipline.

Preferred Experience

Experience operating large-scale machine learning or AI-compute workloads.
Background in multi-tenant distributed systems at scale.
Hands-on experience with data centre or bare-metal infrastructure.
Knowledge of high-performance networking technologies.
Experience managing large-scale storage systems (commercial or open-source).

Benefits

Competitive salary and equity package.
Retirement or pension contributions aligned with local standards.
Health coverage including medical, dental, and vision.

Posted: April 24th, 2026