Senior Site Reliability Engineer

Company: Realm
Location: London
Job Description:

High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility.

Role Overview

  • Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform.
  • Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.
  • Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.

Responsibilities

  • Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements.
  • Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors.
  • Execution of large-scale data migrations between cloud and on-premises environments with focus on efficiency and cost.
  • Troubleshooting across the full stack, including hardware, networking, and distributed systems.
  • Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency.

Participation in an on-call rotation required (approximately one week per month).

Key Attributes

  • Strong ownership mindset with focus on delivery and accountability.
  • Experience building maintainable, well-documented systems in complex environments.
  • Ability to operate effectively in ambiguous and rapidly evolving contexts.
  • Clear and effective communication skills with collaborative, low-ego approach.

Requirements

  • 5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing.
  • Strong written and verbal communication skills in English.
  • Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes).
  • Programming or scripting experience in Go, Python, or Bash.
  • Familiarity with infrastructure automation and infrastructure-as-code tools.
  • Strong technical foundation in computing or related discipline.

Preferred Experience

  • Experience operating large-scale machine learning or AI compute workloads.
  • Background in multi-tenant distributed systems at scale.
  • Hands-on experience with data centre or bare-metal infrastructure.
  • Knowledge of high-performance networking technologies.
  • Experience managing large-scale storage systems (commercial or open-source).

Benefits

  • Competitive salary and equity package.
  • Retirement or pension contributions aligned with local standards.
  • Health coverage including medical, dental, and vision.

Posted: April 24th, 2026