Service Reliability Engineer

Company: Deepstreamtech
Location: London
Job Description:

Requirements

  • A strong background in systems administration (Linux/Windows) in a large-scale environment
  • Proficiency in at least one programming language (e.g., Python, Go, Java)
  • Hands-on experience with a major cloud platform (AWS, GCP, or Azure), with a strong preference for AWS
  • Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (e.g., Terraform, Ansible)
  • Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace)
  • Proven analytical and problem‑solving abilities with experience in a high‑pressure environment
  • Excellent communication skills and the ability to foster a collaborative team environment
  • (Desirable) Bachelor’s degree in an IT‑related field
  • (Desirable) Experience managing large‑scale, distributed systems for a global organization
  • (Desirable) Familiarity with IT governance standards like ITIL
  • (Desirable) Direct experience with ServiceNow for IT service management
  • (Desirable) Knowledge of chaos engineering, resilience testing, and advanced capacity planning

What the job involves

As a Site Reliability Engineer, you won’t just be supporting systems; you’ll be ensuring the services that connect artists and fans around the globe are always on.

System Reliability & Performance:

  • Design, build, and maintain the availability, scalability, and performance of critical services
  • Develop and maintain robust monitoring, alerting, and observability systems (e.g., using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution
  • Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement

Automation & Efficiency:

  • Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments, and scaling
  • Create and maintain scripts and custom code to support and enhance our operational toolset
  • Support and optimize CI/CD pipelines to improve deployment speed and reliability

Incident Management & Collaboration:

  • Participate in an on‑call rotation to troubleshoot and mitigate production incidents
  • Lead post‑incident reviews and root cause analyses to implement lasting solutions
  • Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle


Posted: May 19th, 2026