Service Reliability Engineer

Company: Deepstreamtech

Location: London

Job Description:

Requirements

A strong background in systems administration (Linux/Windows) in a large-scale environment
Proficiency in at least one programming language (e.g., Python, Go, Java)
Hands-on experience with a major cloud platform (AWS, GCP, or Azure), with a strong preference for AWS
Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (e.g., Terraform, Ansible)
Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace)
Proven analytical and problem‑solving abilities with experience in a high‑pressure environment
Excellent communication skills and the ability to foster a collaborative team environment
(Desirable) Bachelor’s degree in an IT‑related field
(Desirable) Experience managing large‑scale, distributed systems for a global organization
(Desirable) Familiarity with IT governance standards like ITIL
(Desirable) Direct experience with ServiceNow for IT service management
(Desirable) Knowledge of chaos engineering, resilience testing, and advanced capacity planning

What the job involves

As a Site Reliability Engineer, you won’t just be supporting systems; you’ll be ensuring the services that connect artists and fans around the globe are always on.

System Reliability & Performance:
Design, build, and maintain the availability, scalability, and performance of critical services
Develop and maintain robust monitoring, alerting, and observability systems (e.g., using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution
Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement

Automation & Efficiency:
Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments, and scaling
Create and maintain scripts and custom code to support and enhance our operational toolset
Support and optimize CI/CD pipelines to improve deployment speed and reliability

Incident Management & Collaboration:
Participate in an on‑call rotation to troubleshoot and mitigate production incidents
Lead post‑incident reviews and root cause analyses to implement lasting solutions
Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle

Posted: May 19th, 2026