Company: Talenzon

Position: Site Reliability Engineer (SRE) – Cloud Platforms

Job Description:

Location: London, UK

Work Model: On-site

Role Type: Full-Time

Responsibilities:

Design and implement reliability strategies for high-availability production systems
Monitor system health, performance, and uptime across cloud infrastructure
Build automation to reduce manual operations and improve system reliability
Develop and maintain observability systems including logging, metrics, and tracing
Manage incident response processes and perform root cause analysis for production issues
Improve system resilience through capacity planning, performance optimisation, and fault tolerance
Collaborate with engineering teams to integrate reliability practices into the software development lifecycle
Implement infrastructure automation using Infrastructure as Code

Requirements:

Strong experience operating production systems in cloud environments such as Amazon Web Services, Google Cloud, or Microsoft Azure
Experience with container orchestration platforms such as Kubernetes
Strong experience with monitoring and observability tools such as Prometheus and Grafana
Proficiency in scripting or programming languages such as Python, Go, or Bash
Experience implementing Infrastructure as Code with tools such as Terraform
Strong understanding of Linux systems, networking, and distributed systems

Nice to Have:

Experience with CI/CD pipelines using platforms such as GitHub Actions or GitLab
Familiarity with incident management frameworks and reliability engineering practices (SLIs, SLOs, error budgets)
Experience supporting microservices architectures and high-scale systems
Knowledge of distributed tracing and performance monitoring

Posted: June 6th, 2026