Site Reliability Engineer (SRE) – Cloud Platforms

Company: Talenzon
Job Description:

Location: London, UK

Work Model: On-site

Role Type: Full-Time

What You’ll Do

  • Design and implement reliability strategies for high‑availability production systems
  • Monitor system health, performance, and uptime across cloud infrastructure
  • Build automation to reduce manual operations and improve system reliability
  • Develop and maintain observability systems including logging, metrics, and tracing
  • Manage incident response processes and perform root cause analysis for production issues
  • Improve system resilience through capacity planning, performance optimisation, and fault tolerance
  • Collaborate with engineering teams to integrate reliability practices into the software development lifecycle
  • Implement infrastructure automation using Infrastructure as Code

What We’re Looking For

Required Skills & Experience

  • Strong experience operating production systems in cloud environments such as Amazon Web Services, Google Cloud, or Microsoft Azure
  • Experience with container orchestration platforms such as Kubernetes
  • Strong experience with monitoring and observability tools such as Prometheus and Grafana
  • Proficiency in scripting or programming languages such as Python, Go, or Bash
  • Experience implementing Infrastructure as Code with tools such as Terraform
  • Strong understanding of Linux systems, networking, and distributed systems

Nice to Have

  • Experience with CI/CD pipelines using platforms such as GitHub Actions or GitLab CI/CD
  • Familiarity with incident management frameworks and reliability engineering practices (SLIs, SLOs, error budgets)
  • Experience supporting microservices architectures and high-scale systems
  • Knowledge of distributed tracing and performance monitoring

Posted: June 6th, 2026