Location: London, UK
Work Model: On-site
Role Type: Full-Time
What You’ll Do
- Design and implement reliability strategies for high‑availability production systems
- Monitor system health, performance, and uptime across cloud infrastructure
- Build automation to reduce manual operations and improve system reliability
- Develop and maintain observability systems including logging, metrics, and tracing
- Manage incident response processes and perform root cause analysis for production issues
- Improve system resilience through capacity planning, performance optimisation, and fault tolerance
- Collaborate with engineering teams to integrate reliability practices into the software development lifecycle
- Implement infrastructure automation using Infrastructure as Code
What We’re Looking For
Required Skills & Experience
- Strong experience operating production systems in cloud environments such as Amazon Web Services, Google Cloud, or Microsoft Azure
- Experience with container orchestration platforms such as Kubernetes
- Strong experience with monitoring and observability tools such as Prometheus and Grafana
- Proficiency in scripting or programming languages such as Python, Go, or Bash
- Experience implementing Infrastructure as Code with tools such as Terraform
- Strong understanding of Linux systems, networking, and distributed systems
Nice to Have
- Experience with CI/CD pipelines using platforms such as GitHub Actions or GitLab CI/CD
- Familiarity with incident management frameworks and reliability engineering practices (SLIs, SLOs, error budgets)
- Experience supporting microservices architectures and high-scale systems
- Knowledge of distributed tracing and performance monitoring
