Senior Site Reliability Engineer

Company: SoCode Recruitment
Location: Cambridge
Job Description:

We’re hiring a Senior Site Reliability Engineer to help strengthen the reliability, observability and operational maturity of a cloud‑native SaaS platform operating within a regulated environment.

This is a hands‑on role focused on production systems, monitoring, incident response, automation and operational excellence across a Kubernetes‑based AWS platform.

You’ll work closely with Platform Engineering and Application teams to improve system health, reduce operational risk and build scalable reliability practices as the business continues to grow.

Key responsibilities

  • Building and improving observability across metrics, logs and traces
  • Developing actionable dashboards, alerts, runbooks and operational tooling
  • Supporting production systems, incident response and root cause analysis
  • Improving reliability, resilience, deployment feedback loops and operational readiness
  • Identifying operational inefficiencies and automating repetitive toil
  • Driving post‑incident reviews and long‑term corrective improvements
  • Helping define SLOs, SLIs and reliability standards across customer‑critical services

Tech environment includes

  • AWS
  • Kubernetes / EKS
  • Observability
  • Prometheus
  • Grafana
  • OpenTelemetry
  • GitOps
  • Argo CD
  • CI/CD
  • Cloud Operations

We’re looking for someone with

  • Strong experience supporting Kubernetes‑based production environments
  • Practical AWS and cloud‑native infrastructure knowledge
  • Experience with observability, monitoring and incident management
  • Strong scripting or automation capability (Python, Go, Bash, TypeScript etc.)
  • Calm, pragmatic thinking during live operational incidents
  • Passion for improving reliability and reducing operational noise
  • Experience within SaaS, fintech or regulated environments (highly beneficial)

This is an excellent opportunity for an engineer who enjoys solving real production challenges, improving operational resilience and building mature SRE practices within a scaling engineering organisation.

Posted: June 11th, 2026