Grafana and Site Reliability Engineer

Company: Marks Sattin

Location: Glasgow

Posted: April 17th, 2026

Overview

We’re hiring an experienced AWS Site Reliability Engineer (SRE) to lead observability for a cloud platform. The role focuses on building and maintaining actionable Grafana dashboards, defining and measuring reliability (SLIs/SLOs/SLAs), owning alerting strategy, and driving improvements to platform resilience. This is an opportunity to shape operational excellence and influence engineering decisions across the stack.

What you’ll do (key responsibilities)

Design, build and maintain Grafana dashboards that deliver actionable insights into performance, availability and capacity.
Implement and improve observability for AWS-hosted applications and infrastructure (metrics, logs, traces).
Define and track SLIs, SLOs and SLAs; manage error budgets and translate reliability targets into engineering priorities.
Monitor the four golden signals (latency, traffic, errors and saturation) and operate an effective, noise‑aware alerting strategy.
Support incident response, run RCA processes and drive continuous reliability improvements.
Embed observability into CI/CD and cloud operations; collaborate with platform, engineering and ops teams to improve operational efficiency.

Must‑have skills and experience

6+ years in SRE, Cloud Reliability or Cloud Operations roles.
Strong, hands‑on AWS experience.
Proven expertise building Grafana dashboards and working in observability/monitoring stacks.
Solid understanding of SRE fundamentals (SLA, SLO, SLI, error budgets, golden signals).
Track record of troubleshooting production systems and improving platform reliability.
Strong communicator and team collaborator.

Nice‑to‑have

Experience with Snowflake or Databricks.
Familiarity with IaC, automation and cloud‑native operational tooling.
