Grafana and Site Reliability Engineer

Company: Marks Sattin
Location: Glasgow
Job Description:

Overview

We’re hiring an experienced AWS Site Reliability Engineer to lead observability for a cloud platform. The role focuses on building and maintaining actionable Grafana dashboards, defining and measuring reliability (SLIs/SLOs/SLAs), owning the alerting strategy, and driving improvements to platform resilience. This is an opportunity to shape operational excellence and influence engineering decisions across the stack.

What you’ll do (key responsibilities)

  • Design, build and maintain Grafana dashboards that deliver actionable insights into performance, availability and capacity.
  • Implement and improve observability for AWS-hosted applications and infrastructure (metrics, logs, traces).
  • Define and track SLIs, SLOs and SLAs; manage error budgets and translate reliability targets into engineering priorities.
  • Monitor using golden signals and operate an effective, noise‑aware alerting strategy.
  • Support incident response, run root cause analysis (RCA) processes and drive continuous reliability improvements.
  • Embed observability into CI/CD and cloud operations; collaborate with platform, engineering and ops teams to improve operational efficiency.

Must‑have skills and experience

  • 6+ years in SRE, Cloud Reliability or Cloud Operations roles.
  • Strong, hands‑on AWS experience.
  • Proven expertise building Grafana dashboards and working in observability/monitoring stacks.
  • Solid understanding of SRE fundamentals (SLA, SLO, SLI, error budgets, golden signals).
  • Track record of troubleshooting production systems and improving platform reliability.
  • Strong communicator and team collaborator.

Nice‑to‑have

  • Experience with Snowflake or Databricks.
  • Familiarity with IaC, automation and cloud‑native operational tooling.

Posted: April 17th, 2026