Overview
We’re hiring an experienced AWS SRE Engineer to lead observability for a cloud platform. The role focuses on building and maintaining actionable Grafana dashboards, defining and measuring reliability (SLIs/SLOs/SLAs), owning alerting strategy, and driving improvements to platform resilience. This is an opportunity to shape operational excellence and influence engineering decisions across the stack.
What you’ll do (key responsibilities)
- Design, build and maintain Grafana dashboards that deliver actionable insights into performance, availability and capacity.
- Implement and improve observability for AWS-hosted applications and infrastructure (metrics, logs, traces).
- Define and track SLIs, SLOs and SLAs; manage error budgets and translate reliability targets into engineering priorities.
- Monitor services using the four golden signals (latency, traffic, errors, saturation) and operate an effective, noise‑aware alerting strategy.
- Support incident response, lead root cause analysis (RCA) and drive continuous reliability improvements.
- Embed observability into CI/CD and cloud operations; collaborate with platform, engineering and ops teams to improve operational efficiency.
Must‑have skills and experience
- 6+ years in SRE, Cloud Reliability or Cloud Operations roles.
- Strong, hands‑on AWS experience.
- Proven expertise building Grafana dashboards and working in observability/monitoring stacks.
- Solid understanding of SRE fundamentals (SLA, SLO, SLI, error budgets, golden signals).
- Track record of troubleshooting production systems and improving platform reliability.
- Strong communicator and team collaborator.
Nice‑to‑have
- Experience with Snowflake or Databricks.
- Familiarity with infrastructure as code (IaC), automation and cloud‑native operational tooling.