An established technology-driven organisation is seeking an experienced Site Reliability Engineer (SRE) in Glasgow to strengthen and scale their cloud-native data platform, utilising AWS, Snowflake, and Databricks. This position offers the opportunity to drive automation, resilience, and operational excellence across critical data services.
Key Responsibilities:
- Automate infrastructure provisioning and platform operations using Infrastructure as Code and CI/CD tools.
- Lead and execute reliability initiatives including disaster recovery planning, failure testing, and resilience validation.
- Define and manage service-level indicators, objectives, and agreements (SLIs/SLOs/SLAs) to drive measurable improvements in reliability.
- Build observability solutions to monitor AWS, Snowflake, and Databricks workloads.
- Collaborate with engineering teams to embed reliability best practices throughout platform development.
- Analyse incidents and proactively address root causes to improve availability and performance.
- Provide operational support, drive incident resolution, and implement automated fixes for recurring issues.
Requirements:
- Strong knowledge of SRE principles and practical experience defining SLAs, SLOs, and error budgets.
- Demonstrated AWS expertise (e.g., EC2, S3, IAM, VPC, CloudWatch) in production environments.
- Experience with observability tools, monitoring, and alerting practices.
- Proficiency in automation, Infrastructure as Code (Terraform, CloudFormation, or CDK), and scripting (Python/Bash).
- Exposure to Snowflake and/or Databricks data platforms.
- Background in DR/chaos engineering, CI/CD pipelines, GitOps, or supporting large-scale data environments.
