Site Reliability Engineer III

Company: McDonald's
Location: London
Job Description:

This opportunity is part of the Global Technology Infrastructure & Operations team (GTIO), where our mission is to deliver modern and relevant technology that supports the way McDonald's works. We provide outstanding foundational technology products and services including Global Networking, Cloud, End User Computing, and IT Service Management. It's our goal to always provide an engaging, relevant, and simple experience for our customers.

The Site Reliability Engineer (SRE) – Edge Platform is a key member of the Edge Operations and SRE team within Global Technology Infrastructure & Operations. This role is responsible for ensuring the reliability, scalability, and operational excellence of the Edge computing platform that supports McDonald's global restaurant technology ecosystem.

You will work closely with Architecture, Platform Engineering, and Security teams to implement observability, automation, and incident response strategies that ensure the Edge platform is resilient and maintainable. This is a unique opportunity to influence the operational maturity of a global platform and drive continuous improvement across infrastructure and services.

Responsibilities & Accountabilities:

Operate and maintain Edge platform infrastructure to ensure 24x7x365 availability, reliability, and performance.

Design and implement observability frameworks using tools such as Prometheus, Grafana, Jaeger, and Datadog.

Collaborate with Platform Engineering and Edge Solution Delivery teams to ensure platform features are operable, maintainable, and supportable in production environments.

Develop and maintain runbooks, playbooks, and automation scripts to streamline operations and reduce manual effort.

Lead incident response, root cause analysis, and post-incident reviews to drive continuous improvement.

Participate in capacity planning, performance tuning, and disaster recovery exercises.

Implement and manage CI/CD pipelines and Infrastructure-as-Code (IaC) for operational tooling and automation.

Architect and maintain self-healing and auto-scaling capabilities across Edge clusters.

Partner with security teams to ensure compliance with enterprise standards and implement secure operational practices.

Contribute to platform architecture discussions with a focus on operational readiness and supportability.

Stay current with industry trends in SRE, edge computing, and distributed systems.

Skills and experience required:

Experience in Site Reliability Engineering, DevOps, or Platform Operations.

Experience supporting Edge computing or hybrid cloud environments.

Strong expertise in observability tools (Prometheus, Grafana, Jaeger, Datadog, ELK).

Experience with container orchestration platforms (Kubernetes, GKE) and virtualization technologies.

Proficiency in scripting and automation (Python, Bash, PowerShell).

Hands-on experience with CI/CD tools (GitHub Actions, Jenkins, ArgoCD) and IaC (Terraform).

Solid understanding of cloud platforms (GCP, AWS) and distributed systems.

Strong problem-solving skills and ability to work in a fast-paced, collaborative environment.

Excellent communication and documentation skills.

GCP or AWS certification preferred.

Experience with Agile methodologies is a plus.

Requisition ID: REF9718R_744000118234232…

Posted: April 3rd, 2026