Job Description
How You’ll Make an Impact
A subsidiary of Publicis Groupe, Epsilon is a leading provider of multi‑channel marketing services, technologies, and database solutions. We do more than collect and store data, and we might be the most important Internet company you’ve never heard of. Join our team for your chance to work in the digital marketing space and solve meaningful problems on a massive scale, and have fun doing it.
The System and Platform Operations Manager is a technical leadership role responsible for the support, reliability and stability of Epsilon Retail Media production systems, environments, and offerings. The team owns the reliability vision for the company, driving continuous improvement through a combination of development and operations initiatives as well as process excellence. This position and its team have solid‑line responsibility for operations, including deployment, management, monitoring, reporting, troubleshooting, and repair of production systems. Core to the success of the role is providing a premium customer support experience built around a “center of excellence” that allows for a full‑service delivery support cycle.
This role is responsible for managing the Platform Operations Team centralized within a single geo‑region, orchestrating regional teamwork, providing both technical and professional support, and championing the company values. The Platform Operations Engineer works closely with the Engineering team to ensure ongoing system stability and supports the Technical Account Managers from an environmental perspective. The Platform Operations team is responsible for supporting all retailers once they are live. How this team collaborates and liaises with other teams, such as Customer Support, Technical Account Management, Engineering and Customer Success, is critically important.
Responsibilities
What you’ll do:
- Establish and manage operational practices and ensure we design, implement and operate a support model that is fit for purpose for our future.
- Adopt a “Measure Everything” approach to ensure that internal service level objectives and customer service level agreements are exceeded, including executive‑level reporting on operational health metrics such as SLAs, incident resolution, performance, availability, reliability, capacity, etc.
- Take ownership of complex issues related to performance, reliability, and scalability and lead resolution of serious incidents and events, including communications with customers and wider stakeholders.
- Provide insight and expertise on how customers will perceive changes and their impact, to drive customer change management and communication.
- Empower the Delivery teams to release new products, features, updates and fixes quickly, while ensuring platforms remain reliable and stable.
- Work with the wider Engineering, Product, Delivery and Security teams to ensure that appropriate attention is given to production/system reliability.
- Identify the capabilities needed to meet current and emerging business needs of a significant function.
- As subject‑matter expert on the team, maintain understanding of current technology, database management, reliability practices, and future trends through ongoing education, conference attendance and industry press.
Qualifications
Who You Are:
- At least 5 years of hands‑on experience in Site Reliability‑focused positions.
- Strong knowledge of containerization technologies (Docker, Kubernetes).
- Experience with infrastructure as code (Terraform).
- Solid understanding of networking, security, and system architecture.
- Proficient in programming and scripting languages (Java, Golang, Python, Bash, or similar).
- Experience with monitoring and observability tools (Datadog, Prometheus, Grafana).
- Knowledge of database management systems (PostgreSQL, Bigtable).
- Understanding of API and microservices architecture.
- Strong people‑leadership skills with at least a year of leading and driving high‑performance technical teams.
- Experience with operations teams within enterprise environments with knowledge of DevOps, ITIL, Cloud Services, IT Infrastructure and Operations supporting and maintaining production and development environments and building cloud services that are secure, reliable, scalable and observable.
- Experience with establishing Service Delivery strategies that align to new ways of working, including Agile.
- Experience establishing and delivering IT support services in a high‑availability (HA) environment such as 24/7 operations.
Additional Information
We know that we have some of the brightest and most talented employees in the world, and we believe in rewarding them accordingly. If you work here, expect competitive compensation, a great benefits package and endless opportunities to advance your career.
We offer hybrid working opportunities, with our office space located in the iconic Television Centre, White City.
As part of our dedication to enhancing our inclusive and diverse workforce, Epsilon is committed to equal access to opportunity for people without regard to race, age, sex, disability, neurodiversity, sexual orientation, gender identity, pregnancy and maternity, marriage and civil partnership, or religion or belief. We are committed to providing reasonable adjustments for candidates in our application process.
