Location: London, 4 days a week in office (Oxford Street)
This role is with one of Dex’s trusted partner companies. We work closely with their teams to truly understand their culture, goals, and what they’re looking for, so we can match you with the right opportunity and give you context about the role before you commit to a process.
If you’re interested, sign up to Dex to apply.
Dex is an AI recruiter agent that helps you run your job search. Tell Dex your stack, seniority, and what you want to build. We will manage your applications and surface other opportunities that are a fit.
About the company
We’re working with a highly sophisticated technology-driven firm operating at the intersection of software engineering and quantitative research.
From their London headquarters, they bring together engineers and researchers to solve complex, high-stakes problems, with a strong emphasis on rigorous thinking, long-term value creation, and technical excellence.
The company is building a world-class platform designed to support advanced research and real‑time decision‑making, with engineering at the core of its success.
The opportunity
This is a high-impact role focused on production engineering within a complex, distributed environment.
You’ll be responsible for improving system reliability, observability, and operational efficiency across a federated engineering platform. The work directly supports real‑time systems where performance, stability, and change safety are critical.
The role is hands‑on and collaborative, working closely with engineering, infrastructure, and research teams to ensure systems are robust, scalable, and safe to evolve.
What you’ll do
- Improve resilience and performance of real‑time distributed systems
- Identify bottlenecks, reduce operational overhead, and limit failure impact
- Build tooling and frameworks to enable frequent, low‑risk software delivery
- Design and maintain observability systems, including metrics, alerting, and diagnostics
- Develop systems for deployment automation, runtime management, and release readiness
- Drive best practices across reliability, fault tolerance, and service ownership
- Collaborate across multiple engineering teams to align on production standards
- Participate in production support, incident response, and continuous improvement
- Work with application and research teams to define SLAs and runtime boundaries
You should have
- Strong software engineering background, ideally in distributed or real‑time systems
- Experience with containerisation and orchestration (e.g. Kubernetes)
- Familiarity with observability tooling (e.g. Prometheus, Grafana, OpenTelemetry)
- Strong debugging and problem‑solving skills in complex systems
- Experience building or contributing to high‑availability, fault‑tolerant platforms
- Understanding of CI/CD systems and deployment automation
- Ability to operate across multiple teams in a federated environment
- Focus on improving systems through engineering rather than manual processes
Tech environment
- Distributed, real‑time systems at scale
- Containerised infrastructure and orchestration
- Observability stack including metrics, tracing, and alerting systems
- CI/CD pipelines and automated deployment systems
- Strong emphasis on reliability, performance, and operational safety
Why it’s compelling
- Work on complex, high‑impact systems with real‑world consequences
- Strong engineering culture focused on quality and long‑term thinking
- High ownership and influence across platform‑level decisions
- Collaboration with highly technical engineers and researchers
- Competitive compensation with meaningful bonus structure
- Well‑supported working environment with strong benefits
If you’re interested, sign up to Dex to apply.
As part of the recruitment process at Dex, we process your personal data in accordance with our Privacy Notice for Job Applicants. This notice explains how and why your data is collected and used, and how you can contact us if you have any concerns.
