Site Reliability Engineer

Company: Winston Fox
Location: Greater London
Job Description:

We are looking for a highly skilled engineer with expertise in Python programming, automation, and modern observability practices to help build and operate scalable distributed systems for an award-winning London hedge fund. This role sits at the intersection of platform engineering, AI tooling, and system reliability. You will design automation frameworks, develop AI-assisted engineering tools, and implement observability solutions that provide deep insights into complex distributed architectures.

Responsibilities

  • Design, develop, and maintain robust automation solutions using Python.
  • Build and maintain observability pipelines including metrics, logs, and traces across distributed systems.
  • Develop internal AI-powered tools that enhance engineering productivity and operational intelligence.
  • Implement monitoring, alerting, and diagnostics to improve system reliability, performance, and scalability.
  • Integrate observability platforms with automation workflows and incident response systems.
  • Collaborate with platform, infrastructure, data and development teams to improve system visibility and operational maturity.
  • Design tooling that enables proactive detection, analysis, and remediation of system issues across distributed environments.
  • Contribute to architecture decisions around telemetry, AI-assisted debugging, and automation frameworks.
  • Directly support business users and stakeholders with system analysis, problem management, and technical resolution.

Skills & Experience

  • Strong professional experience with Python development in production environments.
  • Proven experience building automation frameworks, scripts, and developer tooling.
  • Strong experience working with distributed systems and large-scale service architectures.
  • Hands-on experience working with Kubernetes in production environments.
  • Deep understanding of observability practices, including metrics, logs, tracing, and telemetry pipelines.
  • Experience integrating AI or machine learning tooling into engineering workflows.
  • Strong understanding of APIs, microservices, and containerised environments.
  • Experience with CI/CD pipelines and infrastructure automation.
  • Ability to design scalable, maintainable engineering tools.
  • Experience supporting business users directly, coordinating projects and problem resolution with development and infrastructure teams, and taking ownership of projects.

Interesting Technologies

  • Observability: OpenTelemetry, Prometheus, Grafana, Elastic Stack (ELK), Jaeger
  • Automation & CI/CD: GitHub Actions, Jenkins, GitLab CI, Argo Workflows
  • Distributed Systems & Messaging: Kafka, Redis, gRPC

Offer

  • Award-winning, world-class technology environment with best-in-class engineering teams.
  • Fast-paced, low-bureaucracy culture with a get-stuff-done mindset.
  • Up to £150,000 base salary plus a 50%-100% annual cash bonus. Benefits include pension, healthcare, gym, food, and 30 days' holiday.
  • 4 days onsite, 1 day working from home.
  • The chance to shape the future of intelligent automation and operational insight in distributed platforms.


Posted: March 20th, 2026