Site Reliability Engineer

Company: Cubestech Ltd
Location: London
Job Description:

Primary Objectives

  • Eliminate operational toil and recurring manual work through durable automation
  • Re‑engineer support/change processes to reduce handoffs, approval friction, and rerun complexity
  • Industrialize reliability operations so existing SREs spend less time firefighting and more time engineering

Key Responsibilities (Automation & Process first)

  • Automation Engineering (Core)
    • Remove repetitive work: environment checks, dependency validation, actions, and standardized operational tasks
    • Create self‑service capabilities for common requests (guard‑railed, auditable, repeatable)
    • Implement automation with safety: idempotency, dry‑run modes, approval gates where needed, rollback/undo strategies, clear audit trails
  • Process Re‑engineering (Core)
    • Map current operational processes (incident/problem/change, release readiness, rerun/recovery, access/entitlements, environment onboarding) and redesign them to remove waste and reduce cycle time
    • Standardize runbooks/playbooks into executable workflows; reduce tribal knowledge via templates, checklists, and automated pre‑flight controls
    • Define and track operational KPIs (toil hours removed, alert volume reduction)
  • Agentic AI
    • Design and implement agentic workflows that take action using tools/runbooks (diagnostics, evidence gathering, correlation, guided remediation, change‑risk checks, automated rerun orchestration)
    • Put strong controls in place: scoped permissions, deterministic fallbacks, human‑in‑the‑loop approvals for risky actions, evaluation harnesses and measurable outcomes
    • Productionize with monitoring, logging and post‑incident learnings feeding back into the agent/tooling
  • Observability (enablement for automation)

Required Skills & Experience

  • Senior SRE experience on distributed systems and batch/intraday workloads
  • Strong Python
  • Demonstrable agentic AI experience, showing tool integration, guardrails, and an evaluation approach
  • Demonstrated process optimization ability (removing steps/handoffs, standardizing workflows, implementing lightweight controls with metrics)
  • Strong Linux and troubleshooting fundamentals across application/system/network layers
  • Experience working across mixed estates (on‑prem VMs + cloud)
  • Exposure to Banking/Finance Market Risk Domains
  • Familiarity with the Athena ecosystem or similar

Posted: May 24th, 2026