Site Reliability Engineer (Python)

Company: Mphasis
Location: London
Job Description:

We need an experienced SRE to focus predominantly on automation, optimization, and process re‑engineering using AI for the Market Risk Platform. Strong Python and provable agentic AI delivery experience are required.

Primary Objectives:

  • Eliminate operational toil and recurring manual work through durable automation
  • Re-engineer support/change processes to reduce handoffs, approval friction and rerun complexity
  • Industrialize reliability operations so existing SREs spend less time firefighting and more time engineering

Key Responsibilities (Automation & Process first)

Automation Engineering (Core)

  • Build production-grade automation in Python (tools, services, workflows) to remove repetitive work: environment checks, dependency validation, automated reruns/reprocessing, safe restarts, drift detection, remediation actions, and standardized operational tasks
  • Create self‑service capabilities for common requests (guard‑railed, auditable, repeatable)
  • Implement “automation with safety”: idempotency, dry‑run modes, approval gates where needed, rollback/undo strategies, and clear audit trails

Process Re‑engineering (Core)

  • Map current operational processes (incident/problem/change, release readiness, rerun/recovery, access/entitlements, environment onboarding) and redesign them to remove waste and reduce cycle time
  • Standardize runbooks/playbooks into executable workflows, and reduce tribal knowledge via templates, checklists, and automated pre‑flight controls
  • Define and track operational KPIs (toil hours removed, alert volume reduction, MTTR improvements, change failure rate reduction, rerun time reduction)

Agentic AI

  • Design and implement agentic workflows that take action using tools/runbooks (e.g., diagnostics, evidence gathering, correlation, guided remediation, change‑risk checks, automated rerun orchestration)
  • Put strong controls in place: deterministic fallbacks, human‑in‑the‑loop approvals for risky actions, evaluation harnesses and measurable outcomes.
  • Productionize with monitoring, logging and post‑incident learnings feeding back into the agent/tooling

Observability (enablement for automation)

Required Skills & Experience

  • Senior SRE experience on distributed systems and batch/intraday workloads in a production environment.
  • Strong Python
  • Provable agentic AI experience showing tool integration, guardrails, and an evaluation approach
  • Demonstrated process optimization ability (removing steps/handoffs, standardizing workflows, implementing lightweight controls with metrics)
  • Strong Linux and troubleshooting fundamentals across application/system/network layers
  • Experience working across mixed estates (on‑prem VMs + cloud, with some Kubernetes exposure for operational monitoring/reruns)

Differentiators

  • Exposure to Banking/Finance Market Risk Domains
  • Familiarity with the Athena ecosystem or similar (SecDB, Quartz)

Posted: April 28th, 2026