Role
Role: ML Ops Engineer
Location & Working Arrangements
Location: Hybrid schedule; 2-3 days a week in the office at Thorpe Park, Leeds.
Working hours: Core hours 09:30 – 16:00; you can work around these to suit you.
Salary & Contract
Salary: Dependent on experience (DOE), plus extensive benefits
Contract type: Permanent
Employment type: Full time
Our tech teams keep us running 24/7 to ensure world‑class service for our patients. This role may include participation in an out‑of‑hours rota as required by the business, with a fair scheduling process and additional compensation for on‑call periods.
About Us
We are the nation’s largest online pharmacy, with 25 years of experience, helping over 1.8 million patients in England manage NHS prescriptions from request through to delivery. We are Great Place to Work certified and a certified B Corp, reflecting high standards of social and environmental responsibility. Our people are fundamental to our success as we strive to be a world‑leading, patient‑centric digital healthcare provider and to maintain a positive, open and honest working environment.
Role Overview
The ML Ops Engineer will drive the operation of production‑grade Machine Learning and LLM services on Azure, ensuring models run as reliable, scalable, high‑performing systems. You will own the end‑to‑end MLOps/LLMOps lifecycle, leading CI/CD, deployment automation, monitoring, and incident response. You will work closely with Data Science to turn models into robust production services with governance, observability, and continuous optimisation for fast, safe, and efficient delivery at scale.
What you’ll be doing
Production Deployment & Release Engineering
- Design and operate CI/CD pipelines for ML models and LLM prompt‑flows, covering build, test, validation, deployment, and rollback
- Own model registration and promotion across environments, ensuring traceability, governance, and auditability
- Implement safe deployment strategies (blue/green, canary, champion/challenger)
- Package and deploy containerised inference services and batch pipelines, ensuring repeatability and rapid rollback
Reliability Engineering (Day 2 Operations)
- Run ML and LLM services as production‑grade systems, defining SLOs/SLIs, dashboards, and alerting
- Lead incident response for runtime issues, including triage, mitigation, recovery, and post‑incident reviews
- Develop and maintain operational runbooks covering restart, rollback, secret rotation, and safe‑mode scenarios
- Improve service resilience and reduce MTTR through automation (self‑healing, retries, fallbacks, circuit breakers)
Observability (Service, Data, Model & Cost)
- Implement monitoring for availability, latency, errors, resource usage, and job performance
- Monitor data quality including freshness, volume, completeness, schema drift, and distribution changes
- Monitor model performance, including drift and prediction distribution shifts, and track accuracy where labels exist
- Instrument LLM services for token usage, latency, and safety signals, with clear visibility into cost, quotas, and risks
LLMOps: Lifecycle, Quality & Safety
- Manage prompts and workflows as code, including versioning, code reviews, and automated regression testing
- Own production configuration for LLM deployments, including model updates, limits, and safeguards
- Partner with Data Science and Security to ensure robust safety practices, including PII protection and prompt‑injection testing
Security, Privacy & Governance
- Implement secure access controls, identity management, and secrets handling
- Support production readiness through documentation, monitoring plans, cost models, and audit evidence
- Ensure all changes follow structured governance with clear traceability and reproducibility
Who we’re looking for
- Strong Python engineering skills with experience in ML frameworks (scikit‑learn, PyTorch, TensorFlow) and experiment tracking
- Comfortable working in regulated environments, with experience of privacy, auditability, change control, and handling sensitive data
- Strong DevOps/SRE background: CI/CD, Infrastructure as Code, monitoring and alerting, incident management, reliability engineering
- Hands‑on experience with Docker and Kubernetes (e.g., AKS), including debugging and performance tuning
- Experience with Azure, including Azure Machine Learning (pipelines, registries, endpoints) and Azure Monitor or Log Analytics
- Experience operationalising ML pipelines (training, batch scoring, feature engineering) and preventing training‑serving skew
- Experience implementing safe deployment practices (blue/green or canary) with automated validation
- Understanding of data contracts, schema evolution, and data quality practices, including troubleshooting data drift and missing features
What happens next
Please click apply. If we think you are a good match, we will be in touch to arrange an interview. Applicants must prove they have the right to live in the UK. All successful applicants will be required to undergo a DBS check. Unsolicited agency applications will be treated as a gift.
