Requirements
- Proven track record architecting and delivering production ML systems at scale in enterprise environments
- Deep expertise with AWS SageMaker (training, processing, pipelines, endpoints, registry) and complementary AWS services
- Expert-level Python and ML frameworks (e.g. PyTorch, TensorFlow, XGBoost)
- Strong thought leadership in MLOps automation, CI/CD for ML, and model lifecycle management
- Advanced experience designing explainability systems, reason codes, and governance artefacts
- Expertise in low‑latency inference architectures and real-time model serving
- Strong grounding in drift detection, telemetry pipelines, observability patterns, and model QA
- Experience shaping ML security practices, including cross‑account IAM, data minimisation, and PII-safe design
- Ability to influence architecture, mentor senior engineers, and set long‑term technical direction
- (Desirable) Experience building or leading feature store adoption
- (Desirable) Background in ranking, search relevance, entity matching, or similarity modelling
- (Desirable) Experience designing or governing multi‑account AWS ML platforms
- (Desirable) Knowledge of distributed training, GPU/accelerator optimisation, and scaling strategies
- (Desirable) Bachelor's degree in a STEM subject, e.g. mathematics, physics, engineering, computer science, or an adjacent field
- (Desirable) Master's or PhD in a STEM subject, or equivalent experience
What the job involves
- We are seeking a Principal Machine Learning Engineer (SageMaker, MLOps, Model Governance & Explainability) to provide technical leadership across the full lifecycle of machine learning systems powering a new matching platform
- This role is accountable for defining ML architecture, establishing engineering standards, driving MLOps maturity, and ensuring that our models are scalable, secure, explainable, and governed to enterprise‑grade standards
- You will contribute to the strategic direction of our ML platform, spanning data pipelines, model development, deployment automation, inference runtime design, telemetry, drift detection, and cross‑account productionisation
- You will mentor engineers, influence product and architectural decisions, and ensure that our ML systems operate reliably at scale, underpinned by a robust governance and compliance framework
- This is a highly hands‑on, highly technical, principal‑level role that combines architectural vision with deep practical expertise in ML engineering and AWS-native MLOps
- Define the end‑to‑end ML architecture for the matching platform, including data pipelines, model training workflows, inference runtimes, and telemetry ecosystems
- Lead adoption of best‑in‑class MLOps patterns, platform tooling, and AWS SageMaker capabilities across training, processing, registry, monitoring, and deployment
- Partner with platform, security, and data engineering teams to implement scalable, lakehouse-oriented feature architectures and enterprise‑grade ML governance
- Champion engineering standards for model quality, documentation, observability, and platform resilience
- Architect highly scalable, production‑ready feature pipelines within lakehouse environments
- Set the technical direction for fallback and resilience strategies (e.g., fallback pipelines)
- Establish and enforce data‑quality guardrails, validation schemas, and monitoring frameworks
- Drive adoption and standards for enterprise feature stores
- Lead the design of ranking, scoring, and similarity models tailored to the matching platform's requirements
- Define model calibration, scoring logic, confidence thresholds, and optimisation strategies
- Mentor teams on advanced ML techniques using frameworks such as PyTorch, TensorFlow, and XGBoost
- Review and approve technical designs for complex modelling workflows
- Establish explainability standards across the ML stack, using SHAP or equivalent frameworks
- Define patterns to generate regulator‑ready reason codes, aligned with compliance requirements
- Ensure explainability artefacts are accurate, robust, and traceable across model versions
- Architect automated training, deployment, and retraining pipelines using AWS SageMaker
- Set standards for model registry usage, automated approvals, and rollback orchestration
- Drive infrastructure-as-code and CI/CD maturity for ML systems across multiple environments
- Lead design of enterprise‑wide weight‑update patterns and lineage‑aware deployment strategies
- Architect low‑latency, high‑throughput inference services that meet strict matching platform SLAs
- Lead the design of secure cross‑account IAM patterns for model consumption
- Own end‑to‑end telemetry design, including scoring metrics, latency, error analytics, and SLOs
- Partner with platform teams to optimise cost, scale, and reliability of inference endpoints
- Define observability standards for feature drift, concept drift, performance degradation, and data integrity
- Lead the creation of dashboards, benchmarks, and automated alerting across the ML ecosystem
- Ensure telemetry pipelines adhere to privacy, data minimisation, and compliance policies
- Drive adoption of proactive failover, shadow‑mode testing, and continuous validation patterns
- Set and enforce ML‑specific security standards including data minimisation, encryption, and PII handling
- Oversee creation of Model Cards, lineage artefacts, and compliance documentation
- Ensure ML systems meet governance standards for auditability, reproducibility, versioning, and traceability
- Collaborate with InfoSec and Risk teams to define ML governance frameworks and secure cross‑environment workflows
- Lead validation strategies using golden datasets, behavioural tests, and benchmark suites
- Architect performance testing for latency‑sensitive inference paths and model hot paths
- Establish standards for A/B testing, shadow deployments, canary rollouts, and controlled experiments
